
Conversation

@ethanjperez
Contributor

Without this fix, I'm getting near-random validation performance for a trained model, and the validation performance differs per validation run. I think this happens since the `model` variable isn't set with the loaded checkpoint, so I'm using a randomly initialized model. Looking at the model activations, they differ each time I run evaluation (but they don't with this fix).
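
(A minimal sketch of the failure mode described above; the class and variable names are assumed from the GLUE example, not quoted from the PR. Lightning's `load_from_checkpoint` is a classmethod that returns a new, checkpoint-initialized model rather than mutating the instance it is called on, so discarding its return value leaves `model` randomly initialized.)

# Broken: the return value is discarded, so `model` keeps its random init
model = GLUETransformer(args)
model.load_from_checkpoint(checkpoint_path)

# Fixed: bind the checkpoint-initialized model that the classmethod returns
model = GLUETransformer.load_from_checkpoint(checkpoint_path)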
@ethanjperez
Contributor Author

Tagging @srush @nateraw from the original Lightning GLUE PR to check I'm not missing something?

@codecov-io

codecov-io commented Mar 25, 2020

Codecov Report

Merging #3437 into master will increase coverage by 0.04%.
The diff coverage is 88.88%.


@@            Coverage Diff             @@
##           master    #3437      +/-   ##
==========================================
+ Coverage   77.56%   77.60%   +0.04%     
==========================================
  Files         100      100              
  Lines       16970    16967       -3     
==========================================
+ Hits        13162    13167       +5     
+ Misses       3808     3800       -8     
Impacted Files Coverage Δ
src/transformers/data/processors/utils.py 24.68% <88.88%> (+2.94%) ⬆️
src/transformers/modeling_utils.py 91.85% <0.00%> (+0.13%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83272a3...f12d585.

@nateraw
Contributor

nateraw commented Mar 25, 2020

I'll check this out later tonight! I'm on mobile so I've just looked at your commit quickly... looks like you're right. I know in the past I've instantiated the model and then called model.load_from_checkpoint(loaded_ckpt), so what you've got probably gets the same job done. The benefit of doing it the way I just mentioned is that if you already have a model object available from training, you can just load the best checkpoint into that. Either way works though!

@nateraw
Contributor

nateraw commented Mar 25, 2020

That was fast 😄 Looks good to me!

@ethanjperez
Contributor Author

Thanks for checking :) I'm still not able to reproduce my in-training validation performance with the --do_predict flag, though. Any ideas? I'm getting identical validation accuracy across runs now, but the accuracy is still near random.

@nateraw
Contributor

nateraw commented Mar 27, 2020

@ethanjperez I just checked the docs, and it looks like the way we were doing it originally was correct.

model = MyLightningModule.load_from_checkpoint(PATH)
model.eval()
y_hat = model(x)

The way that I was explaining to do it would require you to use torch.load on the checkpoint path, which you would then pass to model.load_state_dict. The above method (what we had originally) is probably supposed to do that for you.

I haven't had the chance to recreate the issue, so I'll have to take a look.
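
(A minimal sketch of that alternative, assuming a standard Lightning checkpoint, which stores the model weights under a `state_dict` key, and an existing `model` object:)

import torch

# Load the raw checkpoint from disk and copy its weights into an
# already-instantiated model, e.g. the one left over from training.
ckpt = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(ckpt["state_dict"])
model.eval()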

@ethanjperez
Contributor Author

Cool, thanks! Even with the original way, I was still not able to reproduce my in-training validation performance (just something to look out for when you try). In particular, I'm loading and running an already-trained model with the --do_predict flag and without the --do_train flag (I don't think you'd see the issue if you use both --do_predict and --do_train).
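
(For illustration, a hypothetical sketch of why the eval-only path is the one that exposes the bug; the flag handling is simplified and not the actual script code:)

if args.do_train:
    trainer.fit(model)  # after this, `model` holds trained weights in memory
if args.do_predict:
    # In an eval-only run nothing has trained `model`, so the checkpoint
    # must be loaded (and bound) explicitly before testing; otherwise the
    # weights are still randomly initialized.
    model = GLUETransformer.load_from_checkpoint(checkpoint_path)
    trainer.test(model)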

Contributor

@sshleifer sshleifer left a comment

Great catch!

@sshleifer sshleifer merged commit e5c393d into huggingface:master Mar 30, 2020
@ethanjperez
Contributor Author

@nateraw @sshleifer Are you guys able to load a trained model successfully with the pytorch-lightning scripts? Even after this patch, I am having issues loading an already trained model, i.e., if I just use --do_eval without also using --do_train

@sshleifer
Contributor

sshleifer commented Apr 16, 2020

Sorry for taking so long. I will try to reproduce this today if there is no update on your end!

Filing an issue with what you ran/expected would help :) @ethanjperez

@ethanjperez
Contributor Author

@sshleifer Just seeing this. Were you able to reproduce the issue? I can't remember the exact command I ran, but it was a standard evaluation command: the same as the training command I used, with a few flags tweaked, e.g. dropping the --do_train flag and adding the --do_eval flag.

@sshleifer
Contributor

This is fixed now.
