System Info
The example script `transformers/examples/pytorch/text-classification/run_glue_no_trainer.py` has two bugs in its resume-from-checkpoint logic. Specifically, here is the original code from lines 520-521 and lines 530-534:
```python
accelerator.print(f"Resumed from checkpoint: {checkpoint_path}")
accelerator.load_state(path)
```

and

```python
# need to multiply `gradient_accumulation_steps` to reflect real steps
resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
completed_steps = resume_step // args.gradient_accumulation_steps
```
However, the first snippet should be:
```python
accelerator.print(f"Resumed from checkpoint: {checkpoint_path}")
accelerator.load_state(checkpoint_path)
```
That is, we should load from `checkpoint_path` instead of `path`; a short sketch of why is below.
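For context, here is a minimal sketch of why the distinction matters, assuming (as in the surrounding script) that `path` is derived from the checkpoint argument via `os.path.basename` so the step count can be parsed from the folder name; the directory name used here is made up for illustration:

```python
import os

# Hypothetical checkpoint location, for illustration only.
checkpoint_path = "output_dir/step_500"

# The script keeps only the basename so it can parse the step count.
path = os.path.basename(checkpoint_path)          # "step_500"
training_difference = os.path.splitext(path)[0]   # "step_500"

# `load_state(path)` resolves "step_500" relative to the current working
# directory, which generally does not exist; `load_state(checkpoint_path)`
# points at the actual checkpoint folder.
print(path)             # step_500
print(checkpoint_path)  # output_dir/step_500
```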
And the second snippet should be:

```python
# need to multiply `gradient_accumulation_steps` to reflect real steps
resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps
starting_epoch = resume_step // len(train_dataloader)
completed_steps = resume_step // args.gradient_accumulation_steps
resume_step -= starting_epoch * len(train_dataloader)
```

That is, `completed_steps = resume_step // args.gradient_accumulation_steps` must move up, before `resume_step` is reduced to its within-epoch offset.
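To make the ordering bug concrete, here is a small sketch with made-up numbers (a checkpoint named `step_400`, `gradient_accumulation_steps = 2`, and a dataloader of length 300); only the corrected order recovers the 400 completed optimizer steps encoded in the checkpoint name:

```python
# Made-up values for illustration only.
gradient_accumulation_steps = 2
dataloader_len = 300                 # stands in for len(train_dataloader)
training_difference = "step_400"     # checkpoint folder name

# Batches processed so far (optimizer steps * accumulation factor).
resume_step = int(training_difference.replace("step_", "")) * gradient_accumulation_steps
starting_epoch = resume_step // dataloader_len                # 800 // 300 = 2

# Correct order: derive completed optimizer steps from the global batch count.
completed_steps = resume_step // gradient_accumulation_steps  # 800 // 2 = 400

# Only then reduce resume_step to the offset within the current epoch.
resume_step -= starting_epoch * dataloader_len                # 800 - 600 = 200

# The buggy order computed completed_steps from the per-epoch offset instead:
# 200 // 2 = 100, which disagrees with the 400 in the checkpoint name.
assert completed_steps == 400
```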
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
See above
Expected behavior
I have described the bugs and how to fix them above. Thanks a lot!