Resume-from-checkpoint functionality of the PyTorch example has some bugs that need to be fixed. #25998

@MingxuanZhangPurdue

Description

System Info

The example script transformers/examples/pytorch/text-classification/run_glue_no_trainer.py has two bugs in its resume-from-checkpoint logic. Below is the original code, taken from lines 520-521 and lines 530-534:

accelerator.print(f"Resumed from checkpoint: {checkpoint_path}")
accelerator.load_state(path)
# need to multiply `gradient_accumulation_steps` to reflect real steps
resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
completed_steps = resume_step // args.gradient_accumulation_step

However, the first two lines should be:

accelerator.print(f"Resumed from checkpoint: {checkpoint_path}")
accelerator.load_state(checkpoint_path)

That is, the state should be loaded from checkpoint_path (the same path that the log message prints) instead of path. The remaining lines should be:

# need to multiply `gradient_accumulation_steps` to reflect real steps
resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps
starting_epoch = resume_step // len(train_dataloader)
completed_steps = resume_step // args.gradient_accumulation_steps
resume_step -= starting_epoch * len(train_dataloader)

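Here the line completed_steps = resume_step // args.gradient_accumulation_steps is moved up so that it runs before resume_step is modified. At that point resume_step still counts all batches since the start of training, so completed_steps reflects the total number of optimizer updates. In the original order, resume_step has already been reduced to the offset within the current epoch, so completed_steps only counts the updates made in that epoch.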
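The effect of the ordering can be checked with plain arithmetic, without running the training script. The sketch below is only an illustration: the helper names resume_state and resume_state_original_order and the parameter batches_per_epoch (standing in for len(train_dataloader)) are hypothetical and do not appear in run_glue_no_trainer.py.

```python
def resume_state(training_difference, gradient_accumulation_steps, batches_per_epoch):
    """Hypothetical helper using the corrected ordering proposed in this issue."""
    # "step_N" counts optimizer updates; multiply to get the number of batches seen.
    resume_step = int(training_difference.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // batches_per_epoch
    # Total optimizer updates so far, computed while resume_step is still the global count.
    completed_steps = resume_step // gradient_accumulation_steps
    # Reduce resume_step to the number of batches to skip inside the resumed epoch.
    resume_step -= starting_epoch * batches_per_epoch
    return starting_epoch, resume_step, completed_steps


def resume_state_original_order(training_difference, gradient_accumulation_steps, batches_per_epoch):
    """Hypothetical helper reproducing the original ordering, for comparison only."""
    resume_step = int(training_difference.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // batches_per_epoch
    resume_step -= starting_epoch * batches_per_epoch
    # Computed from the within-epoch offset, so earlier epochs are not counted.
    completed_steps = resume_step // gradient_accumulation_steps
    return starting_epoch, resume_step, completed_steps


# Assumed example values: checkpoint "step_130", 2 accumulation steps, 100 batches per epoch.
print(resume_state("step_130", 2, 100))                 # (2, 60, 130)
print(resume_state_original_order("step_130", 2, 100))  # (2, 60, 30)
```

With these assumed values both orderings agree on the starting epoch and the number of batches to skip, but only the corrected ordering recovers the 130 optimizer updates encoded in the checkpoint name.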

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See above

Expected behavior

I have described the bugs and how to fix them above. Thanks a lot!
