eval_loss not found when training a peft model using trainer.py #33420

@ChintanShahDS

System Info

transformers version: 4.43.3
Python 3.10.12
Ubuntu

The issue is in trainer.py: when the model is wrapped by PEFT, it does not check the base_model to get the label information.

I have fixed this locally, but am raising it so it can be fixed in the branch.
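To illustrate the suspected cause, here is a minimal sketch with no dependency on transformers or peft. It assumes that label names are discovered by inspecting the parameters of the model class's forward() method, and that a PEFT wrapper exposes only a generic forward(*args, **kwargs). The classes ViTForImageClassificationStub and PeftWrapperStub and the helpers find_label_names and label_names_for are hypothetical stand-ins, not the library's actual code.

```python
import inspect


class ViTForImageClassificationStub:
    """Hypothetical stand-in for a base model whose forward() accepts `labels`."""

    def forward(self, pixel_values=None, labels=None):
        return {"loss": 0.0 if labels is not None else None}


class PeftWrapperStub:
    """Hypothetical stand-in for a PEFT wrapper with a generic forward()."""

    def __init__(self, base_model):
        self.base_model = base_model

    def forward(self, *args, **kwargs):
        return self.base_model.forward(*args, **kwargs)

    def get_base_model(self):
        return self.base_model


def find_label_names(model_class):
    # Assumed discovery rule: any forward() parameter whose name contains "label".
    params = inspect.signature(model_class.forward).parameters
    return [name for name in params if "label" in name]


def label_names_for(model):
    # Sketch of the fix: unwrap to the base model before inspecting forward().
    target = model.get_base_model() if hasattr(model, "get_base_model") else model
    return find_label_names(type(target))


base = ViTForImageClassificationStub()
wrapped = PeftWrapperStub(base)

print(find_label_names(type(wrapped)))  # wrapper signature hides `labels`
print(label_names_for(wrapped))         # unwrapping recovers it
```

Inspecting the wrapper class yields an empty list, while unwrapping first yields ['labels'], which is the kind of check the local fix adds.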

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run a Vision Transformer training with LoRA.

Expected behavior

eval_loss should be present in the evaluation metrics. Instead, training fails with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1341, in _save_checkpoint
    metric_value = metrics[metric_to_check]
KeyError: 'eval_loss'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vision/run_image_classification.py", line 510, in <module>
    main()
  File "/home/vision/run_image_classification.py", line 480, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1052, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, _grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1343, in _save_checkpoint
    raise KeyError(
KeyError: "The metric_for_best_model training argument is set to 'eval_loss', which is not found in the evaluation metrics. The available evaluation metrics are: ['eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'epoch', 'memory_allocated (GB)', 'max_memory_allocated (GB)', 'total_memory_available (GB)']. Consider changing the metric_for_best_model via the TrainingArguments."
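The missing metric can be modelled with a simplified sketch. It assumes the evaluation loop only records a loss for a batch when every configured label name is present in the inputs, so an empty label list means no loss is ever collected and eval_loss never reaches the metrics dict. The functions has_labels and evaluate below are hypothetical simplifications, not the trainer's real implementation.

```python
def has_labels(inputs, label_names):
    # With no known label names, a batch is treated as unlabeled.
    if len(label_names) == 0:
        return False
    return all(inputs.get(name) is not None for name in label_names)


def evaluate(batches, label_names):
    # Collect a loss only for batches that are recognized as labeled.
    losses = [b["loss"] for b in batches if has_labels(b["inputs"], label_names)]
    metrics = {"eval_runtime": 0.0}  # placeholder value for illustration
    if losses:
        metrics["eval_loss"] = sum(losses) / len(losses)
    return metrics


batches = [
    {"inputs": {"pixel_values": [0.1], "labels": [1]}, "loss": 0.25},
    {"inputs": {"pixel_values": [0.2], "labels": [0]}, "loss": 0.75},
]

print(evaluate(batches, []))          # no eval_loss key
print(evaluate(batches, ["labels"]))  # eval_loss is reported
```

If this model of the behavior holds, setting label_names=["labels"] explicitly in the training arguments may serve as a workaround until the label lookup handles PEFT models; this has not been verified against the setup in this report.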
