WanDB callback fails on training end when eval dataset is provided #34701

@eyalmazuz

System Info

  • transformers version: 4.46.2
  • Platform: Linux-5.14.0-427.22.1.el9_4.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

(I reduced the code to the relevant parts)

    train_args = TrainingArguments(
        num_train_epochs=50,
        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=3,
        report_to="wandb",
        run_name=name,
    )

    trainer = Trainer(
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

The issue occurs when reporting to WandB. At the end of training, the callback's on_train_end creates a fake trainer at the following line of code:

    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)

The fake trainer receives the same training arguments as the real trainer, but no datasets are passed to it. Because my script sets eval_strategy to something other than "no", and because WandB reporting is enabled, the fake trainer's constructor throws the following error at the end of training:

  File "/home/mazuze/NLP/Hebrew-LLM-Eval/sentence_ordering/train_model.py", line 278, in main
    trainer.train()
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2635, in _inner_training_loop
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 471, in on_train_end
    return self.call_event("on_train_end", args, state, control)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 518, in call_event
    result = getattr(callback, event)(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 919, in on_train_end
    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 418, in __init__
    raise ValueError(
ValueError: You have set `args.eval_strategy` to IntervalStrategy.EPOCH but you didn't pass an `eval_dataset` to `Trainer`. Either set `args.eval_strategy` to `no` or pass an `eval_dataset`.
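
To make the failure path easier to follow, here is a minimal, standard-library-only sketch of the mechanism. It does not use transformers at all: ToyArgs, ToyTrainer and toy_on_train_end are hypothetical stand-ins that mimic only the constructor check seen in the traceback, so treat it as an illustration under those assumptions rather than the library's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToyArgs:
    """Hypothetical stand-in for TrainingArguments (only the relevant field)."""
    eval_strategy: str = "no"


class ToyTrainer:
    """Hypothetical stand-in for Trainer; mimics only the constructor check."""

    def __init__(self, args: ToyArgs, eval_dataset: Optional[Sequence] = None):
        # Same kind of validation the traceback shows failing in Trainer.__init__.
        if args.eval_strategy != "no" and eval_dataset is None:
            raise ValueError(
                f"eval_strategy is {args.eval_strategy!r} but no eval_dataset was passed"
            )
        self.args = args
        self.eval_dataset = eval_dataset


def toy_on_train_end(args: ToyArgs) -> ToyTrainer:
    """Mimics the callback: reuses the user's args but passes no datasets."""
    return ToyTrainer(args=args)


args = ToyArgs(eval_strategy="epoch")

# The user's own trainer is fine because it received an eval dataset.
real_trainer = ToyTrainer(args=args, eval_dataset=[1, 2, 3])

# The helper trainer built at the end of training did not, so it fails.
try:
    toy_on_train_end(args)
    failure = None
except ValueError as exc:
    failure = str(exc)

print(failure)
```

The point of the sketch is that the user's configuration is valid on its own; the error only appears because a second trainer is built from the same arguments without the dataset those arguments require.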

Expected behavior

Training should finish without raising an exception, and the callback's on_train_end step should run successfully.
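
One possible direction, sketched with toy classes only: build the throwaway trainer from a copy of the arguments with evaluation switched off, so the user's own settings stay untouched. ToyArgs, ToyTrainer and make_helper_trainer are hypothetical names for illustration, and this is not necessarily how the library should or does resolve the issue.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyArgs:
    """Hypothetical stand-in for TrainingArguments."""
    eval_strategy: str = "no"
    run_name: str = "demo"


class ToyTrainer:
    """Hypothetical stand-in for Trainer with the same style of check."""

    def __init__(self, args, eval_dataset=None):
        if args.eval_strategy != "no" and eval_dataset is None:
            raise ValueError("eval_strategy is set but no eval_dataset was passed")
        self.args = args


def make_helper_trainer(args):
    """Build a dataset-free helper trainer without tripping the check."""
    # Shallow-copy so the caller's arguments are not mutated.
    helper_args = copy.copy(args)
    # The helper never evaluates, so evaluation can be disabled on the copy.
    helper_args.eval_strategy = "no"
    return ToyTrainer(args=helper_args)


user_args = ToyArgs(eval_strategy="epoch", run_name="my-run")
helper = make_helper_trainer(user_args)  # no exception raised
```

Copying before mutating matters here: changing eval_strategy on the shared object would silently alter the arguments the real trainer still holds.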
