Count of tokens seen during training in Trainer #27027

@jpgard

Description

Feature request

The Trainer API should track and log the number of tokens seen during training.

While it might sometimes be possible to back out the number of tokens seen from the FLOS, or by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track this directly. It shouldn't be necessary to iterate over a model's entire training data just to compute a token count, which is what the only current token-related utility in Trainer, Trainer.num_tokens(), does.

This can't currently be implemented in a Callback, because callbacks don't have access to the training data (only the trainer state).

Motivation

The number of tokens seen is an essential metric tracked in nearly every LLM training run, and it is widely considered one of the fundamental drivers of model quality (it is reported for nearly every major LLM release). Any language model developer using Hugging Face would likely want this metric for their training runs -- it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps.

In any case, it's an extremely useful number to have, and it must be tracked during training as the model consumes examples.

Your contribution

I'm willing to contribute this but would like some guidance on the overall design first.

In particular, here's what I think a reasonable implementation would include:

  • Add a global_tokens_seen or similar to the TrainerState. This would add only a single integer value to the TrainerState.
  • Increment this during Trainer._inner_training_loop()
  • Probably add this information to the logging outputs
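To make the proposal concrete, here is a minimal, self-contained sketch of the counting logic the bullets above describe. The field name `global_tokens_seen`, the simplified `TrainerState`, and the loop structure are assumptions from this proposal, not the actual transformers internals; the real change would live inside `Trainer._inner_training_loop()`.

```python
# Hypothetical sketch of the proposed token counting -- NOT the real
# transformers API. `global_tokens_seen` is the field name proposed above.
from dataclasses import dataclass


@dataclass
class TrainerState:
    global_step: int = 0
    global_tokens_seen: int = 0  # proposed new field


def count_batch_tokens(batch):
    """Count non-padding tokens via the attention mask when present,
    falling back to the raw sequence lengths otherwise."""
    mask = batch.get("attention_mask")
    if mask is not None:
        return sum(sum(row) for row in mask)
    return sum(len(row) for row in batch["input_ids"])


def inner_training_loop(state, dataloader):
    """Stand-in for Trainer._inner_training_loop(): increment the
    counter once per batch, alongside the existing step counter."""
    for batch in dataloader:
        # ... forward / backward / optimizer step would happen here ...
        state.global_step += 1
        state.global_tokens_seen += count_batch_tokens(batch)
        # Logging callbacks could then read state.global_tokens_seen.
    return state


# Usage: two toy batches, one padded (with a mask) and one unpadded.
batches = [
    {"input_ids": [[1, 2, 3, 0], [4, 5, 0, 0]],
     "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]]},  # 3 + 2 = 5 tokens
    {"input_ids": [[7, 8], [9, 10]]},                   # no mask: 4 tokens
]
state = inner_training_loop(TrainerState(), batches)
print(state.global_tokens_seen)  # 9
```

Counting via the attention mask (rather than raw tensor shape) avoids inflating the metric with padding tokens, which matters for runs with variable-length batches.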

What do the folks at HF think about that?
