Description
Feature request
The Trainer API should track and log the number of tokens seen during training.
While it may sometimes be possible to back out the number of tokens seen from the FLOS, or by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track this directly. It shouldn't be necessary to iterate over the entire training dataset just to compute a token count, yet that is the only token-related metric Trainer currently implements (`Trainer.num_tokens()`).
This can't currently be implemented in a callback, because callbacks only have access to the trainer state, not the training data.
Motivation
Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality, and it is reported for nearly every major LLM release. Any language model developer using Hugging Face would likely want this metric for their training runs -- it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps.
In any case, it's an extremely useful number to have, and it must be tracked during training as the model consumes examples.
Your contribution
I'm willing to contribute this but would like some guidance on the overall design first.
In particular, here's what I think a reasonable implementation would include:
- Add a `global_tokens_seen` (or similar) field to the `TrainerState`. This would add only a single integer value to the `TrainerState`.
- Increment this counter during `Trainer._inner_training_loop()`.
- Probably add this information to the logging outputs.
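To make the proposal concrete, here is a minimal, dependency-free sketch of the bookkeeping those three steps describe. The `global_tokens_seen` field name is the one proposed above; the stand-in training loop, the batch format (dicts carrying a 0/1 `attention_mask`), and the counting helper are all illustrative assumptions, not existing Trainer internals:

```python
from dataclasses import dataclass


@dataclass
class TrainerState:
    """Sketch of the state object; the real TrainerState has many more fields."""
    global_step: int = 0
    global_tokens_seen: int = 0  # proposed new field: a single integer


def count_batch_tokens(batch):
    """Count non-padding tokens in a batch via its attention mask (assumed format)."""
    return sum(sum(row) for row in batch["attention_mask"])


def inner_training_loop(state, dataloader):
    # Stand-in for Trainer._inner_training_loop(): on each step, bump the
    # token counter alongside global_step so both end up in logged state.
    for batch in dataloader:
        state.global_step += 1
        state.global_tokens_seen += count_batch_tokens(batch)
    return state


# Example: two batches; padding positions (mask == 0) are excluded.
batches = [
    {"attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]]},  # 5 real tokens
    {"attention_mask": [[1, 1, 1, 1]]},                # 4 real tokens
]
state = inner_training_loop(TrainerState(), batches)
print(state.global_tokens_seen)  # 9
```

Counting via the attention mask (rather than raw sequence length) keeps padding out of the total, which is what most people mean by "tokens seen".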
What do the folks at HF think about that?