Description
Feature request
The Trainer API should track and log the number of tokens seen during training.
While it may sometimes be possible to back out the number of tokens seen from the FLOS, or by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track this directly. It shouldn't be necessary to iterate over the entire training dataset just to compute a token count, yet that is the only token-related metric Trainer currently implements (`Trainer.num_tokens()`).
This can't currently be implemented in a callback, because callbacks only have access to the trainer state, not the training data.
Motivation
Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality, and it is reported for nearly every major LLM release. Any language model developer using Hugging Face would likely want this metric for their training runs -- it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps.
In any case, it's an extremely useful number to have, and it must be tracked during training as the model consumes examples.
Your contribution
I'm willing to contribute this but would like some guidance on the overall design first.
In particular, here's what I think a reasonable implementation would include:
- Add a `global_tokens_seen` (or similar) field to the `TrainerState`. This would add only a single integer value to the `TrainerState`.
- Increment this counter during `Trainer._inner_training_loop()`.
- Probably add this information to the logging outputs.
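To make the proposal concrete, here is a minimal, dependency-free sketch of the bookkeeping those three steps describe. The `global_tokens_seen` field name is the one proposed above; the stand-in training loop, the batch format (dicts carrying a 0/1 `attention_mask`), and the counting helper are all illustrative assumptions, not existing Trainer internals:

```python
from dataclasses import dataclass


@dataclass
class TrainerState:
    """Sketch of the state object; the real TrainerState has many more fields."""
    global_step: int = 0
    global_tokens_seen: int = 0  # proposed new field: a single integer


def count_batch_tokens(batch):
    """Count non-padding tokens in a batch via its attention mask (assumed format)."""
    return sum(sum(row) for row in batch["attention_mask"])


def inner_training_loop(state, dataloader):
    # Stand-in for Trainer._inner_training_loop(): on each step, bump the
    # token counter alongside global_step so both end up in logged state.
    for batch in dataloader:
        state.global_step += 1
        state.global_tokens_seen += count_batch_tokens(batch)
    return state


# Example: two batches; padding positions (mask == 0) are excluded.
batches = [
    {"attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]]},  # 5 real tokens
    {"attention_mask": [[1, 1, 1, 1]]},                # 4 real tokens
]
state = inner_training_loop(TrainerState(), batches)
print(state.global_tokens_seen)  # 9
```

Counting via the attention mask (rather than raw sequence length) keeps padding out of the total, which is what most people mean by "tokens seen".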
What do the folks at HF think about that?