Conversation

@muellerzr
Contributor

What does this PR do?

This PR adds num_tokens_seen to the TrainerState, allowing users to know how many tokens were passed in an individual batch. It uses gather to ensure that this count is also correct under DDP.

Fixes #27027
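A minimal sketch of the idea described above (not the exact Trainer code): count the tokens of the batch's main input on each process, then reduce across processes so the count is correct under DDP. The function name `count_batch_tokens` and the use of `all_reduce` rather than `gather` are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def count_batch_tokens(batch, main_input_name="input_ids"):
    """Count tokens in one batch; sum across processes if DDP is active.

    Sketch only: the real Trainer reads ``model.main_input_name`` to find
    the token tensor, then combines per-process counts so
    ``num_tokens_seen`` is accurate in distributed training.
    """
    tokens = torch.tensor(batch[main_input_name].numel(), dtype=torch.long)
    if dist.is_available() and dist.is_initialized():
        # Sum the per-process counts onto every rank.
        dist.all_reduce(tokens, op=dist.ReduceOp.SUM)
    return tokens.item()

# Single-process example: a batch of 4 sequences of length 128.
batch = {"input_ids": torch.zeros(4, 128, dtype=torch.long)}
print(count_batch_tokens(batch))  # 512
```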

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pacman100 @amyeroberts

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr
Contributor Author

@amyeroberts question: these failures are all because they cannot find the main_input_name in the batch. Does this mean some of the models are wrong?

For instance, one of the failures in the examples script comes from SpeechEncoderDecoderModel, which declares main_input_name as "inputs"; however, when running the script, the batch actually uses `"input_values"`:

self = {'input_values': tensor([[-0.0177, -0.0188, -0.0202,  ..., -0.0032, -0.0068, -0.0039],
        [-0.0196, -0.0556, -0.0...-100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100]])}
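The failure mode above can be illustrated with a tiny hypothetical example: the model declares one main_input_name, but the collated batch uses a different key, so a direct lookup fails (the key names below mirror the SpeechEncoderDecoderModel case; the surrounding code is illustrative).

```python
# The model declares "inputs", but the collator produced "input_values".
main_input_name = "inputs"
batch = {"input_values": [[-0.0177, -0.0188]], "labels": [[-100, -100]]}

try:
    tokens = batch[main_input_name]
except KeyError:
    # The Trainer cannot locate the main input, so it cannot
    # count tokens for this batch.
    tokens = None

print(tokens is None)  # True
```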

Contributor

@amyeroberts amyeroberts left a comment


Thanks for adding this!

My main comment is about whether this should run by default.

@amyeroberts
Contributor

@muellerzr I think we don't have to worry about that if it's behind a flag in the training args. This way, if someone wants to measure it, they have to make sure that main_input_name is properly set, but it won't break the Trainer for models which are currently compatible but don't have this correctly set.
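The opt-in pattern suggested here can be sketched as follows. The flag name `include_num_input_tokens_seen` and the helper `training_step` are illustrative stand-ins, not necessarily the exact names that landed in TrainingArguments:

```python
from dataclasses import dataclass, field

@dataclass
class SketchTrainingArgs:
    # Hypothetical stand-in for the real TrainingArguments flag.
    # Counting is off by default, so models with an incorrect
    # main_input_name keep working unless the user opts in.
    include_num_input_tokens_seen: bool = field(default=False)

def training_step(args, state, batch, main_input_name="input_ids"):
    """Update the token count only when the user has opted in."""
    if args.include_num_input_tokens_seen:
        if main_input_name not in batch:
            raise ValueError(f"{main_input_name!r} not found in batch")
        state["num_tokens_seen"] += sum(len(row) for row in batch[main_input_name])
    return state

state = {"num_tokens_seen": 0}
batch = {"input_ids": [[1, 2, 3]]}
state = training_step(SketchTrainingArgs(), state, batch)
print(state["num_tokens_seen"])  # 0: the flag is off by default
```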

@muellerzr muellerzr requested a review from amyeroberts November 7, 2023 20:30
Contributor

@amyeroberts amyeroberts left a comment


Thanks for adding and iterating!

Just some nits on the implementation

@muellerzr
Contributor Author

@amyeroberts the failing tests seem to be unrelated.

Contributor

@amyeroberts amyeroberts left a comment


Thanks!

Co-authored-by: amyeroberts <[email protected]>
@muellerzr muellerzr merged commit 2fc33eb into main Nov 14, 2023
@muellerzr muellerzr deleted the tokens-seen branch November 14, 2023 20:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* Add tokens seen

* Address comments, add to TrainingArgs

* Update log

* Apply suggestions from code review

Co-authored-by: amyeroberts <[email protected]>

* Use self.args

* Fix docstring

Co-authored-by: amyeroberts <[email protected]>

---------

Co-authored-by: amyeroberts <[email protected]>
Successfully merging this pull request may close these issues.

Count of tokens seen during training in Trainer