
Conversation

@null-pointer-access
Contributor

What does this PR do?

In the current GPT2 implementation, the LMHead module processes all tokens during prefill, even though only the final token’s output is used for generation. This PR aligns the behavior with LlamaModel by computing the LMHead output only for the last token, reducing unnecessary computation during prefill.

Fixes #38977
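
For illustration, here is a minimal sketch of the technique (not the exact diff from this PR): following the `logits_to_keep` convention used by `LlamaForCausalLM`, the hidden states are sliced before the LM head so that prefill only projects the final position through the vocabulary matrix. `ToyCausalLMHead` and its parameters are hypothetical names for this sketch, not code from the PR.

```python
import torch
import torch.nn as nn

class ToyCausalLMHead(nn.Module):
    """Hypothetical toy module showing hidden-state slicing before the LM head."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, logits_to_keep: int = 0) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size).
        # logits_to_keep=0 keeps every position (e.g. for training loss);
        # during prefill the caller passes 1, so only the last token's
        # logits are computed.
        slice_indices = slice(-logits_to_keep, None) if logits_to_keep > 0 else slice(None)
        return self.lm_head(hidden_states[:, slice_indices, :])

# Prefill with a 128-token prompt: only one position goes through the projection.
head = ToyCausalLMHead(hidden_size=16, vocab_size=100)
h = torch.randn(2, 128, 16)
logits = head(h, logits_to_keep=1)
assert logits.shape == (2, 1, 100)
```

For a long prompt this replaces a `(seq_len, hidden_size) @ (hidden_size, vocab_size)` matmul with a single-row projection, which is where the prefill savings come from.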


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp
Member

Perfect, looks good to me!

It would be nice to do a pass over the older common models and update them as well; we'll leave that to later PRs :)

@zucchini-nlp enabled auto-merge (squash) on June 25, 2025 at 08:17
@zucchini-nlp merged commit 7b38073 into huggingface:main on Jun 25, 2025
20 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.



Successfully merging this pull request may close these issues.

LMHead is processing redundant tokens in prefill
