
Conversation

@ggerganov (Member)

fix #14695 (comment)

  • Remove slot_params.ignore_eos
  • When ignore_eos is passed in the request, all EOG tokens are ignored by adding a -INF logit bias for them to the sampler (see the toy sketch below)
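To make the effect of that bias concrete, here is a toy, self-contained sketch (not llama.cpp code; the logit values are made up) showing that a token whose logit is biased to -INFINITY ends up with exactly zero probability after softmax, so the sampler can never pick it:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // pretend index 2 is an EOG token; the other logits are arbitrary
    std::vector<float> logits = {2.0f, 1.0f, 0.5f};
    logits[2] += -INFINITY; // the bias the server adds for every EOG token

    // max-subtracted softmax: exp(-inf) == 0, so the biased token gets p = 0
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    std::vector<float> probs(logits.size());
    for (size_t i = 0; i < logits.size(); i++) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (size_t i = 0; i < logits.size(); i++) {
        std::printf("token %zu: p = %.3f\n", i, probs[i] / sum);
    }
    return 0;
}

The server-side loop reviewed below applies exactly this bias to every EOG token in the vocabulary.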

@JohannesGaessler (Collaborator) left a comment

The changes seem correct to me and I can confirm that I am now getting the expected number of tokens when setting ignore_eos.

@ggerganov merged commit 538cc77 into master on Jul 16, 2025
52 of 56 checks passed
Comment on lines +476 to +481
for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
    if (llama_vocab_is_eog(vocab, i)) {
        //SRV_DBG("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(ctx, i).c_str(), -INFINITY);
        params.sampling.logit_bias.push_back({i, -INFINITY});
    }
}
Member

If this is done for every token during generation, I suspect it is going to have a significant performance impact.

@ggerganov (Member Author)

It's done once per completion request, at the beginning, when the input JSON parameters are processed.

Collaborator

If performance is a concern, we could provide a list of EoG tokens to iterate over instead of iterating over all tokens and checking whether each one is EoG. Although I think iterating over all tokens once per request is negligible compared to iterating over all tokens once per generated token, as is already done during sampling.
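A minimal sketch of that suggestion, assuming a hypothetical helper that collects the EOG token ids once per model load (collect_eog_tokens and apply_ignore_eos are illustrative names, not an existing llama.cpp API; llama_vocab_n_tokens, llama_vocab_is_eog, and the logit-bias entries follow the hunk above):

#include <cmath>
#include <vector>

#include "llama.h"

// Hypothetical: build the list of EOG token ids once, e.g. when the model is loaded.
static std::vector<llama_token> collect_eog_tokens(const llama_vocab * vocab) {
    std::vector<llama_token> eog;
    for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
        if (llama_vocab_is_eog(vocab, i)) {
            eog.push_back(i);
        }
    }
    return eog;
}

// Per request, only the handful of EOG tokens is touched instead of scanning the whole vocabulary.
static void apply_ignore_eos(std::vector<llama_logit_bias> & logit_bias,
                             const std::vector<llama_token> & eog_tokens) {
    for (llama_token t : eog_tokens) {
        logit_bias.push_back({t, -INFINITY});
    }
}

Given the ~0.3 ms per-request cost mentioned below, the practical gain is likely small.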

Member

On my system this takes about 0.3 ms for a model with a 150k-token vocabulary, so I suppose it is not that bad.

@ggerganov (Member Author)

I fixed this anyway: #14721
