[gpt-oss] Fix: No response when using stream & tools #24473
Conversation
Code Review
This pull request aims to fix an issue where gpt-oss models stop responding during frequent tool usage by correcting the prompt format. The changes adjust how tool call messages are constructed and rendered. My review identified a critical bug in the new tool call formatting which omits a required part of the syntax, and a high-severity performance issue from redundant encoding operations inside a loop. Addressing these points will help ensure the fix is both correct and efficient.
vllm/entrypoints/harmony_utils.py
Outdated
This doesn't seem to be a valid channel?
The harmony library function wasn't adding this data properly, so as a temporary workaround it was added to the channel. If it isn't added, the data is passed to gpt-oss in the 'before' format, and as that content accumulates, errors occur in the output structure.
Using:
with_recipient(f"functions.{name}")
the rendered data is:
<|start|>assistant to=functions.{name}<|channel|>
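For context, a minimal sketch of the two ways such a tool-call message can be built with the openai_harmony Python API (the tool name project_tree and the folding of the recipient into the channel string are illustrative of the workaround as I understand it from this thread, not the exact vLLM code):

```python
from openai_harmony import Message, Role

name = "project_tree"          # illustrative tool name
args = '{"max_depth": 10}'     # illustrative tool arguments

# Stock behaviour: attach the recipient to the message. Harmony renders it in
# the start tag, i.e. "<|start|>assistant to=functions.project_tree<|channel|>commentary ..."
msg_before = (
    Message.from_role_and_content(Role.ASSISTANT, args)
    .with_recipient(f"functions.{name}")
    .with_channel("commentary")
)

# Temporary workaround: fold the recipient into the channel string so it is
# rendered after "<|channel|>", matching what the model itself emits, i.e.
# "<|start|>assistant<|channel|>commentary to=functions.project_tree ..."
msg_after = (
    Message.from_role_and_content(Role.ASSISTANT, args)
    .with_channel(f"commentary to=functions.{name}")
)
```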
As errors accumulate, output like the following is produced (gpt-oss failure examples):
"<|start|>assistant<|channel|>commentary to=functions.name>{}<|call|>"
"<|start|>assistant<|channel|>analysis to=functions.file_read <|constrain|>json<|message|>{}<|call|>"
this sounds more like an issue in harmony library. could you raise an issue there?
could you add a unit test for the encoding result?
Also CC @yeqcharlotte
vllm/entrypoints/harmony_utils.py
Outdated
This will evaluate to false for chats with no assistant messages yet, preventing "Calls to these tools must go to the commentary channel: 'functions'." from being added to the system prompt for requests that define custom tools in the functions namespace. This leads to the same issue of the model putting tool calls in the analysis channel, often using incorrect arguments or forgetting to define a tool recipient.
I think you want to check for a developer message that includes a functions namespace definition instead.
As you suggested, I modified it to check for tool usage via msg.author.role == Role.DEVELOPER, so it can be verified without requiring an assistant message to exist yet. Thank you!
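A minimal sketch of that check, assuming the openai_harmony message objects; the helper name is hypothetical, and it deliberately skips verifying that the developer message actually defines a functions namespace:

```python
from openai_harmony import Role

def request_uses_custom_tools(messages) -> bool:
    # Custom tool definitions are rendered into the developer message, so the
    # presence of a developer-authored message is used as the signal that the
    # commentary-channel instruction must be added to the system prompt.
    # No assistant message needs to exist yet.
    return any(msg.author.role == Role.DEVELOPER for msg in messages)
```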
yeqcharlotte
left a comment
thanks for the fix @levunet! could you share some example prompts that show the results before and after the fix and that reproduce your errors?
also, any tool eval results with aime or gpqa on it?
cc: @alecsolder
@yeqcharlotte I've completed the rebase and fixed some bugs. I'll request another review after creating the unit tests.
This commit addresses an issue where the harmony library passes data to the model in a different format than what the model outputs, causing the model to become confused and stop responding when using /v1/chat/completions with stream and tools.
The fix updates the message format to match the model's expected output:
- Move recipient info from the assistant start tag to the channel tag
- Change content type from 'json' to '<|constrain|>json'
- Replace the <|end|> token with the <|call|> token for tool calls
This is a temporary fix until the underlying format mismatch is properly resolved.
Signed-off-by: kyt <[email protected]>
I tried to add unit tests but couldn't, because I found it difficult to verify errors that occur when the model uses tools creatively depending on its temperature setting. My current PR cannot be considered a complete fix for the gpt-oss tool usage bug. Still, if you lower the temperature to around 0.6, it definitely uses tools better than before, but I don't think it's good code. @yeqcharlotte, what do you think would be the best way to handle this PR?
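For what it's worth, an encoding-level test would at least be deterministic (it doesn't depend on sampling temperature). A rough sketch, assuming the openai_harmony Python API; the decode helper name is an assumption and may differ between versions:

```python
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

def test_tool_call_render_matches_model_output_format():
    enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
    convo = Conversation.from_messages([
        Message.from_role_and_content(Role.ASSISTANT, '{"max_depth": 10}')
        .with_channel("commentary")
        .with_recipient("functions.project_tree")
        .with_content_type("json"),
    ])
    tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
    text = enc.decode_utf8(tokens)  # assumed helper; adjust to the installed version
    # With the fix, the recipient should appear after <|channel|>, as the model emits it.
    assert "<|channel|>commentary to=functions.project_tree" in text
```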
Hey @levunet, a lot of the things in here are things I have noticed as well.
Header element ordering
I have noticed that the harmony library renders elements in headers in a different order than the model outputs as well. They mention it in one place in their documentation here.
The Harmony library is opinionated and always renders them in one order, which also happens to be different than how the model usually outputs the headers. But from my testing that has not impacted tool calling consistency.
Unexpected tokens in harmony header
I'd have to look again for specifics in the rust code, but the Harmony library parses headers in a very specific way (generally).
I have seen this behavior before, and it was actually not due to tokenization/formatting and was because of there being an issue in the specific attention backend that was being used to run the model. If you can provide information on the configuration + hardware you're using when you're seeing this error that can help in case it is the same one! It was only an issue on certain hardware types.
Overall changes
I do not believe vLLM should adjust render_for_completion functionality for Harmony; it is actually pretty complicated in their library and handles things like removing certain reasoning messages between generations, which would need to be re-implemented again in Python. If there are true bugs in their renderer, then it should probably be an issue and PR to the harmony repo. I do think the following change is good though, and will help function calling on the chat completions API route in general: https://github.com/vllm-project/vllm/pull/24473/files#diff-f3135631994e5e8f63fff36f5fb493f404a7c253c004183613a007548156e558R1573
I'm hoping that most of the issues you're seeing are caused by hardware issues, because there should be flags which can resolve them.
@alecsolder
Thank you! If you want to try, you can try running with . If you get the same issue still then someone else may know more! Also, I do think that the with_custom_tools change is needed for the chat completions endpoint, so if you want to make a PR just for that, I think it makes sense!
@alecsolder
@alecsolder
vLLM options: "--model openai/gpt-oss-20b --max-model-len 60000 --gpu-memory-utilization 0.6 --tool-call-parser openai --reasoning-parser openai_gptoss --enable-auto-tool-choice --max-num-batched-tokens 512 --host 0.0.0.0 --port 8000"
The 'tool_calls not found' error occurs with approximately 50% probability.
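For reference, the request shape that triggers it looks roughly like this (a sketch using the standard OpenAI Python client against the local vLLM server; the tool definition is only illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "project_tree",
        "description": "List the project directory tree",
        "parameters": {
            "type": "object",
            "properties": {"max_depth": {"type": "integer"}},
        },
    },
}]

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Show me the project structure."}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    # When the bug hits, the stream finishes without ever emitting a tool_calls
    # delta, which shows up client-side as the 'tool_calls not found' error.
    print(chunk.choices[0].delta)
```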
@alecsolder
(screenshots attached)
Purpose
I've confirmed that when using the harmony library, we are passing data to the model in a different format than what the model outputs. This PR therefore temporarily fixes the format so that the model no longer gets confused and stops responding when using /v1/chat/completions with stream and tools.
before:
<|start|>assistant to=functions.project_tree<|channel|>commentary json<|message|>{"max_depth": 10}<|end|>
after:
<|start|>assistant<|channel|>commentary to=functions.project_tree json<|message|>{"max_depth": 10}<|call|>
Reference material: https://cookbook.openai.com/articles/openai-harmony#preambles
Test Plan
Test Result