Coalesce text diffs in streaming requests. #4923
Conversation
@pathorn Thanks for your contribution. Did you see this token-repetition issue on the chat API as well? I think we can try to figure out the root cause within a time window, and if we cannot fix it before the next release, then we can merge this PR as a workaround (WAR). Thanks again for reporting the issue and providing a solution.
@LinPoly We observed this issue on both the chat and completion APIs. We suspect the output object is being updated even while the post_processor is running, since we see create_logprobs_completion crash because the lengths do not match.
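A hypothetical illustration of the suspected race (illustrative names only, not the actual create_logprobs_completion code in TRT-LLM): if the engine keeps appending to the shared output while the post-processor is reading it, a logprobs builder that assumes tokens and logprobs describe the same snapshot can see mismatched lengths.

```python
# Hypothetical sketch, not the real TRT-LLM function: the shared output grows
# while this runs, so the two lists can have different lengths at read time.
def build_logprobs(token_ids: list[int], logprobs: list[float]) -> list[dict]:
    if len(token_ids) != len(logprobs):
        # This is where the length-mismatch crash would surface.
        raise ValueError(f"length mismatch: {len(token_ids)} tokens vs {len(logprobs)} logprobs")
    return [{"token_id": t, "logprob": lp} for t, lp in zip(token_ids, logprobs)]
```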
@Shang-Pin Do you happen to have a reproduction script for the chat or completion endpoint? Thanks!
@LinPoly It usually happens under high load, when the FastAPI server may be slow to keep up. We don't have a script that reproduces it reliably, but if you enable logprobs and run the engine under high load, it will usually happen.
@Shang-Pin Thanks for the info, so I think there are two separate issues:
Not sure whether they share the same root cause. We now have a fix PR for the first issue; it does fix a concurrent-update bug in the completion API, as both of you @Shang-Pin @pathorn suggested. I will check the second issue, and it would be great if you could review the PR to see whether it makes sense to you. Thanks again for using and helping improve TRTLLM.
@pathorn, @Shang-Pin,
Sorry, we have not had a chance to test the above change (which was already merged), and even if we did, the production environment has changed, so there is no guarantee that we would hit the conditions necessary to reproduce the original issue. Given that this PR was a hack, I think it does not make sense to continue with it for now, unless we have evidence that the issue still exists.
Then I'll close this PR for now. Please feel free to open a new one if needed~
fix/hack: Coalesce text diffs in streaming requests.
Description
Sometimes, streaming output from openai_server will produce the same message twice and skip another message.
For example, this is a packet capture from a bad request:
138\r\ndata: {"id":"cmpl-ef21ea5a4fe640f1bea24729b7a0b07d","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":"1. ","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":160}}\n\n\r\n",
138\r\ndata: {"id":"cmpl-6b5546146cdf41f7a0b64b2de0309288","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":" Clarify","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":161}}\n\n\r\n",
...
"139\r\ndata: {"id":"cmpl-1ff642a3baa14acf980a8632e744f46c","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5fc9048a4a9f424b9518c976083a369f","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5782671ed8a54e4383c0ddbac6a82b68","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-cd64d34f1091405797f53f370064e63b","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}
In particular, completion_tokens is calculated using the current value of output.length, and since output is the same object at each iteration, this is always the current output length, which is why it always says 172 after the lag spike in the example above.
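A minimal sketch of that failure mode, using hypothetical names rather than the actual openai_server code: because usage is computed from the shared output at send time, every chunk flushed after a stall reports the latest length.

```python
# Hypothetical illustration: `output_token_ids` stands in for the single, shared
# result object that the engine keeps appending to. Computing usage from it at
# send time reports the length at flush time, not at generation time.
output_token_ids: list[int] = []

def make_usage(prompt_tokens: int) -> dict:
    completion_tokens = len(output_token_ids)  # read *now*, not when the token was produced
    return {
        "prompt_tokens": prompt_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "completion_tokens": completion_tokens,
    }

# If three chunks are flushed together after a lag spike, all three report the same
# completion_tokens (e.g. 172 in the capture above) instead of 170, 171, 172.
```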
As for text_diff, it uses an internal property of GenerationResultBase that stores the last position. My suspicion is that this last position is advanced even when the generator is not being consumed, which is why there is data loss and there are duplicate tokens. To solve this, I keep track of the last text position in the params object, which is local to the request, and pass it into the GenerationResultBase getter. Finally, this would create a situation where one packet contains the entire text diff, followed by a bunch of empty updates, so I added a hacky check to drop the empty updates.

This change is a bit of a hack. It doesn't address the root cause of state updates being interleaved with streaming generator polling, but it prevents these missed packets from producing corrupted output in the frontend. I would prefer a better change, but this is what I was able to come up with.
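A rough sketch of the workaround described above, with illustrative names (the real change touches the OpenAI server and the GenerationResultBase getter): the last emitted text position lives in per-request state rather than on the shared result, and empty diffs are dropped.

```python
# Hypothetical sketch of the coalescing workaround (names are illustrative, not the
# actual TRT-LLM identifiers).
class RequestStreamState:
    """Per-request state; lives only for this streaming request."""
    def __init__(self) -> None:
        self.last_text_pos = 0  # tracked here instead of on the shared result object

def text_diff(full_text: str, state: RequestStreamState) -> str:
    """Return everything generated since the last position sent for *this* request."""
    diff = full_text[state.last_text_pos:]
    state.last_text_pos = len(full_text)
    return diff

def stream_chunks(snapshots, state: RequestStreamState):
    """Yield only non-empty diffs; updates coalesced into an earlier chunk are skipped."""
    for full_text in snapshots:
        diff = text_diff(full_text, state)
        if not diff:
            continue  # the hacky check: drop empty updates
        yield diff
```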
Test Coverage
Run DeepSeek-R1-FP4
Run several extremely large prefills while performing streaming requests, to create high load. If you time it right, the server (without this patch) will skip one packet and produce a duplicate with the same completion_tokens in each.
(Also, for some reason, I was unable to reproduce the issue with curl, only with Python aiohttp. I don't know why.)
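A rough repro sketch along those lines — the endpoint URL, model name, payload, and concurrency below are assumptions for illustration, not part of the original report:

```python
# Stream many concurrent completions and flag duplicate completion_tokens values,
# which indicate the skipped/duplicated chunk described above.
import asyncio
import json

import aiohttp

async def one_stream(session: aiohttp.ClientSession) -> None:
    payload = {
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": "Write a detailed project plan:",
        "max_tokens": 256,
        "stream": True,
        "logprobs": 1,  # the issue was easier to hit with logprobs enabled
    }
    async with session.post("http://localhost:8000/v1/completions", json=payload) as resp:
        seen: set[int | None] = set()
        async for raw in resp.content:  # yields SSE lines as bytes
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            chunk = json.loads(line[len("data:"):])
            usage = chunk.get("usage") or {}
            ct = usage.get("completion_tokens")
            if ct in seen:
                print("duplicate completion_tokens:", ct)
            seen.add(ct)

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Many concurrent streams (plus large prefills issued separately) to create load.
        await asyncio.gather(*(one_stream(session) for _ in range(32)))

asyncio.run(main())
```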