Coalesce text diffs in streaming requests. #4923
Conversation
@pathorn Thanks for your contribution. Did you see this token-repetition issue on the chat API as well? I think we can try to figure out the root cause within a time window, and if we cannot fix it before the next release, then we can merge this PR as a workaround (WAR). Thanks again for reporting the issue and providing a solution.
@LinPoly We observed this issue on both the chat and completion APIs. We suspect the output object is being updated even while the post_processor is running, since we see create_logprobs_completion crash because the lengths do not match.
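A hypothetical illustration of the suspected race (illustrative names only, not the actual create_logprobs_completion code in TRT-LLM): if the engine keeps appending to the shared output while the post-processor is reading it, a logprobs builder that assumes tokens and logprobs describe the same snapshot can see mismatched lengths.

```python
# Hypothetical sketch, not the real TRT-LLM function: the shared output grows
# while this runs, so the two lists can have different lengths at read time.
def build_logprobs(token_ids: list[int], logprobs: list[float]) -> list[dict]:
    if len(token_ids) != len(logprobs):
        # This is where the length-mismatch crash would surface.
        raise ValueError(f"length mismatch: {len(token_ids)} tokens vs {len(logprobs)} logprobs")
    return [{"token_id": t, "logprob": lp} for t, lp in zip(token_ids, logprobs)]
```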
@Shang-Pin Do you happen to have a reproduction script for the chat or completion endpoint? Thanks!
@LinPoly It usually happens under high load, when the FastAPI server may be slow to keep up. We don't have a script that reproduces it reliably, but if you enable logprobs and run the engine under high load, it will usually happen.
@Shang-Pin Thanks for the info, so I think there are two separate issues:
Not sure whether they share the same root cause. We now have a fix PR for the first issue; it does fix a concurrent-update bug in the completion API, as both of you @Shang-Pin @pathorn suggested. I will check the second issue, and it would be great if you could review the PR to see whether it makes sense to you. Thanks again for using and helping improve TRTLLM.
@pathorn, @Shang-Pin,
Sorry, we have not had a chance to test the above change (which was already merged), and even if we did, the production environment has changed, so there is no guarantee that we would hit the conditions necessary to reproduce the original issue. Given that this PR was a hack, I think it does not make sense to continue with it for now, unless we have evidence that the issue still exists.
Then I'll close this PR for now. Please feel free to open a new one if needed~
fix/hack: Coalesce text diffs in streaming requests.
Description
Sometimes, streaming output from openai_server will produce the same message twice and skip another message.
For example, this is a packet capture from a bad request:
138\r\ndata: {"id":"cmpl-ef21ea5a4fe640f1bea24729b7a0b07d","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":"1. ","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":160}}\n\n\r\n",
138\r\ndata: {"id":"cmpl-6b5546146cdf41f7a0b64b2de0309288","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":" Clarify","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":161}}\n\n\r\n",
...
"139\r\ndata: {"id":"cmpl-1ff642a3baa14acf980a8632e744f46c","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5fc9048a4a9f424b9518c976083a369f","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5782671ed8a54e4383c0ddbac6a82b68","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-cd64d34f1091405797f53f370064e63b","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}
In particular, completion_tokens is calculated using the current value of output.length, and since output is the same object at each iteration, this is always the current output length, which is why it always says 172 after the lag spike in the example above.
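A minimal sketch of that failure mode, using hypothetical names rather than the actual openai_server code: because usage is computed from the shared output at send time, every chunk flushed after a stall reports the latest length.

```python
# Hypothetical illustration: `output_token_ids` stands in for the single, shared
# result object that the engine keeps appending to. Computing usage from it at
# send time reports the length at flush time, not at generation time.
output_token_ids: list[int] = []

def make_usage(prompt_tokens: int) -> dict:
    completion_tokens = len(output_token_ids)  # read *now*, not when the token was produced
    return {
        "prompt_tokens": prompt_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "completion_tokens": completion_tokens,
    }

# If three chunks are flushed together after a lag spike, all three report the same
# completion_tokens (e.g. 172 in the capture above) instead of 170, 171, 172.
```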
As for text_diff, it uses an internal property of GenerationResultBase that stores the last position. My suspicion is that this last position is advanced even when the generator is not being consumed, which is why there is data loss and there are duplicate tokens. To solve this, I keep track of the last text position in the params object, which is local to the request, and pass it into the GenerationResultBase getter. Finally, this would create a situation where one packet contains the entire text diff, followed by a bunch of empty updates, so I added a hacky check to drop the empty updates.

This change is a bit of a hack. It doesn't address the root cause of state updates being interleaved with streaming generator polling, but it prevents these missed packets from producing corrupted output in the frontend. I would prefer a better change, but this is what I was able to come up with.
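A rough sketch of the workaround described above, with illustrative names (the real change touches the OpenAI server and the GenerationResultBase getter): the last emitted text position lives in per-request state rather than on the shared result, and empty diffs are dropped.

```python
# Hypothetical sketch of the coalescing workaround (names are illustrative, not the
# actual TRT-LLM identifiers).
class RequestStreamState:
    """Per-request state; lives only for this streaming request."""
    def __init__(self) -> None:
        self.last_text_pos = 0  # tracked here instead of on the shared result object

def text_diff(full_text: str, state: RequestStreamState) -> str:
    """Return everything generated since the last position sent for *this* request."""
    diff = full_text[state.last_text_pos:]
    state.last_text_pos = len(full_text)
    return diff

def stream_chunks(snapshots, state: RequestStreamState):
    """Yield only non-empty diffs; updates coalesced into an earlier chunk are skipped."""
    for full_text in snapshots:
        diff = text_diff(full_text, state)
        if not diff:
            continue  # the hacky check: drop empty updates
        yield diff
```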
Test Coverage
Run DeepSeek-R1-FP4
Run several extremely large prefills while performing streaming requests, to create high load. If you time it right, the server (without this patch) will skip one packet and produce a duplicate with the same completion_tokens in each.
(Also, for some reason, I was unable to reproduce the issue with curl, only with Python aiohttp. I don't know why.)
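A rough repro sketch along those lines — the endpoint URL, model name, payload, and concurrency below are assumptions for illustration, not part of the original report:

```python
# Stream many concurrent completions and flag duplicate completion_tokens values,
# which indicate the skipped/duplicated chunk described above.
import asyncio
import json

import aiohttp

async def one_stream(session: aiohttp.ClientSession) -> None:
    payload = {
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": "Write a detailed project plan:",
        "max_tokens": 256,
        "stream": True,
        "logprobs": 1,  # the issue was easier to hit with logprobs enabled
    }
    async with session.post("http://localhost:8000/v1/completions", json=payload) as resp:
        seen: set[int | None] = set()
        async for raw in resp.content:  # yields SSE lines as bytes
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            chunk = json.loads(line[len("data:"):])
            usage = chunk.get("usage") or {}
            ct = usage.get("completion_tokens")
            if ct in seen:
                print("duplicate completion_tokens:", ct)
            seen.add(ct)

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Many concurrent streams (plus large prefills issued separately) to create load.
        await asyncio.gather(*(one_stream(session) for _ in range(32)))

asyncio.run(main())
```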