
Conversation

@Jialin (Contributor) commented Oct 7, 2025

Purpose

The default GC setup is (700, 10, 10), which means:

  • GC0 (a generation-0 collection) is triggered whenever allocated_obj - deallocated_obj >= 700 in generation 0
  • GC1 is triggered after every 10 GC0s
  • GC2 is triggered after every 10 GC1s

In large-batch-size scenarios (small models), each batch could be as large as 1024 requests, so GC0 is triggered every decode cycle, GC1 every 10 decode cycles, and GC2 every 100 decode cycles, which is very inefficient! (See the sketch below.)
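To make this concrete, here is a minimal standalone illustration of how CPython's default thresholds interact with a large decode batch. This is not vLLM code; the batch size and the list-per-token allocation pattern are stand-ins.

```python
import gc

# CPython's default generational thresholds: a gen-0 collection (GC0) runs once
# tracked allocations minus deallocations exceed 700; every 10th GC0 escalates
# to a GC1, and every 10th GC1 escalates to a GC2.
print(gc.get_threshold())  # (700, 10, 10) by default

# Stand-in for one decode step over a batch of 1024 requests, where each
# request's newly sampled token is wrapped in a fresh single-element list.
batch_size = 1024
for step in range(3):
    sampled = [[token_id] for token_id in range(batch_size)]  # 1024 tracked lists
    # With far more than 700 new container objects per step, GC0 fires at least
    # once per decode cycle.
    print(f"step {step}: gc counts = {gc.get_count()}")
```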

In this PR, we change output tokens from list[int] to Union[int, list[int]] (a plain int for the single-token case) to cut down the GC0 trigger frequency.
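As a rough sketch of the idea (the type alias and helpers below are illustrative placeholders, not the actual code in this PR): a bare int is not a GC-tracked container object, so the common single-token case no longer contributes to the generation-0 allocation count, while multi-token cases (e.g. spec decoding) can still use a list.

```python
from typing import Union

# Illustrative placeholders; the real names and call sites in the PR may differ.
OutputTokens = Union[int, list[int]]  # a bare int for the common single-token step

def sampled_to_output(sampled: list[int]) -> OutputTokens:
    """Return a plain int for one token so no GC-tracked list is allocated."""
    return sampled[0] if len(sampled) == 1 else sampled

def num_output_tokens(tokens: OutputTokens) -> int:
    return 1 if isinstance(tokens, int) else len(tokens)

def as_token_list(tokens: OutputTokens) -> list[int]:
    """Normalize at boundaries where callers still expect list[int]."""
    return [tokens] if isinstance(tokens, int) else tokens
```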

Test Plan & Test Result

  • Test 1: facebook/opt-125m, TP1, Input 48, Output 2000
  • Test 2: facebook/opt-125m, TP1, Input 48, Output 500
  • Test 3: Llama3 8B, TP2, Input 48, Output 500

Request Throughput Change

  • Test 1: facebook/opt-125m, long output: 30+% (10.04 -> 13.11 requests/s)
  • Test 2: facebook/opt-125m, short output: 5+% (48.95 -> 51.66 requests/s)
  • Test 3: Llama3 8B: 6+% (28.93 -> 30.70 requests/s)

With this change, GC time improved significantly.
[Screenshot: 2025-10-16 at 2:14:24 AM]



@Jialin (Contributor, Author) commented Oct 7, 2025

CC @yeqcharlotte @houseroad @WoosukKwon @njhill for an RFC on the proposal and high-level code changes. The PR might not be completely ready yet:

  • Callers might not be updated completely
  • Unit tests are still WIP

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a clever optimization to reduce garbage collection overhead by changing the representation of single output tokens from a list to an integer. The use of Union[int, list[int]] is a good way to handle both single and multiple token outputs efficiently. The changes are consistently applied across the codebase. My only feedback is to add unit tests for the new utility functions to ensure the correctness and robustness of this performance-critical change.

@Jialin changed the title from "Single output token" to "[Core] Performance: Use int instead of list[int] for single output token for GC optimization" on Oct 7, 2025
@Jialin (Contributor, Author) commented Oct 7, 2025

Resolve #26369

@Jialin (Contributor, Author) commented Oct 13, 2025

Gentle nudge @houseroad for the review :P

@mergify (bot) commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Jialin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify (bot) added the needs-rebase label Oct 14, 2025
Signed-off-by: Jialin Ouyang <[email protected]>
Signed-off-by: Jialin Ouyang <[email protected]>
Signed-off-by: Jialin Ouyang <[email protected]>
@Jialin force-pushed the single_output_token branch from 8512558 to 02914f9 on October 14, 2025 07:30
@mergify (bot) removed the needs-rebase label Oct 14, 2025
Signed-off-by: Jialin Ouyang <[email protected]>
@yeqcharlotte (Collaborator) commented Oct 14, 2025

thanks for the change. hhh it's hard to keep classes around to avoid GC. wondering what's the actual gain we see from bigger models? we can play with deepseek v3, qwen-235b etc. and try it for both throughput and latency sensitive use cases.

wondering if you folks have strong opinions on this @heheda12345 @njhill

Signed-off-by: Jialin Ouyang <[email protected]>
@zhuohan123 (Member) commented Oct 15, 2025

I am a bit hesitant on this PR. I feel it's over optimization. To achieve low latency, we probably should invest in:

  • Replace data structures with numpy arrays
  • Async scheduling
  • If Python overhead is actually that serious, we probably should seriously think about rewriting the scheduler in C++.

I am worried that optimizations like the one in this PR can pile up and make the code much harder for people to understand without actually improving the perf very significantly.

@Jialin (Contributor, Author) commented Oct 16, 2025

> I am a bit hesitant on this PR. I feel it's over optimization. To achieve low latency, we probably should invest in:
>
>   • Replace data structures with numpy arrays
>   • Async scheduling
>   • If Python overhead is actually that serious, we probably should seriously think about rewriting the scheduler in C++.
>
> I am worried that optimizations like the one in this PR can pile up and make the code much harder for people to understand without actually improving the perf very significantly.

Thanks @zhuohan123 for your suggestions; I'm totally aligned on the tradeoff between code complexity and performance, but I would love to clarify the change in this PR a bit more.

  • Motivation: reduce GC collection frequency by avoiding list[int] allocations in single-element scenarios, not because of the performance of list[int] itself. The PR summary has more details on the rationale.
  • Improvements: we've included more validation in the PR summary. Overall, we believe the improvement is significant for small models and large-batch-size scenarios (facebook/opt-125m: 30+% throughput improvement; Llama 8B: 6+% throughput improvement).
  • Alternatives:
    • Replace with numpy arrays: That's a great call that I didn't think of earlier. I think it's cleaner, wouldn't involve the GC, and would be friendly to both non-spec-decoding and spec-decoding scenarios. But it would also require some code changes, so I'd love to get more of your opinions before investing more time in this direction.
    • Async scheduling won't help in this scenario: as long as the data structures stay as-is, GC will kick in regardless and pause the whole process.
    • Just rewriting the scheduler in C++ also might not help: as long as we still use the Python interface and pass single output tokens around as list[int], GC would still happen.

Next step: @zhuohan123, per the new data and clarification above, do you feel this is something worth investigating further? If yes, I could try to explore the np.array suggestion more (a rough sketch is below). Appreciate it.
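For reference, a rough sketch of what the numpy direction could look like, assuming a preallocated per-request buffer (the class, names, and sizes are placeholders, not a concrete proposal):

```python
import numpy as np

# Placeholder sketch: write newly sampled tokens into preallocated storage so
# the steady-state decode loop stops creating new GC-tracked Python lists.
class OutputTokenBuffer:
    def __init__(self, max_output_tokens: int) -> None:
        self._tokens = np.empty(max_output_tokens, dtype=np.int64)
        self._len = 0

    def append(self, token_id: int) -> None:
        self._tokens[self._len] = token_id
        self._len += 1

    def as_list(self) -> list[int]:
        # Convert back to Python ints only where callers require list[int].
        return self._tokens[: self._len].tolist()
```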

@Jialin (Contributor, Author) commented Oct 16, 2025

> we can play with deepseek v3, qwen-235b etc. and try it for both throughput and latency sensitive use cases.

@yeqcharlotte To be very honest, this change might only be helpful for small models and large-batch-size scenarios, so the effect could be small for large models like DeepSeek V3 and Qwen-235B :/

@heheda12345 (Collaborator) commented

I'm OK with the merged PRs for GC optimizations inside KVCacheManager, as I do get some complaints about KVCacheManager overhead. But I believe optimizing GC for the whole project is not very practical.
