fix: make inference deterministic for large TP #10930
Conversation
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Yangmin Li <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Summary of Changes
Hello @JustinTong0323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the SGLang framework by extending deterministic inference to models that use Tensor Parallelism (TP) greater than one; previously, this feature was limited to single-TP setups. The change programmatically sets NCCL's all-reduce algorithm and disables custom all-reduce when TP is active, removing a critical constraint. Additionally, a minor adjustment was made to a test file to ensure robust path handling for test data.
Code Review
This pull request enables deterministic inference for tensor parallelism greater than 1 by setting NCCL_ALGO to allreduce:tree and disabling custom all-reduce. The changes look reasonable, but I've identified a minor issue in a warning message that could be misleading. The change in the test file to use a relative path is a good improvement for robustness.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Interesting result~ Have you tried it on other MoE models, such as Mixtral?
mistralai/Mixtral-8x7B-v0.1 passed all tests on TP=2 with fa3.
Motivation
Closes #10785
This PR enables deterministic inference for models using TP > 1.
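For context, the underlying issue is that floating-point addition is not associative, so an all-reduce that combines partial sums in a run-dependent order can produce run-dependent results. A minimal standalone illustration (not taken from this PR):

```python
# Floating-point addition is not associative: the same operands summed in a
# different order can round differently, which is why a fixed all-reduce
# algorithm (and hence a fixed reduction order) is needed for determinism.
a, b, c = 1e16, 1.0, -1e16

left = (a + b) + c   # 1.0 is absorbed by 1e16 first, then cancelled -> 0.0
right = (a + c) + b  # the large terms cancel first, 1.0 survives     -> 1.0

print(left, right)   # 0.0 1.0
```

This is exactly the kind of order sensitivity that varying NCCL all-reduce algorithms expose across runs.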
Modifications
Set NCCL_ALGO="allreduce:tree" and disable custom all-reduce when TP > 1.

Accuracy Tests
Tests passed for both dense and MoE models (TP=2, 4, 8) across all backends (fa3, flashinfer, triton) using sglang.test.test_deterministic. However, we found one exception: the prefix test failed for the MoE model Qwen3-30B-A3B with TP=2 and the fa3 backend, generating two unique outputs for prefix length 1 (but mistralai/Mixtral-8x7B-v0.1 passed).

Reproduce
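A minimal sketch of the setup described above (the exact arguments of the test module are assumptions, not taken from this PR):

```python
import os
import sys

# Pin NCCL's all-reduce to the tree algorithm, as this PR does for TP > 1,
# so the reduction order (and thus floating-point rounding) is fixed per run.
os.environ["NCCL_ALGO"] = "allreduce:tree"

# Hypothetical invocation of the deterministic test suite mentioned above;
# run it against a server launched with TP > 1 and custom all-reduce
# disabled. The actual CLI arguments may differ.
cmd = [sys.executable, "-m", "sglang.test.test_deterministic"]
print("Would run:", " ".join(cmd))
```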
You should see results like:
Checklist