[Feat][EPLB] A novel static EPLB placement strategy for MoE models. #23745
Conversation
Code Review
This pull request introduces a novel "Zigzag" static expert placement strategy for MoE models, which is a welcome performance optimization. The implementation is mostly sound, with the necessary configuration options and logic added. My review includes two main points of feedback. Firstly, there's some redundant code in the new zigzag placement logic that can be removed to improve clarity and correctness. Secondly, the assertion for validating the zigzag placement configuration could be improved by splitting it into multiple assertions with more specific error messages, which would enhance the developer experience when debugging configuration issues.
Force-pushed f2add2d to a89dcd3
You can fix the pre-commit about
Done. Could you please take another look? Thanks
I'm wondering if we should bother to make this configurable and just use
Thanks for the reply! Yes, I believe Zigzag could be a more suitable default placement strategy for grouped expert models.
QQ: are you using a random dataset for benchmarking here?
Just curious why the accuracy is good, since this PR doesn't seem to modify the weight loader; the weights are loaded onto GPUs assuming, say, experts [0, 1, 2, 3] go to GPU 0, [4, 5, 6, 7] to GPU 1, etc. Could you please provide the scripts you used for the benchmarking and accuracy tests? Thank you!
However, with EPLB enabled, I don't think this holds, as it directly breaks the algorithm's assumption about the physical experts' locations; moreover, the EPLB algorithm already accounts for expert groups.
@abmfy
As for the weight loading, as I recall the call order is roughly: my change (static expert placement) is applied during model initialization, before load_weights runs. In other words, I finalize the expert_map first, and that mapping is then used by the weight loader.
Static placement just decides the initial expert-to-device mapping. Dynamic EPLB still has full control at runtime to rebalance traffic or remap according to its own strategy. They don't interfere; in fact, they can complement each other, because a well-chosen static placement provides a good starting point, while dynamic EPLB continues to adapt to load patterns.
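To make that ordering concrete, here is a minimal, hypothetical sketch (the class and helper names are illustrative, not vLLM's actual API): the expert_map is fixed at construction time, and the weight loader then filters weights against it.

```python
# Hypothetical sketch of the ordering described above; not vLLM's actual code.

def build_expert_map(num_experts: int, num_ranks: int) -> dict[int, int]:
    """Assign each global expert id to an owning rank (contiguous blocks here)."""
    per_rank = num_experts // num_ranks
    return {e: e // per_rank for e in range(num_experts)}

class MoELayerSketch:
    def __init__(self, num_experts: int, num_ranks: int, rank: int):
        self.rank = rank
        # 1) Static placement runs first, during model initialization.
        self.expert_map = build_expert_map(num_experts, num_ranks)

    def load_weights(self, weights: dict[int, object]) -> None:
        # 2) The weight loader consults the already-finalized expert_map and
        #    keeps only the experts owned by this rank.
        self.local_weights = {
            e: w for e, w in weights.items() if self.expert_map[e] == self.rank
        }

# Rank 1 of 4 ends up holding experts 4..7 out of 16.
layer = MoELayerSketch(num_experts=16, num_ranks=4, rank=1)
layer.load_weights({e: f"weight_{e}" for e in range(16)})
assert sorted(layer.local_weights) == [4, 5, 6, 7]
```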
Thanks for your quick response and the contribution!
No, actually the weight loader relies on the mapping in fused_moe.py#L1739-L1768. But this PR doesn’t seem to be doing that (though it’s certainly doable), which is why I’m asking.
That’s good to hear! I’m not questioning the reliability — just a bit concerned that some modifications (e.g., to the weight loader since it doesn’t use the
I think we can add an option to enable the round-robin arrangement, but it shouldn't be the default, since I'm concerned that other components (e.g., the EP kernels) may rely on the current linear expert arrangement. The current EPLB implementation also assumes a linear arrangement of physical experts, so the two cannot work together. That said, I agree we could support the round-robin arrangement only when EPLB is disabled, since EPLB is typically used with redundant experts and adapting round-robin to that case would be difficult.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 3e90dd0 to 2e88bcc
Purpose
This PR introduces a novel static expert load balancing placement strategy (called Zigzag) designed for MoE models with multiple expert groups, such as the DeepSeek series.
Through our heatmap analysis, we observed that in multi-expert-group MoE models such as DeepSeek, experts within the same group tend to be selected together in practical scenarios. Therefore, distributing them across different devices can bring performance benefits.
The zigzag expert placement feature has been validated on DeepSeek-R1, demonstrating a ~8% improvement in QPM (queries per minute) over the default configuration in our online serving benchmark on a single node with 8 H20 GPUs.
The zigzag strategy optimizes how experts are distributed across parallel ranks by implementing a staggered placement pattern, which helps achieve better load balancing across expert parallel groups. This is particularly beneficial for models that use grouped top-k routing, where experts are organized into logical groups and the routing decisions are made within these groups. The implementation ensures that experts are distributed more evenly across ranks, reducing load imbalance and improving overall throughput performance in production environments.
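As a toy illustration of the staggered idea (this is not the PR's actual placement code; it assumes expert groups are contiguous blocks of expert ids, as in DeepSeek-style models, and that the expert count divides evenly by the rank count), compare a contiguous layout with a zigzag/round-robin layout:

```python
# Toy sketch only; not the PR's actual placement logic.

def linear_placement(num_experts: int, num_ranks: int) -> list[int]:
    """Default contiguous layout: each rank owns a consecutive block of experts."""
    per_rank = num_experts // num_ranks
    return [e // per_rank for e in range(num_experts)]

def zigzag_placement(num_experts: int, num_ranks: int) -> list[int]:
    """Staggered layout: consecutive expert ids (which share an expert group
    when groups are contiguous id blocks) are spread across different ranks."""
    return [e % num_ranks for e in range(num_experts)]

# 16 experts, 4 groups of 4 contiguous ids, 4 ranks: with the contiguous layout,
# group 0 (experts 0-3) all land on rank 0; with zigzag they span ranks 0-3.
print(linear_placement(16, 4))   # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(zigzag_placement(16, 4))   # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```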
Performance
Test Platform:
vLLM version: vllm/vllm-openai:v0.10.1.1
Model: DeepSeek-V2-Chat-0628
GPU: H20 * 8
Parallel config 1: tp=8, enable_expert_parallel=True
Benchmark config: input_len=1024, output_len=512, request_rate=8, max_concurrency=8, num_prompts=32:
```bash
python3 ./bench_serving.py \
    --backend vllm \
    --dataset-name random \
    --model ${MODEL_PATH} \
    --random-input-len 1024 \
    --random-output-len 128 \
    --random-range-ratio 0.5 \
    --tokenizer ./tokenizer \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 8 \
    --max-concurrency 8 \
    --num-prompts 32 \
    --base-url http://127.0.0.1:8000 \
    --port 8000
```
Conclusion: With only expert parallelism enabled, Zigzag improves throughput and end-to-end latency by approximately 3%.
Accuracy Test
Tested with DeepSeek-V2-Chat-0628 on a single node with 8 H20 GPUs using the following serving command:
Note: DeepSeek-V2 behaves poorly on our chosen dataset; this test is only meant to confirm that zigzag has no impact on accuracy.
Also tested with DeepSeek-R1-0528 on a single node with 8 H20 GPUs and verified that zigzag has no impact on accuracy.
Usage
To try out the Zigzag static EPLB strategy, enable it with the following options:
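The exact options were not captured above; as a hypothetical example (the placement flag name and value are assumptions and may differ from what the PR finally ships), serving with expert parallelism plus zigzag placement might look like:

```bash
# Hypothetical invocation: --expert-placement-strategy is an assumed flag name;
# check the PR diff or the vLLM docs for the option that actually ships.
vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --expert-placement-strategy zigzag
```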
Compatibility
The zigzag pattern is designed for MoE models with multiple expert groups, such as the DeepSeek series. Note that MoE models without expert groups cannot benefit from this method.
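One quick way to check whether a given checkpoint uses grouped expert routing (and can therefore benefit) is to inspect its Hugging Face config; the n_group field name below follows the DeepSeek-style configs and may not exist for other MoE families.

```python
# Check for grouped expert routing; n_group follows DeepSeek-style HF configs.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V2-Chat-0628", trust_remote_code=True
)
n_group = getattr(cfg, "n_group", None)
if n_group and n_group > 1:
    print(f"{cfg.model_type}: {n_group} expert groups -> zigzag placement applies")
else:
    print(f"{cfg.model_type}: no expert groups -> zigzag placement has no effect")
```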