[Feat][EPLB][Perf] Enable Round-robin expert placement strategy while eplb is enabled. #25798
Description:
PR-23745 introduced the round-robin expert placement strategy for MoE models with multiple expert groups, providing a simple yet effective way to distribute experts evenly across devices.
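For intuition, here is a minimal sketch (illustrative only, not the vLLM source; function names are hypothetical) of how the two strategies map global expert IDs onto expert-parallel (EP) ranks:

```python
# Illustrative sketch of "linear" vs "round_robin" expert placement.
# Function names are hypothetical, not vLLM's actual expert-map code.

def linear_placement(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Contiguous blocks: rank r hosts experts [r*k, (r+1)*k)."""
    per_rank = num_experts // ep_size
    return list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))

def round_robin_placement(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Strided assignment: rank r hosts experts r, r + ep_size, r + 2*ep_size, ..."""
    return list(range(ep_rank, num_experts, ep_size))

# 16 experts over 4 EP ranks. With grouped routing, "linear" keeps a whole
# expert group on one device, while "round_robin" strides each group across
# all devices, which tends to even out per-device load:
for r in range(4):
    print(r, linear_placement(16, 4, r), round_robin_placement(16, 4, r))
```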
This PR extends that work by ensuring full compatibility with EPLB (Expert Parallel Load Balancing). With this enhancement, round-robin placement can now be seamlessly combined with dynamic expert load balancing, enabling more flexible expert scheduling while maintaining balanced utilization and performance.
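The key point is that the placement strategy only fixes the initial physical-to-logical expert mapping; EPLB then permutes that mapping at runtime based on observed load. Below is a hedged sketch of that interaction (the greedy rebalancer and all names are illustrative, not vLLM's actual EPLB algorithm):

```python
import numpy as np

def rebalance(mapping: np.ndarray, load: np.ndarray) -> np.ndarray:
    """Toy rebalance: redistribute logical experts so the per-rank sum of
    observed load is as even as possible (capacity-limited greedy)."""
    num_ranks, per_rank = mapping.shape
    experts = mapping.flatten()
    order = experts[np.argsort(-load[experts])]  # hottest experts first
    rank_load = np.zeros(num_ranks)
    buckets: list[list[int]] = [[] for _ in range(num_ranks)]
    for e in order:
        # Send each expert to the least-loaded rank that still has a free slot.
        r = int(np.argmin([rank_load[i] if len(buckets[i]) < per_rank else np.inf
                           for i in range(num_ranks)]))
        buckets[r].append(int(e))
        rank_load[r] += load[e]
    return np.array(buckets)

# Round-robin initial map for 16 experts on 4 ranks: row r = [r, r+4, r+8, r+12].
init = np.arange(16).reshape(-1, 4).T
load = np.random.rand(16)  # stand-in for measured per-expert load
print(rebalance(init, load))
```

Because the round-robin map already spreads every expert group across all ranks, the balancer starts from a nearly even state instead of having to undo a skewed linear layout.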
Performance
Conclusion: with the configuration listed below and EPLB enabled, the round-robin strategy improves average throughput and end-to-end latency by approximately 3% over the default linear strategy.
Test Platform:
vLLM version: vllm/vllm-openai:nightly-8c546102658f97b10d13bcf25193b65edc6ea6ff
Model: DeepSeek-V2-Chat-0628
GPU: H20 × 8
Serving config:
```bash
python3 -u -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    -tp 8 \
    --enable-expert-parallel \
    --enable-eplb \
    --expert-placement-strategy "round_robin"
```
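Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (assuming the openai Python client; the model name is an assumption and must match the `--model` value used above):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the api_key is unused for a local server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Chat-0628",  # must match the served model name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```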
Benchmark config (input_len=1024, output_len=128, request_rate=4, max_concurrency=4, num_prompts=32):
```bash
python3 ./bench_serving.py \
    --backend vllm \
    --dataset-name random \
    --model ${MODEL_PATH} \
    --random-input-len 1024 \
    --random-output-len 128 \
    --random-range-ratio 0.5 \
    --tokenizer ./tokenizer \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 4 \
    --max-concurrency 4 \
    --num-prompts 32 \
    --base-url http://127.0.0.1:8000 \
    --port 8000
```
Accuracy Test
Tested with DeepSeek-V2-Chat-0628 on H20 × 8 with the following serving command:
Note: DeepSeek-V2 performs poorly on our chosen dataset regardless of configuration; this test is only meant to confirm that the PR has no impact on accuracy.