[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP #5435
Conversation
Force-pushed from d003a6b to a9f032d
/bot run
PR_Github #9993 [ run ] triggered by Bot
PR_Github #9993 [ run ] completed with state
/bot run
PR_Github #10027 [ run ] triggered by Bot
PR_Github #10027 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #10043 [ run ] triggered by Bot
PR_Github #10043 [ run ] completed with state
Thanks for this work! Unfortunately, I think we need to get rid of this assumption before we can merge this:
// This allows accommodating 256 experts x 64k tokens; reasonable workload should not exceed this
I also think we should try to be less wasteful with our block sizes. In the worst assumed case above (assuming topk=8), we are launching 16M threads, of which only 256k contribute anything.
Five review threads on cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu were marked outdated and resolved.
See the comment above about using BlockRadixRank; we can reduce this to only num_tokens*topk threads.
The final permuted idx is:
selected_expert = token_selected_experts[blockIdx.x * blockDim.x + threadIdx.x];
dest_token_id = expert_first_token_offset[selected_expert] + (block_rank[blockIdx.x][threadIdx.x] - block_exclusive_digit_prefix[blockIdx.x][selected_expert]);
I don't think I fully understand your comment. If using BlockRadixRank, what are the gridDim and blockDim?
The total number of threads should be num_tokens*topk; we can divide these into blocks however we want. It's an embarrassingly parallel operation in the case of mergeExpertPrefixSumKernel.
I see your point. Yes, mergeExpertPrefixSumKernel can be optimized as I replied above, thanks!
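For concreteness, here is a minimal single-block sketch of the scatter discussed in this thread, assuming gridDim.x == 1 and num_tokens * topk <= blockDim.x. The per-expert rank is hand-rolled with shared-memory atomics instead of cub::BlockRadixRank (so it is not order-stable within an expert), and the names token_selected_experts, expert_first_token_offset, and permuted_idx are illustrative assumptions rather than the actual kernels in moe_kernels.cu.
template <int kMaxExperts>
__global__ void scatterSketch(int const* token_selected_experts,  // [num_tokens * topk]
    int const* expert_first_token_offset,                         // [num_experts + 1], precomputed
    int* permuted_idx,                                             // [num_tokens * topk]
    int num_expanded_tokens)
{
    __shared__ int rank_counter[kMaxExperts];
    for (int e = threadIdx.x; e < kMaxExperts; e += blockDim.x)
    {
        rank_counter[e] = 0;
    }
    __syncthreads();

    int const idx = threadIdx.x; // one thread per (token, expert) pair
    if (idx < num_expanded_tokens)
    {
        int const selected_expert = token_selected_experts[idx];
        // Rank of this expanded token among tokens routed to the same expert, i.e. the role
        // played by (block_rank - block_exclusive_digit_prefix) in the formula above.
        int const rank_in_expert = atomicAdd(&rank_counter[selected_expert], 1);
        permuted_idx[idx] = expert_first_token_offset[selected_expert] + rank_in_expert;
    }
}
A multi-block version would additionally need a per-(block, expert) offset, which is exactly what the block ranks and exclusive digit prefixes in the formula above provide.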
Four more review threads on cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu were marked outdated and resolved.
/bot run
PR_Github #10173 [ run ] triggered by Bot
PR_Github #10173 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
Squashed commits: refactor; integration; fix large workload; fix PDL; fix; fix large workload; clean unused; fix profiler; move reserve from expandInput.
Signed-off-by: Enwei Zhu <[email protected]>
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #10185 [ run ] triggered by Bot
PR_Github #10185 [ run ] completed with state
Let's merge this PR to unblock the E2E optimization of large-scale EP, and continue the refinements in subsequent PRs.
Hi Daniel,
We need to unblock the large-scale EP E2E performance optimizations, and I noticed that most of the comments left on this PR have been addressed by Enwei, so for now I will unblock the merge of this PR.
Enwei will work with you to discuss further refinement of the related logic.
Thanks,
June
[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP
Description
This PR implements custom sort logic before the MoE GEMMs, replacing the original CUB sort invocation.
In a typical large-scale EP workload (EP=32, per-GPU batch=128):
Before this PR: 5 kernels
After this PR: 3 kernels
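For context, a counting-sort-style pipeline is the usual way to replace a generic radix sort when the keys are just expert ids; the sketch below shows that general shape in three stages (histogram, exclusive prefix sum, scatter). Kernel names, signatures, and the exact stage split are assumptions for exposition, not necessarily the kernels this PR adds.
#include <cstdint>

// Stage 1: count how many expanded tokens are routed to each expert.
__global__ void histogramExpertsSketch(
    int const* token_selected_experts, int* expert_counts, int num_expanded_tokens)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_expanded_tokens)
    {
        atomicAdd(&expert_counts[token_selected_experts[idx]], 1);
    }
}

// Stage 2: exclusive prefix sum over the counts gives each expert's first token offset.
__global__ void exclusivePrefixSumSketch(
    int const* expert_counts, int64_t* expert_first_token_offset, int num_experts)
{
    // Serial scan for clarity; a real kernel would use a block-wide scan.
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        int64_t running = 0;
        for (int e = 0; e < num_experts; ++e)
        {
            expert_first_token_offset[e] = running;
            running += expert_counts[e];
        }
        expert_first_token_offset[num_experts] = running;
    }
}

// Stage 3: scatter each expanded token to its slot in the expert-sorted order.
__global__ void scatterPermutationSketch(int const* token_selected_experts,
    int64_t const* expert_first_token_offset,
    int* expert_cursor, // [num_experts], zero-initialized scratch
    int* permuted_row_to_unpermuted_row, int num_expanded_tokens)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_expanded_tokens)
    {
        int const expert = token_selected_experts[idx];
        int const slot = atomicAdd(&expert_cursor[expert], 1);
        permuted_row_to_unpermuted_row[expert_first_token_offset[expert] + slot] = idx;
    }
}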
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests. Will also run the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill
Kill all running builds associated with the pull request.
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
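For example, a combined invocation using only the flags documented above (flag compatibility assumed) might look like:
/bot run --disable-fail-fast --stage-list "A10-1"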