
Conversation

@syuoni
Collaborator

@syuoni syuoni commented Jun 24, 2025

[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP

Description

This PR implements custom sort logic that runs before the MoE GEMMs, replacing the original CUB sort invocation.

In a typical large-scale EP workload (EP=32 and per-GPU batch size 128):

  • Before this PR: 5 kernels, ~26.8 us in total

    • buildExpertMapsKernel: 10.3 us
    • CUB sort (three kernels): 11.9 us
    • computeExpertFirstTokenOffsetKernel: 4.6 us
    • In addition, we see significant bubbles between the CUB kernels on B200.
      (profiler timeline screenshot)
  • After this PR: 3 kernels, ~7.0 us in total

    • blockExpertPrefixSumKernel: 2.3 us
    • globalExpertPrefixSumKernel: 2.3 us
    • mergeExpertPrefixSumKernel: 2.4 us
      (profiler timeline screenshot)

A simplified sketch of the underlying counting-sort-style idea is given below.
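To make the approach concrete, here is a minimal, hypothetical counting-sort-style sketch of how a histogram plus prefix sum can replace a radix sort when grouping (token, expert) pairs by expert. This is not the merged implementation: the kernel and buffer names are illustrative (only token_selected_experts and expert_first_token_offset mirror identifiers used in the review discussion below), and the actual blockExpertPrefixSumKernel / globalExpertPrefixSumKernel / mergeExpertPrefixSumKernel split the prefix-sum work differently across three kernels.

#include <cstdint>
#include <cuda_runtime.h>

// (1) Count how many (token, k) pairs selected each expert.
__global__ void histogramExpertCountsKernel(int const* token_selected_experts,
                                            int* expert_counts, int num_tokens, int topk)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_tokens * topk)
    {
        atomicAdd(&expert_counts[token_selected_experts[idx]], 1);
    }
}

// (2) Exclusive prefix sum over the (small) number of experts gives the first-token
//     offset of each expert in the permuted layout. expert_first_token_offset must
//     have num_experts + 1 entries.
__global__ void expertExclusivePrefixSumKernel(int const* expert_counts,
                                               int64_t* expert_first_token_offset, int num_experts)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        int64_t running = 0;
        for (int e = 0; e < num_experts; ++e)
        {
            expert_first_token_offset[e] = running;
            running += expert_counts[e];
        }
        expert_first_token_offset[num_experts] = running;
    }
}

// (3) Scatter each (token, k) pair to the next free slot of its expert. The atomic
//     cursor makes the intra-expert order nondeterministic, which is one reason the
//     real kernels rank elements with prefix sums instead.
__global__ void scatterPermutedIndicesKernel(int const* token_selected_experts,
                                             int64_t const* expert_first_token_offset,
                                             int* expert_cursor, int* permuted_token_ids,
                                             int num_tokens, int topk)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_tokens * topk)
    {
        int expert = token_selected_experts[idx];
        int slot = atomicAdd(&expert_cursor[expert], 1);
        permuted_token_ids[expert_first_token_offset[expert] + slot] = idx;
    }
}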

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
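For example (using only stage names that appear in the examples above; purely illustrative), a combined invocation might look like:

/bot run --stage-list "A10-1" --disable-fail-fast

This would run only the A10-1 test stage without fail-fast; per the note above, --stage-list does not update the GitHub check status.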

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@syuoni syuoni force-pushed the opt-moe-sort branch 2 times, most recently from d003a6b to a9f032d Compare June 26, 2025 06:51
@syuoni syuoni requested review from djns99, dongxuy04, hlu1 and qiaoxj07 June 26, 2025 06:54
@syuoni syuoni self-assigned this Jun 26, 2025
@syuoni syuoni marked this pull request as ready for review June 26, 2025 06:54
@syuoni
Collaborator Author

syuoni commented Jun 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9993 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9993 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7370 completed with status: 'FAILURE'

@syuoni
Collaborator Author

syuoni commented Jun 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #10027 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10027 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7399 completed with status: 'SUCCESS'

@syuoni
Collaborator Author

syuoni commented Jun 26, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #10043 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10043 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #7411 completed with status: 'FAILURE'

djns99
djns99 previously requested changes Jun 26, 2025
Collaborator

@djns99 djns99 left a comment

Thanks for this work! Unfortunately, though, I think we need to get rid of this assumption before we can merge this:

// This allows accommodating 256 experts x 64k tokens; reasonable workload should not exceed this

I also think we should try to be less wasteful with our block sizes. In the worst assumed case above (assuming topk=8), we are launching 16M threads, of which only 256k contribute anything.

@djns99 djns99 Jun 26, 2025
Collaborator

See my comment above about using BlockRadixRank; we can reduce this to only num_tokens*topk threads.

The final permuted idx is:

selected_expert = token_selected_experts[blockIdx.x * blockDim.x + threadIdx.x];
dest_token_id = expert_first_token_offset[selected_expert] + (block_rank[blockIdx.x][threadIdx.x] - block_exclusive_digit_prefix[blockIdx.x][selected_expert]);

Collaborator Author

I don't think I fully understand your comment. If using BlockRadixRank, what is the gridDim and blockDim?

Collaborator

The total number of threads should be num_tokens*topk; we can divide these into blocks however we want. It's an embarrassingly parallel operation in the case of mergeExpertPrefixSumKernel.

Collaborator Author

I see your point. Yes, mergeExpertPrefixSumKernel can be optimized as in my reply above, thanks!
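To make the indexing scheme discussed in this thread concrete, below is a minimal, hypothetical CUDA sketch (not the code that was merged). It does not use cub::BlockRadixRank; a shared-memory atomic stands in for the block-level ranking, and block_expert_offset (each block's starting slot within each expert's segment, assumed to be precomputed by an earlier pass) plays the role that block_rank minus block_exclusive_digit_prefix plays in the snippet above.

#include <cstdint>
#include <cuda_runtime.h>

// Launch with blockDim.x == BLOCK_THREADS and enough blocks to cover num_elements,
// where num_elements == num_tokens * topk.
template <int BLOCK_THREADS, int MAX_EXPERTS>
__global__ void rankAndScatterKernel(int const* token_selected_experts,
                                     int64_t const* expert_first_token_offset,
                                     int64_t const* block_expert_offset, // [gridDim.x][MAX_EXPERTS], precomputed
                                     int* dest_token_ids, int num_elements)
{
    __shared__ int expert_histogram[MAX_EXPERTS];

    int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    bool valid = idx < num_elements;
    int selected_expert = valid ? token_selected_experts[idx] : 0;

    // Zero the per-block expert histogram.
    for (int e = threadIdx.x; e < MAX_EXPERTS; e += BLOCK_THREADS)
    {
        expert_histogram[e] = 0;
    }
    __syncthreads();

    // Rank of this element among same-expert elements within the block. A real
    // implementation would use something like cub::BlockRadixRank here to avoid
    // the atomic and keep the intra-block order deterministic.
    int rank_in_block = valid ? atomicAdd(&expert_histogram[selected_expert], 1) : 0;

    if (valid)
    {
        // Destination = start of this expert's segment
        //             + this block's start within that segment
        //             + rank within this block.
        int64_t dest = expert_first_token_offset[selected_expert]
            + block_expert_offset[blockIdx.x * MAX_EXPERTS + selected_expert] + rank_in_block;
        dest_token_ids[idx] = static_cast<int>(dest);
    }
}

With this layout the grid needs only ceil(num_tokens * topk / BLOCK_THREADS) blocks, i.e. num_tokens * topk threads in total, matching the point made above.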

@syuoni
Collaborator Author

syuoni commented Jun 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #10173 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10173 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7511 completed with status: 'SUCCESS'

@syuoni
Collaborator Author

syuoni commented Jun 28, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

syuoni added 3 commits June 28, 2025 02:04
Signed-off-by: Enwei Zhu <[email protected]>

refactor

Signed-off-by: Enwei Zhu <[email protected]>

integration

Signed-off-by: Enwei Zhu <[email protected]>

fix large workload

Signed-off-by: Enwei Zhu <[email protected]>

fix PDL

Signed-off-by: Enwei Zhu <[email protected]>

fix

Signed-off-by: Enwei Zhu <[email protected]>

fix large workload

Signed-off-by: Enwei Zhu <[email protected]>

clean unused

Signed-off-by: Enwei Zhu <[email protected]>

fix profiler

Signed-off-by: Enwei Zhu <[email protected]>

move reserve from expandInput

Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
syuoni added 6 commits June 28, 2025 02:06
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
@syuoni
Collaborator Author

syuoni commented Jun 28, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #10185 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10185 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7518 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@juney-nvidia
Collaborator

Let's merge this PR to unblock the E2E optimization of large-scale EP and continue the refinements in subsequent PRs.

@juney-nvidia juney-nvidia dismissed djns99’s stale review June 29, 2025 17:02

Hi Daniel,

We need to unblock the large-scale EP E2E performance optimizations, and I also noticed that most of the comments left on this PR have been addressed by Enwei, so for now I will unblock the merge of this PR.
Enwei will work with you to discuss further refinement of the related logic.

Thanks
June

@juney-nvidia juney-nvidia merged commit b4dab23 into NVIDIA:main Jun 29, 2025
3 checks passed
ameynaik-hub pushed a commit to ameynaik-hub/TensorRT-LLM that referenced this pull request Jun 30, 2025
syuoni added a commit to syuoni/TensorRT-LLM that referenced this pull request Jul 1, 2025
Shunkangz pushed a commit to Shunkangz/TensorRT-LLM that referenced this pull request Jul 2, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
nvzhihanj pushed a commit to nvzhihanj/TensorRT-LLM that referenced this pull request Jul 17, 2025
nvzhihanj pushed a commit to nvzhihanj/TensorRT-LLM that referenced this pull request Jul 26, 2025
@syuoni syuoni deleted the opt-moe-sort branch July 31, 2025 03:28