refactoring: port customized kernels with public cutlass version #5027

yunruis · 2025-06-09T08:13:43Z

pytest -s tests/unittest/_torch/modules/test_fused_moe.py

# refactoring: port customized kernels with public cutlass version

This PR open-sources several internal Cutlass kernels. To preserve stability and keep a more optimized performance option available, the previous method of calling these kernels through static libraries is retained as an alternative.
This document describes how to use the newly open-sourced Cutlass kernels and how to switch back to the internal Cutlass kernels provided by the static libraries.

Compilation
The open-sourced Cutlass kernels are low_latency_gemm, moe_gemm, fp4_gemm and allreduce_gemm. The USING_OSS_CUTLASS_* macros switch between the open-sourced Cutlass kernels and the static-library Cutlass kernels, giving kernel-level control. By default, the open-sourced Cutlass kernels are used. For example:

python3 ./scripts/build_wheel.py --skip_building_wheel --linking_install_binary --use_ccache  --cuda_architectures "90-real;100-real" --python_bindings --install --micro_benchmarks

This uses the open-sourced Cutlass kernels.

If users prefer the internal Cutlass kernels from the static library, they can select them at compile time by setting the corresponding USING_OSS_CUTLASS_* macro to OFF. For instance, to use the static-library implementation of low_latency_gemm and fused_moe_gemm, use the following build command:

python3 ./scripts/build_wheel.py --skip_building_wheel --linking_install_binary --use_ccache  --cuda_architectures "90-real;100-real" -D "USING_OSS_CUTLASS_MOE_GEMM=OFF;USING_OSS_CUTLASS_LOW_LATENCY_GEMM=OFF" --python_bindings --install --micro_benchmarks
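For scripted builds, the per-kernel switch can be assembled programmatically. The following is only a sketch: the helper name `build_command` and the `BASE_FLAGS` list are hypothetical, the flags are copied from the commands above, and only the two macro suffixes named in this description (MOE_GEMM and LOW_LATENCY_GEMM) are known to exist. The exact macro names for fp4_gemm and allreduce_gemm are not stated here, so they are not assumed.

```python
import shlex

# Flags copied verbatim from the build commands in this description.
BASE_FLAGS = [
    "--skip_building_wheel",
    "--linking_install_binary",
    "--use_ccache",
    "--python_bindings",
    "--install",
    "--micro_benchmarks",
]


def build_command(static_lib_kernels=(), cuda_architectures="90-real;100-real"):
    """Return the build_wheel.py invocation as an argument list.

    static_lib_kernels: macro suffixes (e.g. "MOE_GEMM") whose
    USING_OSS_CUTLASS_<suffix> macro should be set to OFF, which selects
    the static-library kernel instead of the open-sourced one.
    """
    cmd = ["python3", "./scripts/build_wheel.py", *BASE_FLAGS,
           "--cuda_architectures", cuda_architectures]
    if static_lib_kernels:
        # Multiple definitions are joined with ';' inside a single -D value,
        # matching the command shown above.
        defines = ";".join(
            f"USING_OSS_CUTLASS_{name}=OFF" for name in static_lib_kernels
        )
        cmd += ["-D", defines]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(build_command(["MOE_GEMM", "LOW_LATENCY_GEMM"])))
```

Calling `build_command()` with no arguments omits `-D` entirely, which corresponds to the default of using the open-sourced kernels.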

yunruis · 2025-06-10T09:45:20Z

/bot run

tensorrt-cicd · 2025-06-10T09:50:47Z

PR_Github #8271 [ run ] triggered by Bot

… kernels Signed-off-by: yunruis <[email protected]> moe_gemm passed Signed-off-by: yunruis <[email protected]> fix license bug Signed-off-by: yunruis <[email protected]> waive debug mode Signed-off-by: yunruis <[email protected]> fix debug mode compile bug Signed-off-by: yunruis <[email protected]> open source GEMM+AR kernels contains blackwell fixes support all reduce_gemm cutlass kernel Signed-off-by: yunruis <[email protected]> fix bug Signed-off-by: yunruis <[email protected]> fix credential symbol Signed-off-by: yunruis <[email protected]> drop credential symbol Signed-off-by: yunruis <[email protected]> add debug info and test ok Signed-off-by: yunruis <[email protected]> fix loraparams namespace bug Signed-off-by: yunruis <[email protected]> fix rebase bug Signed-off-by: yunruis <[email protected]> fix moe gemm bug on sm90 Signed-off-by: yunruis <[email protected]> fix low_latency_gemm internal error Signed-off-by: yunruis <[email protected]>

…upport FP8xMXFP4. And add open-sourced moe_gemm micro-benchmark and unittest Signed-off-by: yunruis <[email protected]>

tensorrt-cicd · 2025-06-10T10:29:22Z

PR_Github #8271 [ run ] completed with state ABORTED

yunruis · 2025-06-10T10:33:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-10T10:39:12Z

PR_Github #8280 [ run ] triggered by Bot

Signed-off-by: yunruis <[email protected]>

yunruis · 2025-06-10T10:45:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-10T10:50:53Z

PR_Github #8282 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-10T10:50:56Z

PR_Github #8280 [ run ] completed with state ABORTED

juney-nvidia

Thanks for the hard work, Yunrui.
Let's merge this PR ASAP to unblock the dependency and keep refining it in the subsequent PRs.

Signed-off-by: yunruis <[email protected]>

yunruis · 2025-06-10T15:36:34Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-10T15:43:22Z

PR_Github #8317 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-10T15:43:26Z

PR_Github #8282 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #5995 completed with status: 'FAILURE'

Signed-off-by: yunruis <[email protected]>

tensorrt-cicd · 2025-06-10T15:58:40Z

PR_Github #8317 [ run ] completed with state ABORTED

yunruis · 2025-06-10T15:59:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-10T16:04:45Z

PR_Github #8324 [ run ] triggered by Bot

yunruis · 2025-06-10T16:04:47Z

/bot kill

tensorrt-cicd · 2025-06-10T16:06:01Z

PR_Github #8324 [ run ] completed with state ABORTED

Signed-off-by: yunruis <[email protected]>

yunruis · 2025-06-12T03:33:25Z

/bot run

tensorrt-cicd · 2025-06-12T03:39:31Z

PR_Github #8594 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-13T03:40:22Z

PR_Github #8594 [ run ] completed with state ABORTED

juney-nvidia · 2025-06-13T03:42:44Z

/bot help

github-actions · 2025-06-13T03:42:51Z

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
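As an illustration of how the `run` options in the help text combine into a PR comment, here is a small sketch. The function `bot_run_comment` is hypothetical and not part of the bot; it only formats a comment string from three of the options listed in the help text (`--disable-fail-fast`, `--stage-list`, `--gpu-type`) and does not talk to Jenkins or GitHub.

```python
def bot_run_comment(disable_fail_fast=False, stage_list=None, gpu_type=None):
    """Format a '/bot run' comment from a subset of the documented options.

    stage_list and gpu_type are lists of names; the help text shows them
    as one quoted, comma-separated string, so they are joined that way.
    """
    parts = ["/bot", "run"]
    if disable_fail_fast:
        parts.append("--disable-fail-fast")
    if stage_list:
        parts.append('--stage-list "{}"'.format(", ".join(stage_list)))
    if gpu_type:
        parts.append('--gpu-type "{}"'.format(", ".join(gpu_type)))
    return " ".join(parts)


if __name__ == "__main__":
    # Restrict the run to a single test stage.
    print(bot_run_comment(stage_list=["DGX_H100-4_GPUs-PyTorch-Others-1"]))
```

Per the help text, runs restricted with `--stage-list` or `--gpu-type` do not update the GitHub check status, so a full run or an explicit `skip` may still be needed before merging.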

juney-nvidia · 2025-06-13T03:44:06Z

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

tensorrt-cicd · 2025-06-13T03:49:37Z

PR_Github #8730 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-13T07:56:27Z

PR_Github #8730 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6333 (Partly Tested) completed with status: 'SUCCESS'

ZhanruiSunCh · 2025-06-13T08:04:55Z

/bot skip --comment "PR_Github #8730 and PR_Github #8594 run a full pre-merge CI Pipeline with multi GPU test"

tensorrt-cicd · 2025-06-13T08:10:24Z

PR_Github #8768 [ skip ] triggered by Bot

tensorrt-cicd · 2025-06-13T08:23:33Z

PR_Github #8768 [ skip ] completed with state SUCCESS
Skipping testing for commit 758490d

Signed-off-by: Tailing Yuan <[email protected]>

yunruis requested a review from a team as a code owner June 9, 2025 08:13

yunruis requested review from dongxuy04 and yuxianq June 9, 2025 08:13

juney-nvidia changed the title ~~User/yunruis/opened internal cutlass rebased~~ refactoring: port customized kernels with public cutlass version Jun 9, 2025

yunruis force-pushed the user/yunruis/opened_internal_cutlass_rebased branch from fd44de3 to 7bcd459 Compare June 10, 2025 09:38

yunruis added 2 commits June 10, 2025 03:22

adapt open-sourced moe gemm to previous version, while internal-lib s…

479b0b7

…upport FP8xMXFP4. And add open-sourced moe_gemm micro-benchmark and unittest Signed-off-by: yunruis <[email protected]>

yunruis force-pushed the user/yunruis/opened_internal_cutlass_rebased branch from 7bcd459 to d0da265 Compare June 10, 2025 10:27

fix internal-cutlass switch bug

6d27e20

Signed-off-by: yunruis <[email protected]>

yunruis force-pushed the user/yunruis/opened_internal_cutlass_rebased branch from d0da265 to 6d27e20 Compare June 10, 2025 10:41

juney-nvidia approved these changes Jun 10, 2025

delete debug info and change marco to publci

503540f

Signed-off-by: yunruis <[email protected]>

fix historical license bug

06f28ae

Signed-off-by: yunruis <[email protected]>

yunruis force-pushed the user/yunruis/opened_internal_cutlass_rebased branch from 0bcd83d to 06f28ae Compare June 10, 2025 15:58

yunruis requested review from a team as code owners June 12, 2025 03:26

yunruis requested review from suyoggupta and HuiGao-NV June 12, 2025 03:26

set LO BUILD_JOB 8->4 to fix oom error, will revert after CI

53fb6b6

Signed-off-by: yunruis <[email protected]>

yunruis force-pushed the user/yunruis/opened_internal_cutlass_rebased branch from 79ec409 to 53fb6b6 Compare June 12, 2025 03:30

Merge branch 'main' into user/yunruis/opened_internal_cutlass_rebased

758490d

juney-nvidia enabled auto-merge (squash) June 12, 2025 12:46

chzblych reviewed Jun 12, 2025

jenkins/Build.groovy

ZhanruiSunCh disabled auto-merge June 13, 2025 08:06

juney-nvidia approved these changes Jun 13, 2025

juney-nvidia merged commit 30c5b41 into NVIDIA:main Jun 13, 2025
2 of 3 checks passed

yuantailing mentioned this pull request Jun 14, 2025

feat: large-scale EP(part 7: DeepEP integration) #4792

Merged

yuantailing mentioned this pull request Jun 14, 2025

Fix: Double build time limit since #5027 halfs NUM_JOBS #5212

Closed

yuantailing added a commit to yuantailing/TensorRT-LLM that referenced this pull request Jun 14, 2025

Double build time since NVIDIA#5027 half NUM_JOBS

93e7641

Signed-off-by: Tailing Yuan <[email protected]>

yunruis deleted the user/yunruis/opened_internal_cutlass_rebased branch June 27, 2025 02:44

yunruis restored the user/yunruis/opened_internal_cutlass_rebased branch June 27, 2025 02:44

yunruis deleted the user/yunruis/opened_internal_cutlass_rebased branch August 15, 2025 08:25
