[Kernels][DP/EP] Optimize Silu Kernel for R1 #24054
Conversation
Code Review
This pull request introduces a new CUDA kernel for silu_mul_fp8_quant to replace the existing Triton implementation, aiming for better performance. The changes include the new CUDA kernel, its C++ bindings, and updates to tests and benchmarks to compare against the old implementation, which is preserved as a baseline. While the overall approach is sound, the review identified several critical correctness issues in the new CUDA kernel related to parameter handling, as well as high-severity maintainability problems such as dead code, code duplication, and style violations. These issues should be addressed to ensure the correctness and long-term health of the codebase.
```cpp
    // quant params
    float fp8_min, float fp8_max) {
  static constexpr float EPS = 1e-10;
```
The kernel uses a hardcoded EPS value, ignoring the eps parameter passed to the host function silu_mul_fp8_quant_deep_gemm_cuda. This is a correctness bug. The eps value should be passed to this kernel as an argument and used instead of the hardcoded constant.
This will require changes in:
- The kernel signature to accept eps.
- The kernel body to use the eps argument (e.g., `float y_max = eps;` on line 265).
- The host launcher silu_mul_fp8_quant_deep_gemm_cuda to pass eps to the kernel (see the sketch below).
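A minimal, self-contained sketch of how eps could be threaded from the host launcher into the kernel. Names and the per-element quantization are illustrative only; the PR's actual kernel computes a per-group scale and has a larger signature:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Illustrative kernel: eps arrives as an argument instead of a
// hardcoded `static constexpr float EPS = 1e-10;`.
__global__ void silu_mul_quant_sketch(const float* gate, const float* up,
                                      float* out, int n, float eps,
                                      float fp8_min, float fp8_max) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float g = gate[i];
  float y = (g / (1.0f + expf(-g))) * up[i];  // SiLU(gate) * up
  // eps guards the scale against division by ~0; the real kernel does
  // this once per 128-element quant group rather than per element.
  float y_max = fmaxf(fabsf(y), eps);
  out[i] = fminf(fmaxf(y * (fp8_max / y_max), fp8_min), fp8_max);
}

// Illustrative host launcher: forwards the eps it already receives.
void silu_mul_quant_sketch_cuda(const float* gate, const float* up,
                                float* out, int n, float eps,
                                float fp8_min, float fp8_max) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  silu_mul_quant_sketch<<<blocks, threads>>>(gate, up, out, n, eps,
                                             fp8_min, fp8_max);
}
```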
@LucasWilkinson Is this OK?
Maybe we should remove eps/fp8_max/fp8_min from the Python API?
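If those parameters were dropped from the Python API, the host side could fix them in C++ instead. A hypothetical sketch (448 is the finite maximum of float8_e4m3fn, the FP8 format used on the DeepGEMM path; the constant names are made up):

```cuda
// Hypothetical constants if eps/fp8_min/fp8_max leave the Python API.
constexpr float kEps = 1e-10f;     // matches the kernel's current EPS
constexpr float kFp8Max = 448.0f;  // finite max of float8_e4m3fn
constexpr float kFp8Min = -448.0f;
```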
```
(APIServer pid=1) (EngineCore_0 pid=283) RuntimeError: Worker failed with error ''Keyword argument NUM_WARPS was specified but unrecognised'', please check the stack trace above for the root cause
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     a2q, a2q_scale = silu_mul_fp8_quant_deep_gemm(workspace1,
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]   File "/opt/vllm-source/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 206, in silu_mul_fp8_quant_deep_gemm
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     _silu_mul_fp8_quant_deep_gemm[grid](
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]   File "/opt/vllm/lib64/python3.12/site-packages/triton/runtime/jit.py", line 390, in <lambda>
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

It looks like the Triton kernel is still hooked up to the actual layer.
Made a branch on the nm repo that hooks up the kernel: 376129e. I still have an IMA: `RuntimeError: CUDA driver error: an illegal memory access was encountered`. I'm trying to build it locally.
Force-pushed from 386a373 to f7820e5
Force-pushed from 1a43afc to 7d9ad60
@elvircrn would a coredump help? https://blog.vllm.ai/2025/08/11/cuda-debugging.html @robertgshaw2-redhat would you be able to enable these environment variables?
@LucasWilkinson Yes, a coredump would help, thanks!
@robertgshaw2-redhat @LucasWilkinson Can H ever not be divisible by 128?
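If it cannot be guaranteed, an explicit host-side precondition would fail fast instead of risking an out-of-bounds access. A hedged sketch using PyTorch's standard TORCH_CHECK macro (the function name is illustrative, not from the PR):

```cuda
#include <torch/extension.h>

// Illustrative guard for the documented precondition (GROUP_SIZE = 128).
void check_silu_mul_fp8_quant_input(const torch::Tensor& y) {
  TORCH_CHECK(y.dim() == 3, "expected y of shape (E, T, 2H)");
  const int64_t two_h = y.size(2);
  TORCH_CHECK(two_h % 2 == 0, "last dim of y must be 2*H");
  TORCH_CHECK((two_h / 2) % 128 == 0, "H must be divisible by 128");
}
```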
Force-pushed from b9f6303 to 49aab19
/gemini-review
I've received your request for a new review. Please note that this is a command for the system to initiate a new review, and I will not be providing a conversational response here. The review process will begin shortly.
Used this for reference: a5b84f1
Sounds good - let's wait for distributed-tests-2-gpus and entrypoints-integration-test-api-server to finish, and then we can request a force merge.
@tlrmchlsmth @LucasWilkinson CI is done.
Purpose
The purpose of this PR is to replace the Triton SiLU implementation with a faster CUDA version.
Here's a benchmark slice:
This was achieved by launching additional CUDA blocks and parallelizing over the T dimension. As a result, some NOOP threads are launched, and the parallelization factor becomes an additional tunable parameter (see the sketch after the graphs below).
To understand the impact of the parallelization factor, see the following graphs for E ≤ 9:
For E=32, we have:
A parallelization factor of 16 works well for most configurations, so it was chosen as the default.
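A minimal sketch of the launch strategy described above; the names, grid mapping, and strided token loop are illustrative rather than the PR's exact code:

```cuda
#include <cuda_runtime.h>

// Split the T (token) dimension across kTParallel extra blocks per expert.
// Blocks that land past an expert's valid token count do no work, so the
// factor trades wasted (NOOP) launches for better occupancy.
constexpr int kTParallel = 16;  // default factor chosen from the sweeps above

__global__ void silu_mul_quant_grid_sketch(const int* tokens_per_expert) {
  const int expert = blockIdx.y;
  const int t_slice = blockIdx.x;  // which of the kTParallel T-slices
  const int valid_t = tokens_per_expert[expert];
  // This block owns tokens t_slice, t_slice + kTParallel, ...
  for (int t = t_slice; t < valid_t; t += kTParallel) {
    // ... SiLU-mul + FP8 quantization for token t of this expert ...
  }
  // If valid_t <= t_slice, the loop body never runs: a NOOP block.
}

void launch_grid_sketch(int num_experts, const int* tokens_per_expert) {
  dim3 grid(kTParallel, num_experts);
  dim3 block(128);  // e.g. threads cooperate over the hidden dimension
  silu_mul_quant_grid_sketch<<<grid, block>>>(tokens_per_expert);
}
```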
Test Plan
Given y of shape (E, T, 2H) as input, the function is expected to work for GROUP_SIZE=128 and all H divisible by 128.
Test Result
From main: