[Kernels][DP/EP] Optimize Silu Kernel for R1 #24054
Conversation
Code Review
This pull request introduces a new CUDA kernel for silu_mul_fp8_quant to replace the existing Triton implementation, aiming for better performance. The changes include the new CUDA kernel, its C++ bindings, and updates to tests and benchmarks to compare against the old implementation, which is preserved as a baseline. While the overall approach is sound, the review identified several critical correctness issues in the new CUDA kernel related to parameter handling, as well as high-severity maintainability problems such as dead code, code duplication, and style violations. These issues should be addressed to ensure the correctness and long-term health of the codebase.
```cpp
    // quant params
    float fp8_min, float fp8_max) {
  static constexpr float EPS = 1e-10;
```
The kernel uses a hardcoded EPS value, ignoring the eps parameter passed to the host function silu_mul_fp8_quant_deep_gemm_cuda. This is a correctness bug. The eps value should be passed to this kernel as an argument and used instead of the hardcoded constant.
This will require changes in:
- The kernel signature to accept eps.
- The kernel body to use the eps argument (e.g., `float y_max = eps;` on line 265).
- The host launcher silu_mul_fp8_quant_deep_gemm_cuda to pass eps to the kernel (see the sketch below).
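A minimal, self-contained sketch of how eps could be threaded from the host launcher into the kernel. Names and the per-element quantization are illustrative only; the PR's actual kernel computes a per-group scale and has a larger signature:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Illustrative kernel: eps arrives as an argument instead of a
// hardcoded `static constexpr float EPS = 1e-10;`.
__global__ void silu_mul_quant_sketch(const float* gate, const float* up,
                                      float* out, int n, float eps,
                                      float fp8_min, float fp8_max) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float g = gate[i];
  float y = (g / (1.0f + expf(-g))) * up[i];  // SiLU(gate) * up
  // eps guards the scale against division by ~0; the real kernel does
  // this once per 128-element quant group rather than per element.
  float y_max = fmaxf(fabsf(y), eps);
  out[i] = fminf(fmaxf(y * (fp8_max / y_max), fp8_min), fp8_max);
}

// Illustrative host launcher: forwards the eps it already receives.
void silu_mul_quant_sketch_cuda(const float* gate, const float* up,
                                float* out, int n, float eps,
                                float fp8_min, float fp8_max) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  silu_mul_quant_sketch<<<blocks, threads>>>(gate, up, out, n, eps,
                                             fp8_min, fp8_max);
}
```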
@LucasWilkinson Is this OK?
Maybe we should remove eps/fp8_max/fp8_min from the Python API?
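If those parameters were dropped from the Python API, the host side could fix them in C++ instead. A hypothetical sketch (448 is the finite maximum of float8_e4m3fn, the FP8 format used on the DeepGEMM path; the constant names are made up):

```cuda
// Hypothetical constants if eps/fp8_min/fp8_max leave the Python API.
constexpr float kEps = 1e-10f;     // matches the kernel's current EPS
constexpr float kFp8Max = 448.0f;  // finite max of float8_e4m3fn
constexpr float kFp8Min = -448.0f;
```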
```
(APIServer pid=1) (EngineCore_0 pid=283) RuntimeError: Worker failed with error ''Keyword argument NUM_WARPS was specified but unrecognised'', please check the stack trace above for the root cause
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     a2q, a2q_scale = silu_mul_fp8_quant_deep_gemm(workspace1,
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]   File "/opt/vllm-source/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 206, in silu_mul_fp8_quant_deep_gemm
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     _silu_mul_fp8_quant_deep_gemm[grid](
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]   File "/opt/vllm/lib64/python3.12/site-packages/triton/runtime/jit.py", line 390, in <lambda>
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

It looks like the Triton kernel is still hooked up to the actual layer.
Made a branch on the nm repo that hooks up the kernel: 376129e. I still have an IMA: `RuntimeError: CUDA driver error: an illegal memory access was encountered`. I'm trying to build it locally.
Force-pushed from 386a373 to f7820e5
Force-pushed from 1a43afc to 7d9ad60
@elvircrn would a coredump help? https://blog.vllm.ai/2025/08/11/cuda-debugging.html @robertgshaw2-redhat would you be able to enable these environment variables?
@LucasWilkinson Yes, a coredump would help, thanks!
@robertgshaw2-redhat @LucasWilkinson Can H ever not be divisible by 128?
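If it cannot be guaranteed, an explicit host-side precondition would fail fast instead of risking an out-of-bounds access. A hedged sketch using PyTorch's standard TORCH_CHECK macro (the function name is illustrative, not from the PR):

```cuda
#include <torch/extension.h>

// Illustrative guard for the documented precondition (GROUP_SIZE = 128).
void check_silu_mul_fp8_quant_input(const torch::Tensor& y) {
  TORCH_CHECK(y.dim() == 3, "expected y of shape (E, T, 2H)");
  const int64_t two_h = y.size(2);
  TORCH_CHECK(two_h % 2 == 0, "last dim of y must be 2*H");
  TORCH_CHECK((two_h / 2) % 128 == 0, "H must be divisible by 128");
}
```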
Force-pushed from b9f6303 to 49aab19
/gemini-review
I've received your request for a new review. Please note that this is a command for the system to initiate a new review, and I will not be providing a conversational response here. The review process will begin shortly.
Used this for reference: a5b84f1
Sounds good - let's wait for distributed-tests-2-gpus and entrypoints-integration-test-api-server to finish, and then we can request a force merge.
@tlrmchlsmth @LucasWilkinson CI is done.
Purpose
The purpose of this PR is to replace the Triton SiLU implementation with a faster CUDA version.
Here's a benchmark slice:
This was achieved by launching additional CUDA blocks and parallelizing over the T dimension. As a result, some NOOP threads are launched, and the parallelization factor becomes an additional tunable parameter (see the sketch after the graphs below).
To understand the impact of the parallelization factor, see the following graphs for E ≤ 9:
For E=32, we have:
A parallelization factor of 16 works well for most configurations, so it was chosen as the default.
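A minimal sketch of the launch strategy described above; the names, grid mapping, and strided token loop are illustrative rather than the PR's exact code:

```cuda
#include <cuda_runtime.h>

// Split the T (token) dimension across kTParallel extra blocks per expert.
// Blocks that land past an expert's valid token count do no work, so the
// factor trades wasted (NOOP) launches for better occupancy.
constexpr int kTParallel = 16;  // default factor chosen from the sweeps above

__global__ void silu_mul_quant_grid_sketch(const int* tokens_per_expert) {
  const int expert = blockIdx.y;
  const int t_slice = blockIdx.x;  // which of the kTParallel T-slices
  const int valid_t = tokens_per_expert[expert];
  // This block owns tokens t_slice, t_slice + kTParallel, ...
  for (int t = t_slice; t < valid_t; t += kTParallel) {
    // ... SiLU-mul + FP8 quantization for token t of this expert ...
  }
  // If valid_t <= t_slice, the loop body never runs: a NOOP block.
}

void launch_grid_sketch(int num_experts, const int* tokens_per_expert) {
  dim3 grid(kTParallel, num_experts);
  dim3 block(128);  // e.g. threads cooperate over the hidden dimension
  silu_mul_quant_grid_sketch<<<grid, block>>>(tokens_per_expert);
}
```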
Test Plan
Given y of shape (E, T, 2H) as input, the function is expected to work for GROUP_SIZE=128 and all H divisible by 128.
Test Result
From main: