[ROCm][torch.compile] Adding ROCm-specific fusion pass for integrating aiter act/rms MXFP4 operators #25860
Conversation
Code Review
This pull request introduces fusion passes for ROCm to optimize `silu_mul` operations with mxfp4 quantization. The changes include adding a new `ROCmFusionPass`, refactoring `gemm_with_dynamic_quant` to be compatible with `torch.compile`, and implementing new fused custom ops. The overall approach is sound and follows existing patterns for fusion passes in vLLM. I've found one issue related to incorrect type hints in a fake implementation function, which could cause problems with `torch.compile`.
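For readers unfamiliar with how such passes hook into `torch.compile`: vLLM's fusion passes are built on the inductor pattern matcher, which rewrites the captured FX graph by replacing a matched subgraph with a fused op. Below is a minimal, hypothetical sketch of that mechanism on recent PyTorch; the pattern, the float8 stand-in for MXFP4 quantization, and the example shapes are illustrative assumptions, not the code from this PR.

```python
# Minimal sketch, assuming a silu_mul + dynamic-quant subgraph; not the PR's code.
import torch
import torch.nn.functional as F
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                             register_replacement)

patterns = PatternMatcherPass()


def silu_mul_quant_pattern(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Unfused form as it appears in the captured graph; float8 stands in for
    # the MXFP4 dynamic quantization, which has no stock torch dtype.
    return (F.silu(x) * y).to(torch.float8_e4m3fn)


def silu_mul_quant_replacement(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The real pass would call an aiter-backed fused custom op here; the same
    # math is repeated so this sketch stays self-contained and runnable.
    return (F.silu(x) * y).to(torch.float8_e4m3fn)


# Register the rewrite: any occurrence of the pattern in the FX graph is
# replaced by the replacement function's graph.
register_replacement(
    silu_mul_quant_pattern,
    silu_mul_quant_replacement,
    [torch.empty(8, 16), torch.empty(8, 16)],  # example inputs used for tracing
    fwd_only,
    patterns,
)
```

In the PR itself, the replacement would call the ROCm aiter fused kernel rather than re-expressing the same ops, and the resulting pass would be wired into vLLM's existing inductor-pass infrastructure.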
```python
def gemm_with_dynamic_quant_fake(
    result: torch.Tensor,
    x: torch.Tensor,
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
    x_scales: torch.Tensor = None,
    rocm_use_aiter_fp4_asm_gemm: bool = False,
    out_dtype: Optional[torch.dtype] = torch.bfloat16,
) -> torch.Tensor:
    return torch.empty((*x.shape[:-1], weight.shape[0]),
                       dtype=out_dtype,
                       device=x.device)
    return
```
The type hints for `gemm_with_dynamic_quant_fake` are incorrect and do not match the function's implementation or its real counterpart `gemm_with_dynamic_quant`.

- The `x_scales` parameter is type-hinted as `torch.Tensor` but has a default value of `None`. The type hint should be `Optional[torch.Tensor]` to reflect this.
- The function is type-hinted to return `torch.Tensor`, but it implicitly returns `None`. The return type hint should be `None`.

These inconsistencies can cause issues with static type checkers and with `torch.compile`'s fake tensor propagation, which relies on correct function signatures for its analysis.
Suggested change:

```diff
 def gemm_with_dynamic_quant_fake(
     result: torch.Tensor,
     x: torch.Tensor,
     weight: torch.Tensor,
     weight_scale: torch.Tensor,
-    x_scales: torch.Tensor = None,
+    x_scales: Optional[torch.Tensor] = None,
     rocm_use_aiter_fp4_asm_gemm: bool = False,
     out_dtype: Optional[torch.dtype] = torch.bfloat16,
-) -> torch.Tensor:
-    return torch.empty((*x.shape[:-1], weight.shape[0]),
-                       dtype=out_dtype,
-                       device=x.device)
+) -> None:
     return
```
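As background on why the fake signature matters: for a custom op that mutates a preallocated `result` argument, the fake implementation should allocate nothing and return `None`; `torch.compile` only uses the fake for shape/dtype propagation, and the output metadata is already carried by `result`. Here is a minimal sketch using the stock `torch.library` API (PyTorch 2.4+); the op name `demo::gemm_with_dynamic_quant` and the plain-matmul body are made up for illustration, and vLLM registers its ops through its own `direct_register_custom_op` helper rather than this decorator.

```python
import torch


# Sketch only: a mutating custom op whose fake impl returns None.
@torch.library.custom_op("demo::gemm_with_dynamic_quant", mutates_args=("result",))
def gemm_with_dynamic_quant(result: torch.Tensor, x: torch.Tensor,
                            weight: torch.Tensor) -> None:
    # A real kernel would dynamically quantize `x` and call the aiter GEMM;
    # here we just write a plain matmul into the preallocated output.
    result.copy_(x @ weight.t())


@gemm_with_dynamic_quant.register_fake
def _(result: torch.Tensor, x: torch.Tensor, weight: torch.Tensor) -> None:
    # Nothing to allocate: `result` already has the right shape and dtype,
    # so the fake simply returns None, matching the real op's signature.
    return None


# Usage under torch.compile: the compiler traces through the fake impl.
@torch.compile
def run(out: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    torch.ops.demo.gemm_with_dynamic_quant(out, x, w)
    return out


out = run(torch.empty(4, 16), torch.randn(4, 8), torch.randn(16, 8))
print(out.shape)  # torch.Size([4, 16])
```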
This pull request has merge conflicts that must be resolved before it can be merged.
Instead of creating a new pass, can we add these patterns to the existing passes? Also, please wait for #24604, which will add better pattern-matching utilities.
This PR adds a few fusion passes for `silu_mul_quant_mxfp4` and `add_rmsnorm_quant_mxfp4`, giving roughly a 2% perf gain.
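For context, the unfused computations these two patterns target look roughly like the pure-PyTorch reference below. The MXFP4 dynamic-quantization step that follows each of them is elided because it is aiter-specific, and the function names here are descriptive rather than taken from the PR.

```python
import torch
import torch.nn.functional as F


def silu_mul(x: torch.Tensor) -> torch.Tensor:
    """SiLU on the gate half multiplied by the up half of the MLP activation."""
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]


def add_rmsnorm(x: torch.Tensor, residual: torch.Tensor,
                weight: torch.Tensor, eps: float = 1e-6):
    """Residual add followed by RMSNorm, as used between decoder layers."""
    x = x + residual
    variance = x.pow(2).mean(-1, keepdim=True)
    normed = x * torch.rsqrt(variance + eps) * weight
    return normed, x  # (normalized output, updated residual)
```

Each fusion pass matches one of these subgraphs followed by the dynamic MXFP4 quantization of its output and replaces the pair with a single fused aiter kernel call.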