[Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs #16198
Conversation
Co-authored-by: Hongxia Yang <[email protected]> Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
cc @houseroad
Signed-off-by: tjtanaa <[email protected]>
@simon-mo @houseroad @SageMoore Can you help merge this? It will unblock the aiter integration for performance improvements. Thanks!
# with torch.compile Dynamo.
# V1 Engine on ROCm with eager mode is fine.
# V0 Engine on ROCm with HIPGraph is fine.
topk_weights = topk_weights.view(-1).reshape(topk_weights.shape)
Is it possible that this is related to Inductor always putting matrices in row-major order? And we should add a modifier to the custom op?
See comment in torch_bindings.cpp:
// The default behavior in PyTorch 2.6 is "requires_contiguous", so we need
// to override this for many GEMMs with the following tag. Otherwise,
// torch.compile will force all input tensors to be contiguous(), which
// will break many custom ops that require column-major weight matrices.
// TODO: remove this for PyTorch 2.8, when the default is planned to switch
// to match exact eager-mode strides.
at::Tag stride_tag = at::Tag::needs_fixed_stride_order;
I think the issue might not be related to Inductor, as it does not happen on CUDA: topk_weights.stride() returns (1, 1) on CUDA but (1, 1024) on ROCm.
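For context, here is a minimal repro sketch of the stride pattern described above. The shapes are made up for illustration; only the (1, 1024) stride mirrors what was observed on ROCm.

import torch

# Hypothetical repro: a (num_tokens, top_k=1) tensor whose size-1 last dim
# keeps a large leftover stride from a wider parent buffer.
parent = torch.randn(4, 1024)
topk_weights = torch.as_strided(parent, size=(4, 1), stride=(1, 1024))

print(topk_weights.stride())         # (1, 1024) -- leftover stride in the size-1 dim
print(topk_weights.is_contiguous())  # True: PyTorch ignores strides of size-1 dims

# The workaround in this PR flattens and reshapes, which materializes
# canonical strides before the tensor is handed to the Triton kernel.
fixed = topk_weights.view(-1).reshape(topk_weights.shape)
print(fixed.stride())                # (1, 1)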
Can we create an issue to track this hack if @ProExpertProg's suggestion doesn't work?
There is already an issue tracking this: ROCm/pytorch#2020
@ProExpertProg @houseroad
The topk_weights is generated by Llama4MoE.custom_routing_function, which is just a series of native PyTorch operators, so no custom ops are involved in generating topk_weights.
vllm/vllm/model_executor/models/llama4.py, line 44 in 027b204:
class Llama4MoE(nn.Module):
@staticmethod
def custom_routing_function(
hidden_states: torch.Tensor,
gating_output: torch.Tensor,
topk: int,
renormalize: bool,
) -> Tuple[torch.Tensor, torch.Tensor]:
router_scores, router_indices = torch.topk(gating_output, topk, dim=-1)
router_scores = torch.sigmoid(router_scores.float()).to(
hidden_states.dtype)
return (router_scores, router_indices.to(torch.int32))
The router_scores returned here is the topk_weights input to fused_moe.
It's not about the custom op generating the tensor but about consuming it: the wna16 op consumes this tensor and it might get transposed, unless that op is marked with the tag.
@ProExpertProg
Thank you for the leads. It seems there is a way to add the tag through the PyTorch Python API as well. We have exposed the tags interface through direct_register_custom_op in vllm/utils.py, which is a function proposed by Kaichao to register custom ops that are not traceable by torch.compile. Adding tags=(torch.Tag.needs_fixed_stride_order,) does resolve the issue.
direct_register_custom_op(
op_name="inplace_fused_experts",
op_func=inplace_fused_experts,
mutates_args=["hidden_states"],
fake_impl=inplace_fused_experts_fake,
+ tags=(torch.Tag.needs_fixed_stride_order,),
)
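For reference, a standalone sketch of how such a tag can be attached at registration time using the plain PyTorch Python API. This is not vLLM's actual direct_register_custom_op; the library and op names are hypothetical, and it assumes a PyTorch version in which torch.library.Library.define accepts a tags keyword.

import torch
from torch.library import Library

# Hypothetical library for illustration only.
my_lib = Library("my_moe_lib", "FRAGMENT")

# Tagging the schema tells torch.compile to keep the eager-mode strides of
# the inputs instead of forcing them to be contiguous().
my_lib.define(
    "scale_rows(Tensor x, Tensor weights) -> Tensor",
    tags=(torch.Tag.needs_fixed_stride_order,),
)

def scale_rows_impl(x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # Toy body standing in for a kernel that is sensitive to input strides.
    return x * weights

my_lib.impl("scale_rows", scale_rows_impl, "CompositeExplicitAutograd")

# The tag is visible on the registered op.
assert torch.Tag.needs_fixed_stride_order in torch.ops.my_moe_lib.scale_rows.default.tags

In the diff above, the same effect is achieved by threading the tags argument through direct_register_custom_op.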
Yep this looks right, great work!
This looks reasonable. Let's try @ProExpertProg's suggestion for fixing the topk_weights issue.
Looks fine for unblocking now. We need to create two follow-ups.
raise ValueError(
    "ROCmFlashAttention does not support blocksparse attention.")

if use_irope:
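For readers following along, the excerpt above stops at the guard. A rough sketch of what such a constructor guard could look like is shown below; it is simplified and hypothetical, not the PR's exact code, and the real __init__ takes many more attention parameters.

import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

class ROCmFlashAttentionImpl:

    def __init__(self,
                 blocksparse_params: Optional[Dict[str, Any]] = None,
                 use_irope: bool = False) -> None:
        if blocksparse_params is not None:
            raise ValueError(
                "ROCmFlashAttention does not support blocksparse attention.")
        if use_irope:
            # The V0 ROCm backend has no iRoPE (chunked local attention)
            # support yet, so it only warns and falls back to global
            # attention, which the thread below notes may be incorrect
            # for long contexts.
            logger.warning(
                "Using irope in V0 is not supported yet; falling back to "
                "global attention.")
        self.use_irope = use_irope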
Create an issue to track this progress?
I think the output will be incorrect with global attention.
I remember it seems reasonable, but we should definitely have the right approach here.
Agreed about tracking this issue if we want to fully support V0. We will create one internally. Does that sound good to you?
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Thanks for adding support for tags! LGTM assuming this tag fixed the original issue!
Stamp
…nImpl and Triton Fused MoE bugs (vllm-project#16198) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: kliuae <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: kliuae <[email protected]> Signed-off-by: Yang Wang <[email protected]>
…nImpl and Triton Fused MoE bugs (vllm-project#16198) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: kliuae <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: kliuae <[email protected]>
…nImpl and Triton Fused MoE bugs (vllm-project#16198) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: kliuae <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: kliuae <[email protected]> Signed-off-by: Mu Huai <[email protected]>
Description
This PR fixes two bugs:
1. TypeError: ROCmFlashAttentionImpl.__init__() got an unexpected keyword argument 'use_irope'
2. topk_weights in invoke_fused_moe_kernel not being contiguous under V1 + ROCm + torch.compile + Dynamo + HIPGraph mode
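A quick sanity check for the MoE part of the fix is sketched below. The module path and the vllm op namespace are assumptions and may differ between vLLM versions; the expected result is that the registered op now carries the stride tag.

import torch

# Importing the fused MoE module triggers the custom-op registration
# (hypothetical import path; adjust to your vLLM version).
from vllm.model_executor.layers.fused_moe import fused_moe  # noqa: F401

op = torch.ops.vllm.inplace_fused_experts.default
print(torch.Tag.needs_fixed_stride_order in op.tags)  # expected: True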