[ROCm][Aiter] Add triton fp8 bmm kernel for mla #23264
Conversation
Signed-off-by: Divakar Verma <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request introduces a new Triton kernel for FP8 batched matrix multiplication (`bmm`) within the MLA backend on ROCm, aimed at improving performance. The changes add a new environment variable, `VLLM_ROCM_USE_AITER_FP8BMM`, to control this feature, and conditionally use the new kernel in place of `torch.bmm` for specific operations in the MLA implementation.
My review has identified one critical issue: the new environment variable is not included in the computation graph hash, which can lead to incorrect caching behavior. This must be addressed to ensure correctness.
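For illustration, here is a minimal sketch of the kind of fix the review is asking for. The list and helper names are hypothetical, not vLLM's actual cache-key code; the point is that any env var that changes which kernels get compiled must feed into the compilation-cache key:

```python
import hashlib
import os

# Hypothetical list: env vars whose values change the compiled graph.
# If VLLM_ROCM_USE_AITER_FP8BMM were missing here, a graph cached with
# the flag off could be reused after the flag is turned on (or vice versa).
_GRAPH_AFFECTING_ENV_VARS = [
    "VLLM_ROCM_USE_AITER_FP8BMM",  # the flag added by this PR
]

def env_hash_factor() -> str:
    """Fold graph-affecting env var values into one cache-key factor."""
    factors = [f"{k}={os.environ.get(k, '')}" for k in _GRAPH_AFFECTING_ENV_VARS]
    return hashlib.sha256("|".join(factors).encode()).hexdigest()
```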
Signed-off-by: Divakar Verma <[email protected]>
```python
# triton kernel to avoid runtime compilation for unseen batch sizes
# Pre-compile for batch sizes 1 to 1024 to cover most use-cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
max_batch_size = 1024  # [ToDo] Find the optimal upper limit
```
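A minimal sketch of what such a pre-compilation pass can look like (the function name, shapes, and dtypes are illustrative, not the PR's actual code): calling the kernel once per batch size during weight loading forces Triton to JIT-compile and cache each specialization up front instead of on the first live request.

```python
import torch

def precompile_fp8_bmm(kernel, weight: torch.Tensor, k: int,
                       max_batch_size: int = 1024) -> None:
    """Invoke `kernel` once per batch size so every Triton specialization
    is compiled during model loading rather than on a live request."""
    for bs in range(1, max_batch_size + 1):
        # Dummy activations; only the shape matters for compilation.
        x = torch.randn(weight.shape[0], bs, k,
                        dtype=torch.bfloat16, device=weight.device)
        kernel(x, weight)  # output discarded; compile-and-cache side effect
```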
I'm not crazy about adding this much overhead to the model loading time. CC @mgoin @LucasWilkinson I don't know how much pre-compilation we consider "acceptable". @divakar-amd how much does this improve performance?
Without pre-compilation, the performance is worse than torch.bmm for the first run. I added a plot above showing the performance difference if pre-compilation is not used. @SageMoore
Also, pre-compilation will add to the model loading time only if AITER is enabled.
I think this is generally fine. As you pointed out offline, we already do pre-compilation for other AITER kernels, and that takes less time than torch.compile, which seems like a reasonable upper bound. I do agree that the TTFT improvements are nice.
Signed-off-by: Divakar Verma <[email protected]>
Co-authored-by: ShaoChunLee <[email protected]>
Purpose
Replace `torch.bmm` with the aiter `batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant` kernel for MLA. A minimal dispatch sketch is shown below.
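The sketch is illustrative only: the wrapper and the kernel injection are hypothetical, the real integration lives in the MLA backend, and the exact aiter signature may differ.

```python
import os
from typing import Callable, Optional

import torch

# Flag name is from this PR; parsing shown here is illustrative.
_USE_AITER_FP8BMM = os.environ.get("VLLM_ROCM_USE_AITER_FP8BMM", "0") == "1"

def mla_bmm(x: torch.Tensor, w: torch.Tensor,
            fp8_bmm: Optional[Callable] = None) -> torch.Tensor:
    """Batched matmul for MLA: use the aiter FP8 Triton kernel when the
    flag is set and a kernel is available, otherwise fall back to torch.bmm."""
    if _USE_AITER_FP8BMM and fp8_bmm is not None:
        # Per its name, the aiter kernel quantizes activations per token
        # group and weights per batched tensor before the FP8 GEMM.
        return fp8_bmm(x, w)
    return torch.bmm(x, w)
```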
Test Plan
Test Result
Correctness Result
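As an illustration of the kind of check behind a correctness result like this (not the PR's actual test; shapes and tolerances are made up, and tolerances must be loose enough to absorb FP8 quantization error):

```python
import torch

def fp8_bmm_matches_reference(fp8_bmm, b=128, m=16, k=512, n=128,
                              rtol=2e-2, atol=2e-2) -> bool:
    """Compare the FP8 kernel's output against a bf16 torch.bmm reference."""
    x = torch.randn(b, m, k, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(b, k, n, dtype=torch.bfloat16, device="cuda")
    ref = torch.bmm(x, w)
    out = fp8_bmm(x, w)
    return torch.allclose(out.to(ref.dtype), ref, rtol=rtol, atol=atol)
```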
Performance test on DeepSeek-R1 with full-cudagraph capture mode
Performance with and without this kernel

Pre-compilation for the kernel
If the kernel is not pre-compiled, the graph below shows the performance difference between the first run of the kernel (aiter_BMM_run1) and a subsequent run (aiter_BMM_run2). Adding a pre-compilation step during weight loading resolves this issue.
