
Conversation

divakar-amd
Contributor

@divakar-amd divakar-amd commented Aug 12, 2025

Purpose

Replace `torch.bmm` with the aiter `batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant` kernel for MLA.
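The kernel name encodes its quantization scheme: activations (`a8`) are quantized with one scale per token group, weights (`w8`) with one scale per batched tensor. A rough, illustrative NumPy sketch of that scheme (the 8-bit step is simulated with symmetric int8 rounding rather than the FP8 format the kernel actually uses; shapes and group size are arbitrary, and a real fused kernel multiplies the 8-bit operands directly instead of dequantizing first):

```python
import numpy as np

def quant_per_token_group(a, group_size):
    """One scale per (token, channel-group), like the activation path."""
    B, M, K = a.shape
    g = a.reshape(B, M, K // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(g / scale), -127, 127)
    return q, scale  # grouped: [B, M, K/gs, gs] and [B, M, K/gs, 1]

def quant_per_batched_tensor(w):
    """One scale per batch matrix, like the weight path."""
    scale = np.abs(w).max(axis=(1, 2), keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127)
    return q, scale

def quant_bmm(a, w, group_size=128):
    qa, sa = quant_per_token_group(a, group_size)
    qw, sw = quant_per_batched_tensor(w)
    # Dequantize then multiply; a fused kernel would instead multiply
    # the 8-bit operands and fold the scales into the accumulator.
    da = (qa * sa).reshape(a.shape)
    dw = qw * sw
    return da @ dw

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16, 256)).astype(np.float32)
w = rng.standard_normal((8, 256, 64)).astype(np.float32)
ref = a @ w
out = quant_bmm(a, w)
rel_err = np.abs(out - ref).max() / np.abs(ref).max()
assert rel_err < 0.05  # quantized bmm tracks the fp32 reference closely
```

The per-token-group activation scales keep quantization error local to each group of channels, which is why the batched matmul stays close to the fp32 reference even with both operands at 8 bits.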

Test Plan

  • Correctness check
  • Perf check

Test Result

Correctness Result


Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    ' Christian Munoz and\nthis is my blog where I cover different\nIT topics'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' in charge of which branch of government? A. judicial B. legislative C.'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris. Paris is located along the Seine River in the north-central part of the'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    " a fascinating and rapidly evolving field. Here's a glimpse into some key areas shaping"
------------------------------------------------------------

Performance test on DeepSeek-R1 with full-cudagraph capture mode

REQUEST_RATES=(1 5 7 9)
TOTAL_SECONDS=20
TP=8
OUTPUT_LEN=128

DATASET_PATH="ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json"
PYTHON_BENCH_SCRIPT="benchmarks/benchmark_serving.py"

# Launch the server (in one shell):
VLLM_USE_V1=1 vllm serve $MODEL \
    --port 8004 \
    --tensor-parallel-size $TP \
    --max-num-seqs 256 \
    --no-enable-prefix-caching \
    --swap-space 16 \
    --disable-log-requests \
    --disable-uvicorn-access-log \
    --block-size 1 \
    -O '{"full_cuda_graph":true}'

# Run the benchmark once per request rate (in another shell):
for REQUEST_RATE in "${REQUEST_RATES[@]}"; do
    python3 $PYTHON_BENCH_SCRIPT \
        --model $MODEL \
        --percentile-metrics ttft,tpot,itl,e2el \
        --dataset-path $DATASET_PATH \
        --request-rate $REQUEST_RATE \
        --num-prompts $(($TOTAL_SECONDS * $REQUEST_RATE)) \
        --ignore-eos \
        --port 8004 \
        --sharegpt-output-len $OUTPUT_LEN
done
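The `--num-prompts $(($TOTAL_SECONDS * $REQUEST_RATE))` arithmetic sizes each run so that, at the target request rate, the workload spans roughly `TOTAL_SECONDS` of arrivals:

```python
TOTAL_SECONDS = 20
REQUEST_RATES = [1, 5, 7, 9]

# Number of prompts sent per benchmark run, one entry per request rate.
num_prompts = [TOTAL_SECONDS * rate for rate in REQUEST_RATES]
print(num_prompts)  # [20, 100, 140, 180]
```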

(Figure: resultPlot_basecudagraph-vs-aiterbmm_cudagraph_Median — median latency metrics, base full-cudagraph vs. aiter BMM full-cudagraph)

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added rocm Related to AMD ROCm v1 labels Aug 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Triton FP8 BMM kernel for MLA on ROCm to enhance performance. The changes include adding an environment variable to toggle this feature and updating the MLA attention implementation to leverage the new kernel for matrix multiplications. The implementation also involves quantizing weights to FP8 and includes a warmup loop for the Triton kernel. My review has identified two high-severity issues: the use of a hardcoded FP8 dtype instead of a platform-specific one, and a hardcoded warmup range for the Triton kernel that is insufficient for default configurations, potentially leading to performance degradation.
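The first issue the bot flags matters because FP8 is not one format: MI300-class (gfx942) ROCm devices use the `float8_e4m3fnuz` encoding (no negative zero, a different NaN convention), while NVIDIA GPUs use the OCP `float8_e4m3fn` format, so a hardcoded dtype breaks one platform or the other. A hypothetical helper illustrating platform-based selection (the function and parameter names are assumptions for illustration, not the PR's actual code; vLLM centralizes this choice in its platform layer):

```python
def pick_fp8_dtype(is_rocm: bool, is_fnuz_arch: bool = True) -> str:
    """Return the torch dtype name to use for FP8 tensors.

    MI300-class ROCm parts use the fnuz e4m3 variant; NVIDIA (and
    code paths targeting the OCP spec) use float8_e4m3fn.
    """
    if is_rocm and is_fnuz_arch:
        return "float8_e4m3fnuz"
    return "float8_e4m3fn"

print(pick_fp8_dtype(is_rocm=True))   # float8_e4m3fnuz
print(pick_fp8_dtype(is_rocm=False))  # float8_e4m3fn
```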

k50112113 and others added 6 commits August 12, 2025 20:21
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
mxz297 and others added 10 commits August 20, 2025 09:38
Signed-off-by: Xiaozhu <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: RUTHLESS-BOT <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Goin <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models structured-output speculative-decoding labels Aug 20, 2025
@mergify mergify bot added tpu Related to Google TPUs tool-calling labels Aug 20, 2025

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @divakar-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@divakar-amd
Contributor Author

Re-created another PR with some updates: #23264
