[ROCm][Perf] New design on ROCm AITER MHA backend Implementation #25763

ganyi1996ppo · 2025-09-26T11:50:52Z

Purpose

The current AiterFlashAttentionImpl fetches K/V on every run, which creates unnecessary memory pressure and non-trivial latency, especially with long prompts. This PR:

Removes redundant KV fetches from the attention backend.
Introduces phase-aware execution (decode, pure prefill, chunk prefill) and reorders inputs to [decode:chunk_prefill:pure_prefill] for token-contiguous memory access.
Rewrites the “fetch KV” Triton kernel for better occupancy in chunked prefill and similar scenarios.
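
For readers unfamiliar with the term, "fetching KV" from a paged KV cache amounts to a gather through a block table. The following is only a minimal pure-Python reference of that idea, assuming a block-table layout; the function name, argument names, and cache layout are hypothetical, and the Triton kernel in this PR is not shown here and may work differently.

```python
def gather_kv(cache, block_table, seq_len, block_size):
    """Collect the first seq_len entries of one sequence from a paged cache.

    Assumed (hypothetical) layout:
      cache[b][s]     -> entry stored in slot s of physical block b
      block_table[i]  -> physical block that backs logical block i
    """
    out = []
    for pos in range(seq_len):
        physical_block = block_table[pos // block_size]  # which block holds this token
        slot = pos % block_size                          # offset inside that block
        out.append(cache[physical_block][slot])
    return out


# Tiny usage example with strings standing in for K/V vectors.
cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
print(gather_kv(cache, block_table=[2, 0], seq_len=3, block_size=2))
```

A gather like this costs memory traffic proportional to the context length, which is presumably why avoiding it where it is not needed matters most for long prompts.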

Design and implementation

Phase-aware path:

decode
chunk prefill (cp)
pure prefill (pp)

Reordering inputs to [decode:cp:pp] keeps the tokens of each phase contiguous in memory, improving kernel locality and occupancy. The reorder happens both in the Scheduler's scheduling phase and in the ModelRunner's state-update phase. A new SchedulerConfig option, split_prefill_from_chunk, controls this behavior; it is turned on when both VLLM_ROCM_USE_AITER and VLLM_ROCM_USE_AITER_MHA are set.
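
The reordering can be sketched as a stable sort by phase. This is a minimal illustration, not the PR's code: the `Req` record, its field names, the classification rule (decode = one new token with cached context, chunk prefill = several new tokens with cached context, pure prefill = no cached context), and the truthy-string parsing of the environment variables are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical request record; field names are illustrative only."""
    req_id: str
    num_computed_tokens: int   # tokens already present in the KV cache
    num_scheduled_tokens: int  # new query tokens scheduled for this step


def phase(req: Req) -> int:
    """Return 0 for decode, 1 for chunk prefill, 2 for pure prefill (assumed rule)."""
    if req.num_computed_tokens > 0 and req.num_scheduled_tokens == 1:
        return 0
    if req.num_computed_tokens > 0:
        return 1
    return 2


def reorder(reqs: list[Req]) -> list[Req]:
    # sorted() is stable, so requests keep their relative order within a phase.
    return sorted(reqs, key=phase)


def split_enabled(env: dict[str, str]) -> bool:
    """Assumed gating: both flags must be set to a truthy value."""
    flags = ("VLLM_ROCM_USE_AITER", "VLLM_ROCM_USE_AITER_MHA")
    return all(env.get(name, "0").strip().lower() in ("1", "true") for name in flags)


batch = [Req("a", 0, 512), Req("b", 100, 1), Req("c", 256, 256), Req("d", 7, 1)]
print([r.req_id for r in reorder(batch)])  # ['b', 'd', 'c', 'a']
```

Using a stable sort means the layout changes only as much as needed to group the phases, which keeps the per-request bookkeeping in the scheduler and the model runner easy to align.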

Compared with the old implementation, this design is faster and more memory efficient, especially for long prompts. Performance measured with Qwen3 on MI308:

Long prompt, short output (2k prompt, 16 output): ~4.x throughput improvement.
Short prompt, long output (128 prompt, 1k output): ~2.x throughput improvement.
Extremely long prompt (192k prompt, 2k output): ~5.x throughput improvement.

Test Plan

Accuracy: lm_eval test for accuracy verification
Performance: vllm bench test

Test Result

2k prompt, 16 output case:

[Figure: old impl]

[Figure: new impl]

128 prompt, 1k output case:

[Figure: old impl]

[Figure: new impl]

Accuracy verification

We tested this PR with Qwen3-30B-A3B-FP8 on gsm8k using lm_eval. The results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8097|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8901|±  |0.0086|


Signed-off-by: ganyi <[email protected]>

github-actions · 2025-09-26T11:51:00Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request introduces a new, more performant MHA backend implementation for ROCm. The changes include removing redundant KV fetches, introducing phase-aware execution (decode, pure prefill, chunk prefill), reordering inputs for better memory access, and rewriting the Triton kernel for fetching KV cache. The performance improvements demonstrated are significant. However, I have identified a few critical bugs in the implementation that need to be addressed. These include incorrect scaling in a Triton kernel, an invalid tensor view operation that will lead to a runtime error, and a logical error in the batch reordering logic. Addressing these issues is crucial for the correctness and stability of the new backend.

vllm/v1/attention/backends/rocm_aiter_fa.py

vllm/v1/attention/backends/utils.py

wuhuikx · 2025-09-27T06:46:39Z

Could you please clarify which Qwen3 model and datatype you are using? Could you also append the accuracy results?

vllm/v1/core/sched/scheduler.py

vllm/v1/attention/backends/utils.py

vllm/v1/attention/backends/rocm_aiter_fa.py

wuhuikx · 2025-09-28T02:06:16Z

cc @wuhuikx @sunway513

ganyi1996ppo · 2025-09-29T06:59:53Z

> Could you please clarify which Qwen3 model and datatype you are using? Could you also append the accuracy results?

Thanks for the suggestion. I have just updated the PR description with the model and the accuracy verification.

ganyi1996ppo · 2025-09-29T07:01:56Z

Hi @gshtras, could you please take a look at this PR?

refactor attention backend for perf boost

5302056

Signed-off-by: ganyi <[email protected]>

ganyi1996ppo requested review from gshtras, WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, heheda12345, ApostaC, simon-mo, youkaichao, mgoin, tlrmchlsmth, houseroad, hmellor, yewentao256 and ProExpertProg as code owners September 26, 2025 11:50

mergify bot added the rocm (Related to AMD ROCm) and v1 labels Sep 26, 2025

ganyi1996ppo mentioned this pull request Sep 26, 2025

[Perf] refactor attention backend for perf boost ROCm/vllm#713

Open

5 tasks

gemini-code-assist bot reviewed Sep 26, 2025

vllm/v1/attention/backends/rocm_aiter_fa.py (outdated, resolved)

vllm/v1/attention/backends/rocm_aiter_fa.py (outdated, resolved)

vllm/v1/attention/backends/utils.py (outdated, resolved)

ganyi1996ppo changed the title ~~[ROCm][Perf] New design on MHA backend Implementation~~ [ROCm][Perf] New design on ROCm AITER MHA backend Implementation Sep 26, 2025

ganyi1996ppo and others added 3 commits September 26, 2025 11:55

fix some bugs

8580c2c

Signed-off-by: ganyi <[email protected]>

remove v0 kv cache layout

608cbd4

Signed-off-by: ganyi <[email protected]>

Merge branch 'main' into ganyi/refactor_attn_backend_impl_main

2dc0e66

wuhuikx suggested changes Sep 27, 2025

vllm/v1/core/sched/scheduler.py (outdated, resolved)

vllm/v1/attention/backends/utils.py (outdated, resolved)

wuhuikx suggested changes Sep 27, 2025

vllm/v1/attention/backends/rocm_aiter_fa.py (outdated, resolved)

fix scheduler bug

13701e8

Signed-off-by: ganyi <[email protected]>

ganyi1996ppo requested a review from LucasWilkinson as a code owner September 28, 2025 03:44

ganyi1996ppo added 2 commits September 29, 2025 03:14

fix lint

a1b1fc6

Signed-off-by: ganyi <[email protected]>

fix lint

acfc7de

Signed-off-by: ganyi <[email protected]>

ganyi1996ppo requested a review from wuhuikx September 29, 2025 06:02
