[FEAT] [AITER] [ROCm] integrate aiter sampling ops #26084
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/envs.py
Outdated
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM: bool = False
VLLM_ROCM_USE_TRITON_ROPE: bool = False
VLLM_ROCM_USE_AITER_FP8BMM: bool = True
VLLM_ROCM_USE_AITER_SAMPLER: bool = True
I would really like us to stop adding an environment variable every time we include a new aiter kernel. If you think this kernel is broken outside of some specific cases, then we can discuss that, but let's not just add one by default.
try:
    import importlib

    importlib.import_module("aiter.ops.sampling")
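The guarded import in the diff follows the common optional-dependency pattern. A self-contained sketch of that pattern, using a stdlib module as a stand-in for aiter.ops.sampling (which needs a ROCm install); the helper name is ours, not from the PR:

```python
import importlib


def try_import(module_name):
    # Return the module if importable, else None. This mirrors the
    # try/except-around-import_module pattern under review.
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "json" stands in for "aiter.ops.sampling" here
sampling_ops = try_import("json")
missing = try_import("definitely_not_a_real_module")
```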
Same idea here. We only support one aiter version, the version that's in the ROCM docker file. Let's remove this try catch and just import the kernel and run it.
Sure, we will reuse VLLM_ROCM_USE_AITER for this feature. Note that this PR requires upgrading the aiter commit: the version in the ROCm dockerfile does not support this feature.
CC @gshtras
Hi @SageMoore, we now reuse VLLM_ROCM_USE_AITER for this feature and have modified the code based on your feedback.
AITER commits from the main branch currently cannot be used due to a GPT-OSS compatibility issue.
Signed-off-by: vllmellm <[email protected]>
Purpose
To significantly improve the performance of top-k/top-p sampling on AMD ROCm GPUs, this PR integrates the optimized aiter sampling operator, replacing the default PyTorch native implementation.
The new aiter path is activated when the VLLM_ROCM_USE_AITER environment variable is set to 1. The core changes are located in vllm/v1/sample/ops/topk_topp_sampler.py.
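For context, the semantics being accelerated: top-k keeps the k highest logits, and top-p keeps the smallest set of those whose cumulative probability mass reaches p; everything else is masked out. A pure-Python reference sketch of this filtering (not the vLLM or aiter implementation; the function name is ours):

```python
import math


def apply_top_k_top_p(logits, k, p):
    # Keep the k highest logits, then the smallest prefix of those whose
    # cumulative softmax mass reaches p; mask the rest to -inf.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    topk = order[:k]
    # Softmax over the top-k survivors (max-subtracted for stability)
    mx = max(logits[i] for i in topk)
    exps = [math.exp(logits[i] - mx) for i in topk]
    total = sum(exps)
    kept, cum = set(), 0.0
    for idx, e in zip(topk, exps):
        kept.add(idx)
        cum += e / total
        if cum >= p:
            break
    return [x if i in kept else float("-inf") for i, x in enumerate(logits)]
```

The real kernels fuse this filtering with sampling on the GPU; this sketch only pins down the expected masking behavior.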
Validated on aiter commit:
6b586ae
Test Plan
Comprehensive correctness and performance tests were conducted on AMD MI300X GPUs to validate the feature and analyze its performance impact.
Server Configuration
The aiter sampling kernel on the ROCm platform is controlled by the VLLM_ROCM_USE_AITER master switch. The behavior is as follows:
PyTorch Native (Default): The PyTorch native implementation is used by default. This occurs if the VLLM_ROCM_USE_AITER environment variable is not set, or is explicitly set to 0.
Aiter Sampling (Enabled): To enable the optimized aiter sampler, the environment variable must be set:
export VLLM_ROCM_USE_AITER=1
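Putting it together, a server launch might look like the following; this is an illustrative sketch only, and the model name and flags are placeholders rather than the exact commands used in this PR:

```shell
# Illustrative only: enable the aiter kernels, then start the server.
export VLLM_ROCM_USE_AITER=1
echo "VLLM_ROCM_USE_AITER=$VLLM_ROCM_USE_AITER"
# Placeholder launch command (model and TP size are assumptions):
# vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tensor-parallel-size 8
```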
Part A Correctness Test (lm_eval):
Used the lm_eval suite to test the gsm8k task.
Goal: Further verify the accuracy of the new operator
Part B Performance Test (benchserver)
Used a vLLM benchmark script to conduct end-to-end stress tests.
Goal: Compare throughput (tok/s) and latency (TPOT) between the PyTorch native and aiter sampling implementations.
The following command was used, with sampling flags (--top-p, --top-k) adjusted for each scenario:
Server startup:
Client startup:
top-p example
top-k and top-p+k can be adjusted flexibly
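For reference, the sampling flags map onto the request body of vLLM's OpenAI-compatible endpoint, where top_k is a vLLM-specific extension to the OpenAI schema. A sketch of a request payload (model, prompt, and values are placeholders, not taken from this PR's benchmark runs):

```python
# Illustrative payload for vLLM's OpenAI-compatible /v1/completions endpoint.
# "top_k" is a vLLM extension; all values here are placeholders.
payload = {
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "prompt": "The quick brown fox",
    "max_tokens": 64,
    "top_p": 0.95,  # top-p scenario
    "top_k": 50,    # add for the combined top-p+k scenario
}
```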
Test Result
Part A: Using lm_eval test
We conducted detailed tests on AMD MI300X GPUs using the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, performing a three-way comparison of top-p, top-k, and top-p+k between PyTorch native and aiter sampling, as shown in Figure 1.
Accuracy (gsm8k)
The accuracy difference between aiter and pytorch native is negligible, falling within the expected range of variance for different low-level sampling implementations.
Part B: Using vLLM Benchmarks
We also conducted end-to-end testing and observed significant performance improvements.
The aiter operator provides a substantial performance uplift. On average, it achieves an ~1.6x increase in total throughput and a ~50% reduction in Time To First Token (TTFT).