Conversation

@vllmellm (Contributor) commented Oct 2, 2025

Purpose

To significantly improve the performance of top-k/top-p sampling on AMD ROCm GPUs, this PR integrates the optimized aiter sampling operator, replacing the default PyTorch native implementation.

The new aiter path is activated when the VLLM_ROCM_USE_AITER_SAMPLER environment variable is set to 1. The core changes are located in vllm/v1/sample/ops/topk_topp_sampler.py.
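
For context, a minimal pure-Python sketch of the top-k/top-p filtering semantics that both the PyTorch native path and the aiter kernel implement (illustrative only, not the vLLM code; a real sampler operates on GPU tensors of logits):

```python
def top_k_top_p_filter(probs, k=0, p=1.0):
    """Keep the k highest-probability tokens (k=0 disables top-k),
    then keep the smallest prefix whose cumulative probability
    reaches p, and renormalize. Returns (index, prob) pairs."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    if k > 0:
        ranked = ranked[:k]
    kept, cum = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(prob for _, prob in kept)
    return [(idx, prob / total) for idx, prob in kept]
```

The aiter operator fuses this filtering with the sampling step on ROCm GPUs; the sketch above only illustrates the intended semantics.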

Validated on aiter commit: 6b586ae

Test Plan

Comprehensive correctness and performance tests were conducted on AMD MI300X GPUs to validate the feature and analyze its performance impact.

Server Configuration
The aiter sampling kernel on the ROCm platform is controlled by the VLLM_ROCM_USE_AITER master switch. The behavior is as follows:

PyTorch Native (Default): The PyTorch native implementation is used by default. This occurs if the VLLM_ROCM_USE_AITER environment variable is not set, or is explicitly set to 0.

Aiter Sampling (Enabled): To enable the optimized aiter sampler, the environment variable must be set:
export VLLM_ROCM_USE_AITER=1
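
A minimal sketch of how such a master switch typically gates the backend choice (function names here are hypothetical; the actual dispatch lives in vllm/v1/sample/ops/topk_topp_sampler.py):

```python
import os

def rocm_aiter_sampler_enabled() -> bool:
    # Aiter path only when VLLM_ROCM_USE_AITER=1; unset or "0"
    # keeps the PyTorch native implementation (hypothetical helper).
    return os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"

def select_sampler() -> str:
    # Hypothetical dispatch: return a label for the chosen backend.
    return "aiter" if rocm_aiter_sampler_enabled() else "pytorch_native"
```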

Part A Correctness Test (lm_eval):
Used the lm_eval suite to test the gsm8k task.
Goal: Further verify the accuracy of the new operator

Part B Performance Test (vllm bench serve)
Used the vLLM benchmark script to conduct end-to-end stress tests.
Goal: Compare throughput (tok/s) and latency (TPOT) between the PyTorch native and aiter sampling implementations.

The following command was used, with sampling flags (--top-p, --top-k) adjusted for each scenario:

Server startup:

export TORCHDYNAMO_DISABLE=1
export VLLM_USE_V1=1
# aiter
export VLLM_ROCM_USE_AITER=1
# pytorch native
export VLLM_ROCM_USE_AITER=0   # or leave unset

export HIP_VISIBLE_DEVICES=1,2
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --swap-space 16 \
    --disable-log-requests \
    --distributed-executor-backend mp \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 131072 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --port 8001

Client startup:

top-p example

vllm bench serve \
    --backend vllm \
    --model "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --num-prompts 640 \
    --max-concurrency 64 \
    --save-result \
    --trust-remote-code \
    --port 8001 \
    --endpoint /v1/completions \
    --top-p 0.95 \
    --top-k 0 \
    --temperature 0.1

The top-k and combined top-p+k scenarios use the same command with the sampling flags adjusted accordingly.
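
For instance, only the sampling flags change between scenarios (the values below are illustrative, not necessarily the exact ones used in the runs):

```shell
# top-k only: disable top-p by setting it to 1.0
SAMPLING_FLAGS="--top-p 1.0 --top-k 50 --temperature 0.1"
# combined top-p + top-k
SAMPLING_FLAGS="--top-p 0.95 --top-k 50 --temperature 0.1"
echo "$SAMPLING_FLAGS"
```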

Test Result

Part A: Using lm_eval test
We conducted detailed tests on AMD MI300X GPUs using the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, running three-way comparisons of top-p, top-k, and top-p+k for PyTorch native and aiter sampling; the results are shown in Figure 1 below.

[Figure 1: gsm8k accuracy comparison, PyTorch native vs. aiter sampling (screenshot)]

Accuracy (gsm8k)
The accuracy difference between aiter and pytorch native is negligible, falling within the expected range of variance for different low-level sampling implementations.

Part B: Using vLLM Benchmarks
At the same time, we also conducted end-to-end testing and saw significant performance improvements.

[Figure: end-to-end benchmark results, PyTorch native vs. aiter sampling (screenshot)]

The aiter operator provides a substantial performance uplift. On average, it achieves an ~1.6x increase in total throughput and a ~50% reduction in Time To First Token (TTFT).

@mergify mergify bot added the v1 label Oct 2, 2025

mergify bot commented Oct 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 2, 2025
@vllmellm vllmellm force-pushed the feat/rocm-aiter-sampler branch from 16ce99f to c3c7958 on October 2, 2025 09:52
@mergify mergify bot removed the needs-rebase label Oct 2, 2025
@vllmellm vllmellm marked this pull request as ready for review October 3, 2025 02:41
@vllmellm vllmellm changed the title Feat/ integrate aiter sampling ops [FEAT] [AITER] [ROCm] integrate aiter sampling ops Oct 3, 2025
@mergify mergify bot added the rocm Related to AMD ROCm label Oct 3, 2025
vllm/envs.py Outdated
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM: bool = False
VLLM_ROCM_USE_TRITON_ROPE: bool = False
VLLM_ROCM_USE_AITER_FP8BMM: bool = True
VLLM_ROCM_USE_AITER_SAMPLER: bool = True
Contributor commented:

I would really like us to stop adding an environment variable every time we include a new aiter kernel. If you think this kernel is broken outside of some specific cases, then we can discuss that, but let's not just add one by default.

try:
    import importlib

    importlib.import_module("aiter.ops.sampling")
Contributor commented:

Same idea here. We only support one aiter version, the version that's in the ROCM docker file. Let's remove this try catch and just import the kernel and run it.
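
One way to keep the fail-fast behavior the reviewer asks for while still importing dynamically is an explicit availability check instead of a broad try/except. A generic sketch (the `import_optional` helper is hypothetical, not vLLM code; in this PR's case the arguments would be "aiter.ops.sampling" and "aiter"):

```python
import importlib
import importlib.util

def import_optional(module_path: str, root_package: str):
    # Raise immediately if the root package is absent, rather than
    # silently falling back to another implementation.
    if importlib.util.find_spec(root_package) is None:
        raise ImportError(f"{root_package} is required but not installed")
    return importlib.import_module(module_path)
```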

Contributor Author replied:

Sure. We will reuse VLLM_ROCM_USE_AITER for this feature. Also, this PR requires an upgrade of the aiter commit: the version in the ROCm Dockerfile does not support this feature.

CC. @gshtras

Contributor Author replied:

Hi @SageMoore, we now reuse VLLM_ROCM_USE_AITER for this feature and have modified the code based on your feedback.

Collaborator commented:

AITER commits from the main branch currently cannot be used due to a GPT-OSS compatibility issue.

@vllmellm vllmellm force-pushed the feat/rocm-aiter-sampler branch 3 times, most recently from 9aa523e to 82bdfba, on October 7, 2025 05:01
@vllmellm vllmellm force-pushed the feat/rocm-aiter-sampler branch from 82bdfba to 8c17184 on October 7, 2025 05:18
@hongxiayang hongxiayang self-assigned this Oct 13, 2025

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
