[FEAT] [AITER] [ROCm] integrate aiter sampling ops #26084
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/envs.py
Outdated
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM: bool = False
VLLM_ROCM_USE_TRITON_ROPE: bool = False
VLLM_ROCM_USE_AITER_FP8BMM: bool = True
VLLM_ROCM_USE_AITER_SAMPLER: bool = True
I would really like us to stop adding an environment variable every time we include a new aiter kernel. If you think this kernel is broken outside of some specific cases, then we can discuss that, but let's not just add one by default.
try:
    import importlib

    importlib.import_module("aiter.ops.sampling")
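The guarded import in the diff follows the common optional-dependency pattern. A self-contained sketch of that pattern, using a stdlib module as a stand-in for aiter.ops.sampling (which needs a ROCm install); the helper name is ours, not from the PR:

```python
import importlib


def try_import(module_name):
    # Return the module if importable, else None. This mirrors the
    # try/except-around-import_module pattern under review.
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "json" stands in for "aiter.ops.sampling" here
sampling_ops = try_import("json")
missing = try_import("definitely_not_a_real_module")
```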
Same idea here. We only support one aiter version, the version that's in the ROCM docker file. Let's remove this try catch and just import the kernel and run it.
Sure, we will reuse VLLM_ROCM_USE_AITER for this feature. Note that this PR requires upgrading the aiter commit: the version in the ROCm dockerfile does not support this feature.
CC @gshtras
Hi @SageMoore, we now reuse VLLM_ROCM_USE_AITER for this feature and have modified the code based on your feedback.
AITER commits from the main branch currently cannot be used due to a GPT-OSS compatibility issue.
Signed-off-by: vllmellm <[email protected]>
Purpose
To significantly improve the performance of top-k/top-p sampling on AMD ROCm GPUs, this PR integrates the optimized aiter sampling operator, replacing the default PyTorch native implementation.
The new aiter path is activated when the VLLM_ROCM_USE_AITER environment variable is set to 1. The core changes are located in vllm/v1/sample/ops/topk_topp_sampler.py.
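For context, the semantics being accelerated: top-k keeps the k highest logits, and top-p keeps the smallest set of those whose cumulative probability mass reaches p; everything else is masked out. A pure-Python reference sketch of this filtering (not the vLLM or aiter implementation; the function name is ours):

```python
import math


def apply_top_k_top_p(logits, k, p):
    # Keep the k highest logits, then the smallest prefix of those whose
    # cumulative softmax mass reaches p; mask the rest to -inf.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    topk = order[:k]
    # Softmax over the top-k survivors (max-subtracted for stability)
    mx = max(logits[i] for i in topk)
    exps = [math.exp(logits[i] - mx) for i in topk]
    total = sum(exps)
    kept, cum = set(), 0.0
    for idx, e in zip(topk, exps):
        kept.add(idx)
        cum += e / total
        if cum >= p:
            break
    return [x if i in kept else float("-inf") for i, x in enumerate(logits)]
```

The real kernels fuse this filtering with sampling on the GPU; this sketch only pins down the expected masking behavior.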
Validated on aiter commit:
6b586ae
Test Plan
Comprehensive correctness and performance tests were conducted on AMD MI300X GPUs to validate the feature and analyze its performance impact.
Server Configuration
The aiter sampling kernel on the ROCm platform is controlled by the VLLM_ROCM_USE_AITER master switch. The behavior is as follows:
PyTorch Native (Default): The PyTorch native implementation is used by default. This occurs if the VLLM_ROCM_USE_AITER environment variable is not set, or is explicitly set to 0.
Aiter Sampling (Enabled): To enable the optimized aiter sampler, the environment variable must be set:
export VLLM_ROCM_USE_AITER=1
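Putting it together, a server launch might look like the following; this is an illustrative sketch only, and the model name and flags are placeholders rather than the exact commands used in this PR:

```shell
# Illustrative only: enable the aiter kernels, then start the server.
export VLLM_ROCM_USE_AITER=1
echo "VLLM_ROCM_USE_AITER=$VLLM_ROCM_USE_AITER"
# Placeholder launch command (model and TP size are assumptions):
# vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tensor-parallel-size 8
```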
Part A Correctness Test (lm_eval):
Used the lm_eval suite to test the gsm8k task.
Goal: Further verify the accuracy of the new operator
Part B Performance Test (benchserver)
Used a vLLM benchmark script to conduct end-to-end stress tests.
Goal: Compare throughput (tok/s) and latency (TPOT) between the PyTorch native and aiter sampling implementations.
The following command was used, with sampling flags (--top-p, --top-k) adjusted for each scenario:
Server startup:
Client startup:
top-p example
top-k and top-p+k can be adjusted flexibly
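For reference, the sampling flags map onto the request body of vLLM's OpenAI-compatible endpoint, where top_k is a vLLM-specific extension to the OpenAI schema. A sketch of a request payload (model, prompt, and values are placeholders, not taken from this PR's benchmark runs):

```python
# Illustrative payload for vLLM's OpenAI-compatible /v1/completions endpoint.
# "top_k" is a vLLM extension; all values here are placeholders.
payload = {
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "prompt": "The quick brown fox",
    "max_tokens": 64,
    "top_p": 0.95,  # top-p scenario
    "top_k": 50,    # add for the combined top-p+k scenario
}
```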
Test Result
Part A: Using lm_eval test
We conducted detailed tests on AMD MI300X GPUs using the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, performing a three-way comparison of top-p, top-k, and top-p+k between PyTorch native and aiter sampling, as shown in Figure 1.
Accuracy (gsm8k)
The accuracy difference between aiter and pytorch native is negligible, falling within the expected range of variance for different low-level sampling implementations.
Part B: Using vLLM Benchmarks
We also conducted end-to-end testing and observed significant performance improvements.
The aiter operator provides a substantial performance uplift. On average, it achieves an ~1.6x increase in total throughput and a ~50% reduction in Time To First Token (TTFT).