
Conversation

@vllmellm vllmellm commented Oct 10, 2025

Purpose

This PR adds a fusion pass for ROCm AITER that fuses the `+rms_norm` (AITER RMSNorm) and `+quant_fp8` (vLLM FP8 quantization) custom ops into a single fused kernel.
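The pattern being fused can be illustrated with a small NumPy sketch. This is a conceptual stand-in, not the actual AITER kernels: the per-tensor dynamic scaling and the `fp8_max = 448.0` (e4m3) bound are illustrative assumptions. The point of the fused variant is that the normalized intermediate never needs to be written back to memory at full precision.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Reference RMSNorm: scale by the reciprocal root-mean-square.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def quant_fp8(x, fp8_max=448.0):
    # Per-tensor dynamic quantization into the FP8 e4m3 range.
    scale = np.abs(x).max() / fp8_max
    return np.clip(x / scale, -fp8_max, fp8_max), scale

def fused_rms_norm_quant_fp8(x, weight, eps=1e-6, fp8_max=448.0):
    # What the fusion pass replaces the two ops with: one logical pass,
    # no full-precision intermediate tensor materialized between them.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    y = x / rms * weight
    scale = np.abs(y).max() / fp8_max
    return np.clip(y / scale, -fp8_max, fp8_max), scale

x = np.random.randn(4, 64).astype(np.float32)
w = np.ones(64, dtype=np.float32)
q1, s1 = quant_fp8(rms_norm(x, w))
q2, s2 = fused_rms_norm_quant_fp8(x, w)
assert np.allclose(q1, q2) and np.isclose(s1, s2)
```

The unfused and fused paths are numerically equivalent; the benefit on hardware comes from eliminating the intermediate memory traffic and kernel-launch overhead.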

Benchmark Result

| Metric | Without Fusion Pass | With Fusion Pass |
|--------|--------------------:|-----------------:|
| Successful requests | 500 | 500 |
| Benchmark duration (s) | 173.76 | 170.31 |
| Total input tokens | 520,558 | 520,558 |
| Total generated tokens | 456,122 | 456,834 |
| Request throughput (req/s) | 2.88 | 2.94 |
| Output token throughput (tok/s) | 2,625.02 | 2,682.37 |
| Peak output token throughput (tok/s) | 7,924.00 | 7,410.00 |
| Peak concurrent requests | 500.00 | 500.00 |
| Total token throughput (tok/s) | 5,620.87 | 5,738.91 |
| Mean TTFT (ms) | 35,048.69 | 34,637.29 |
| Median TTFT (ms) | 28,413.54 | 28,841.97 |
| P99 TTFT (ms) | 91,625.66 | 90,845.96 |
| Mean TPOT (ms) | 167.64 | 170.28 |
| Median TPOT (ms) | 117.48 | 118.62 |
| P99 TPOT (ms) | 881.79 | 913.72 |
| Mean ITL (ms) | 114.17 | 113.99 |
| Median ITL (ms) | 57.83 | 61.15 |
| P99 ITL (ms) | 2,111.96 | 2,086.31 |

benchmark setting

```shell
vllm bench serve \
  --backend vllm \
  --model "RedHatAI/Qwen3-14B-FP8-dynamic" \
  --dataset-name random \
  --num-prompts 500 \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --endpoint /v1/completions \
  --random-range-ratio 0.9
```

IMPORTANT NOTE

Pass the following flag to enable the fusion pass:

```shell
--compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'
```

Test Plan

  • A unit test has been added in `vllm/tests/compile/test_rocm_aiter_fusion.py` that verifies accuracy and the replacement of the ops in the CUDA graph.
  • End-to-end test using the RedHatAI/Qwen3-14B-FP8-dynamic model.
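The "Matched count: 2" in the unit-test log below reports how many rms_norm → quant_fp8 pairs the pass rewrote. A minimal pure-Python sketch of that matching logic (the op names and the fused-op name here are illustrative placeholders, not vLLM's actual identifiers):

```python
def fuse(ops):
    """Toy pattern matcher: replace each adjacent (rms_norm, quant_fp8)
    pair with one fused op, mirroring what the compile pass does."""
    out, matched, i = [], 0, 0
    while i < len(ops):
        if ops[i] == "rms_norm" and i + 1 < len(ops) and ops[i + 1] == "quant_fp8":
            out.append("rms_norm_quant_fp8_fused")  # hypothetical fused-op name
            matched += 1
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out, matched

# Two fusible pairs, matching the "Matched count: 2" in the unit-test log.
ops = ["embed", "rms_norm", "quant_fp8", "gemm", "rms_norm", "quant_fp8", "lm_head"]
fused, matched = fuse(ops)
# matched == 2; the fused graph has 5 ops instead of 7
```

The real pass operates on the compiled FX graph rather than a flat op list, but the invariant the test checks is the same: every fusible pair is replaced and the match count is as expected.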

environment setting
Step 1: run vllm serve

```shell
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_V1=1 \
vllm serve RedHatAI/Qwen3-14B-FP8-dynamic \
  --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"], "cudagraph_capture_sizes": [1,2,4,8,16,24,32,256]}' \
  --port 9090 \
  --trust-remote-code --swap-space 16 --distributed-executor-backend mp
```
Step 2: run lm_eval

```shell
lm_eval --model local-completions --tasks gsm8k \
  --model_args model=RedHatAI/Qwen3-14B-FP8-dynamic,base_url=http://localhost:9090/v1/completions \
  --trust_remote_code \
  --num_fewshot 5 \
  --batch_size 128
```

Test Results

RedHatAI/Qwen3-14B-FP8-dynamic fusion pass

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7612 | ± 0.0117 |
|       |   | strict-match     | 5 | exact_match | 0.8741 | ± 0.0091 |

RedHatAI/Qwen3-14B-FP8-dynamic without fusion pass

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7718 | ± 0.0116 |
|       |   | strict-match     | 5 | exact_match | 0.8741 | ± 0.0091 |

Unit test result

```
INFO 10-10 08:39:08 [init.py:224] Automatically detected platform rocm.
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.4.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /app/norm/vllm
configfile: pyproject.toml
plugins: anyio-4.10.0, asyncio-1.2.0
asyncio: mode=strict, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... WARNING 10-10 08:39:11 [interface.py:518] Current platform cuda does not have 'test' attribute.
WARNING 10-10 08:39:11 [interface.py:518] Current platform cuda does not have 'bases' attribute.
WARNING 10-10 08:39:11 [interface.py:518] Current platform cuda does not have 'test' attribute.
collected 2 items

compile/test_rocm_aiter_fusion.py::test_fusion_rmsnorm_quant[1e-05-257-64-dtype0] Matched count: 2
PASSED
compile/test_rocm_aiter_fusion.py::test_fusion_rmsnorm_quant[1e-06-257-64-dtype0] Matched count: 2
PASSED
======================== 2 passed, 2 warnings in 25.65s ========================
```


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the rocm Related to AMD ROCm label Oct 10, 2025
@vllmellm vllmellm marked this pull request as ready for review October 10, 2025 18:05

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@mergify

mergify bot commented Oct 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 23, 2025
