[FEAT] [ROCm]: AITER Fused MOE V1 Support #16752

vllmellm · 2025-04-17T03:26:34Z

Description

This PR enables AITER's fused Mixture-of-Experts ops, found here, to be used with V1.

Implementation

The following ops have been added/modified and registered as custom ops:

rocm_aiter_ck_moe
rocm_aiter_fmoe_fp8_blockscale_g1u1
rocm_aiter_asm_moe
rocm_aiter_topk_softmax
rocm_aiter_shuffle_weight
rocm_aiter_asm_moe_tkw1
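For readers unfamiliar with the routing step, the following is a minimal pure-Python sketch of what a top-k softmax routing op computes for a single token. It is only a reference illustration, not AITER's kernel or the signature of rocm_aiter_topk_softmax; the function name `topk_softmax` and the `renormalize` option are assumptions made for this example.

```python
import math

def topk_softmax(logits, k, renormalize=True):
    """Reference routing for one token (hypothetical helper, not the AITER op)."""
    # Numerically stable softmax over the expert dimension.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts, most probable first.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_weights = [probs[i] for i in top_ids]

    # Optionally rescale the kept weights so they sum to 1.
    if renormalize:
        kept = sum(top_weights)
        top_weights = [w / kept for w in top_weights]
    return top_weights, top_ids

# Four experts, route the token to the best two.
weights, ids = topk_softmax([0.1, 2.0, -1.0, 1.5], k=2)
print(ids)  # → [1, 3]
```

The fused MoE ops then run only the selected experts for each token and combine their outputs using these weights.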

Testing

The integration has been verified through:

High-level integration tests with various models.
Accuracy tests using lm_eval.

Accuracy Test GSM8K

lm_eval was run on the following models, using the command shown after the list:

Llama-4-Maverick-17B-128E-Instruct
Llama-4-Maverick-17B-128E-Instruct-FP8
DeepSeek-V3
Mixtral-8x7B-Instruct-v0.1
Mixtral-8x7B-Instruct-v0.1(FP8)

VLLM_USE_TRITON_FLASH_ATTN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
SAFETENSORS_FAST_GPU=1 \
lm_eval \
--model vllm \
--model_args pretrained=model_name,tensor_parallel_size=8,enforce_eager=False,max_model_len=4096 \
--trust_remote_code \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size auto

In addition, we set the following extra environment variables and arguments for individual models:

Llama-4-Maverick-17B-128E-Instruct:

VLLM_USE_V1=1

Llama-4-Maverick-17B-128E-Instruct-FP8:

VLLM_USE_V1=1

DeepSeek-V3:

VLLM_USE_V1=0

Mixtral-8x7B-Instruct-v0.1:

VLLM_USE_V1=1

Mixtral-8x7B-Instruct-v0.1(FP8):

VLLM_USE_V1=1
--quantization fp8
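The per-model settings above can be kept in one small table when scripting the runs. The sketch below only transcribes the values listed in this description; the names `PER_MODEL` and `describe_run` are hypothetical, and the extra argument is passed through exactly as written here without assuming where it belongs on the command line.

```python
import shlex

# Per-model settings transcribed from the list above:
# label -> (VLLM_USE_V1 value, extra arguments exactly as written in this PR).
PER_MODEL = {
    "Llama-4-Maverick-17B-128E-Instruct": ("1", []),
    "Llama-4-Maverick-17B-128E-Instruct-FP8": ("1", []),
    "DeepSeek-V3": ("0", []),
    "Mixtral-8x7B-Instruct-v0.1": ("1", []),
    "Mixtral-8x7B-Instruct-v0.1(FP8)": ("1", ["--quantization", "fp8"]),
}

def describe_run(label):
    """Return the env-var assignment and the extra-argument string for a model."""
    use_v1, extra = PER_MODEL[label]
    return f"VLLM_USE_V1={use_v1}", shlex.join(extra)

for label in PER_MODEL:
    env_assignment, extra_args = describe_run(label)
    print(f"{label}: {env_assignment} {extra_args}".rstrip())
```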

The table below shows the lm_eval results:

Model	vLLM version	Tasks	Version	Filter	n-shot	Metric		Value		Stderr
Llama-4-Maverick-17B-128E-Instruct-BF16	V1	gsm8k	3	flexible-extract	5	exact_match	↑	0.9272	±	0.0072
				strict-match	5	exact_match	↑	0.9272	±	0.0072
Llama-4-Maverick-17B-128E-Instruct-FP8	V1	gsm8k	3	flexible-extract	5	exact_match	↑	0.9234	±	0.0073
				strict-match	5	exact_match	↑	0.9272	±	0.0072
DeepSeek-V3	V0	gsm8k	3	flexible-extract	5	exact_match	↑	0.9454	±	0.0063
				strict-match	5	exact_match	↑	0.9454	±	0.0063
Mixtral-8x7B-Instruct-v0.1	V1	gsm8k	3	flexible-extract	5	exact_match	↑	0.6452	±	0.0132
				strict-match	5	exact_match	↑	0.6429	±	0.0132
Mixtral-8x7B-Instruct-v0.1 (FP8)	V1	gsm8k	3	flexible-extract	5	exact_match	↑	0.5413	±	0.0137
				strict-match	5	exact_match	↑	0.5398	±	0.0137
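The Stderr column can be sanity-checked from the Value column. The sketch below assumes the reported stderr is the usual standard error of the mean of per-problem 0/1 scores over the 1,319 problems in the GSM8K test split; the helper name `exact_match_stderr` is hypothetical and this is not lm_eval's own code.

```python
import math

GSM8K_TEST_SIZE = 1319  # number of problems in the GSM8K test split

def exact_match_stderr(accuracy, n=GSM8K_TEST_SIZE):
    """Standard error of the mean of n 0/1 scores with the given accuracy."""
    # Sample variance of a 0/1 variable (n - 1 in the denominator).
    variance = accuracy * (1.0 - accuracy) * n / (n - 1)
    return math.sqrt(variance / n)

# Mixtral-8x7B-Instruct-v0.1, flexible-extract row.
print(round(exact_match_stderr(0.6452), 4))  # → 0.0132
```

A reported stderr that differs from this estimate by an order of magnitude usually points to a transcription slip rather than a real difference in variance.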

This PR is part of a larger effort to integrate AITER kernels into vLLM for improved performance on the ROCm platform.

Co-authored-by: tjtanaa <[email protected]> Signed-off-by: vllmellm <[email protected]>

github-actions · 2025-04-17T03:26:43Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: vllmellm <[email protected]>

hongxiayang · 2025-04-23T14:59:06Z

cc @houseroad This enables AITER kernel CUDA graph mode for Llama 4 models in V1 for performance.

sijiac · 2025-04-24T06:13:23Z