Conversation
@kliuae kliuae commented Apr 16, 2025

This PR enables AITER's tkw1 quantized MoE kernel to improve the inference performance of compressed-tensors Llama4 models quantized with FP8. We have also revamped AITER's MoE kernel dispatching to automatically choose the suitable AITER Fused MoE kernel without needing flags for kernel selection. Users only need to specify
VLLM_ROCM_USE_AITER=1 and VLLM_ROCM_USE_AITER_MOE=1 to activate AITER's MoE kernels, and the VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE flag is removed.

Note: torch.compile isn't supported in this PR yet; the performance numbers below were obtained with V1 eager mode. Enabling V1 torch.compile for the AITER MoE kernels will be addressed in a separate PR.
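For illustration, a minimal sketch of how these flags would be used from Python (the model, prompt, and parallelism settings mirror the benchmarks below; this is not code from the PR):

```python
# Hedged sketch: enable the AITER fused-MoE path before constructing the engine.
# Only the two flags below are needed now; VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE
# is removed by this PR.
import os

os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_MOE"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
          tensor_parallel_size=8,
          max_model_len=8192,
          enforce_eager=True)  # torch.compile support arrives in a follow-up PR

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```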

Llama4 Maverick FP8 throughput benchmarks

Without aiter tkw1
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_LINEAR=0 SAFETENSORS_FAST_GPU=1 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --dataset-name random --input-len 1000 --output-len 1000 -tp 8 --max-model-len 8192 --enforce-eager
Throughput: 6.47 requests/s, 13159.90 total tokens/s, 6468.73 output tokens/s
With aiter tkw1
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1  VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_LINEAR=0 SAFETENSORS_FAST_GPU=1 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --dataset-name random --input-len 1000 --output-len 1000 -tp 8 --max-model-len 8192 --enforce-eager
Throughput: 7.94 requests/s, 16143.27 total tokens/s, 7937.78 output tokens/s

Llama4 Maverick FP8 latency benchmarks

Without aiter tkw1
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_LINEAR=0 python -m vllm.entrypoints.openai.api_server --max-model-len 30000 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -tp 8 --enforce-eager
============ Serving Benchmark Result ============
Successful requests:                     160
Benchmark duration (s):                  150.38
Total input tokens:                      160000
Total generated tokens:                  160000
Request throughput (req/s):              1.06
Output token throughput (tok/s):         1063.96
Total Token throughput (tok/s):          2127.93
---------------Time to First Token----------------
Mean TTFT (ms):                          268.25
Median TTFT (ms):                        153.78
P99 TTFT (ms):                           1199.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.79
Median TPOT (ms):                        29.78
P99 TPOT (ms):                           30.24
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.79
Median ITL (ms):                         29.34
P99 ITL (ms):                            51.17
==================================================
With aiter tkw1
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1  VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_LINEAR=0 python -m vllm.entrypoints.openai.api_server --max-model-len 30000 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -tp 8 --enforce-eager
============ Serving Benchmark Result ============
Successful requests:                     160
Benchmark duration (s):                  117.88
Total input tokens:                      160000
Total generated tokens:                  160000
Request throughput (req/s):              1.36
Output token throughput (tok/s):         1357.26
Total Token throughput (tok/s):          2714.52
---------------Time to First Token----------------
Mean TTFT (ms):                          191.74
Median TTFT (ms):                        138.49
P99 TTFT (ms):                           783.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.37
Median TPOT (ms):                        23.41
P99 TPOT (ms):                           23.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.37
Median ITL (ms):                         23.07
P99 ITL (ms):                            44.99
==================================================

Text Generation Response

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

Without aiter tkw1
Prompt: 'The color of the sky is blue but sometimes it can also be', Generated text: ' red, orange or grey. What is the reason behind the different colors of the sky? – Tiwari Academy Discussion\nThe color of the sky is blue but sometimes it can also be red, orange or grey. What is the reason behind the different colors of the sky?\nThe color of the sky is primarily determined by the scattering of sunlight by the Earth’s atmosphere. The most common color we see is blue, and this is due to a phenomenon called Rayleigh scattering. Here’s why the sky appears blue and how other colors can manifest under different conditions:\n1. Blue Sky (Rayleigh Scattering):\n– During the daytime when the sun'
Prompt: 'The capital of France is', Generated text: ' Paris. It is a major European city and a global center for art, fashion, and culture. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and romantic atmosphere. Paris is a popular tourist destination and is often referred to as the "City of Light" due to its role in the Enlightenment and its many famous intellectuals and artists. \nParis, the capital of France, is a city steeped in history, art, and culture. It is one of the most visited cities in the world,'
Prompt: 'What is batch inference?', Generated text: ' - Azure Machine Learning | Microsoft Learn Skip to main content \nWhat is batch inference?\nBatch inference, or batch scoring, is the process of generating predictions on a batch of observations. Batch inference or batch scoring is a common pattern for machine learning (ML) models in production environments. The batch inference process can be run on a recurring schedule or on-demand.\nBatch inference is a key component of an end-to-end ML solution. An end-to-end ML solution typically requires:\n0. Data preparation and preprocessing\n1. Model training\n2. Model evaluation\n3. Model deployment\n4. Batch inference\n5. Monitoring and retraining\n'

With aiter tkw1
Prompt: 'The color of the sky is blue but sometimes it can also be', Generated text: ' red, orange, or violet. What is the reason behind the different colors of the sky? – Tiwari Academy Discussion\nThe color of the sky is blue but sometimes it can also be red, orange, or violet. What is the reason behind the different colors of the sky?\nThe color of the sky is primarily determined by the scattering of sunlight by the Earth’s atmosphere. The most common color we see is blue because blue light is scattered more than other colors by the molecules and small particles in the atmosphere. However, during sunrise and sunset, the sky can appear red, orange, or violet due to the following reasons:\n1'
Prompt: 'The capital of France is', Generated text: ' Paris. It is a major European city and a global center for art, fashion, and culture. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and romantic atmosphere. Paris is a popular tourist destination and is often referred to as the “City of Light” due to its role in the Enlightenment and its many famous intellectuals and artists. The city is divided into 20 arrondissements, or districts, each with its own unique character and charm. Paris is a must-visit destination for anyone interested'
Prompt: 'What is batch inference?', Generated text: " - Azure Machine Learning | Microsoft Learn Skip to main content \nWhat is batch inference?\nBatch inference, or batch scoring, is the process of generating predictions on a batch of observations. Batch inference or batch scoring is a common pattern for models that are trained offline. Batch inference can be used for both tabular data and unstructured data like images or text.\nBatch inference is typically used for offline scoring where the response time isn't critical. For example, a model that predicts energy demand for a utility company can be used to make predictions every hour, as the demand forecast is required only once per hour. In contrast, online inference is used for real-time scoring"

lm_eval Results

V1 without aiter, eager mode
vllm (pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=4,max_model_len=30000,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.9272 ± 0.0072
strict-match 5 exact_match 0.9295 ± 0.0071

V1 with aiter, eager mode
vllm (pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=4,max_model_len=30000,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.9227 ± 0.0074
strict-match 5 exact_match 0.9272 ± 0.0072

Reduce complexity of selecting AITER Fused MoE kernel

As the number of AITER flags has grown, we have revamped the conditions for picking the AITER Fused MoE kernel so that no kernel-selection flags are needed, and the VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE flag is removed. Users only need to specify VLLM_ROCM_USE_AITER=1 and VLLM_ROCM_USE_AITER_MOE=1.
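For reference, the flag-free selection can be pictured as something like the sketch below; the function and kernel names are illustrative, not the exact identifiers in this PR. The kernel is chosen from the layer's quantization properties rather than from per-kernel environment variables.

```python
def select_aiter_fused_moe_kernel(use_fp8_w8a8: bool,
                                  block_shape,                    # e.g. [128, 128] for block-scaled FP8
                                  per_channel_quant: bool,
                                  apply_router_weight_on_input: bool) -> str:
    # Illustrative dispatch: pick an AITER fused-MoE kernel from quantization
    # properties instead of dedicated environment flags.
    if use_fp8_w8a8 and block_shape is not None:
        return "fused_moe_fp8_block_scaled"   # e.g. DeepSeek-V3 style block-scaled weights
    if use_fp8_w8a8 and per_channel_quant and apply_router_weight_on_input:
        return "asm_moe_tkw1"                 # e.g. compressed-tensors FP8 Llama4
    return "fused_moe"                        # default AITER fused MoE kernel
```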
We have validated the code paths of other models with the latest AITER fused MoE selection logic:

mistralai_Mixtral-8x7B-Instruct-v0.1_V0

vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=1,max_model_len=30000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.6399 ± 0.0132
strict-match 5 exact_match 0.5216 ± 0.0138

mistralai_Mixtral-8x7B-Instruct-v0.1_FP8_V0

vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=1,max_model_len=30000,quantization=fp8,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.6111 ± 0.0134
strict-match 5 exact_match 0.4769 ± 0.0138

deepseek-ai_DeepSeek-V3

vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=30000,gpu_memory_utilization=0.8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.9492 ± 0.006
strict-match 5 exact_match 0.9500 ± 0.006


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Apr 16, 2025
@kliuae kliuae force-pushed the llama4-fp8-aiter branch from 88e60fb to 6659b99 on April 16, 2025 17:57
Signed-off-by: kliuae <[email protected]>
tjtanaa and others added 2 commits April 16, 2025 19:23
Co-authored-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
vllm/envs.py Outdated
VLLM_ROCM_USE_AITER_LINEAR: bool = True
VLLM_ROCM_USE_AITER_MOE: bool = True
VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE: bool = False
VLLM_ROCM_USE_AITER_FP8_CHANNEL_SCALED_MOE: bool = False
Collaborator

Can we make the env name align more closely with the kernel name, in this case by including tkw1 in the name?



def is_rocm_aiter_channel_scaled_moe_enabled() -> bool:
    return is_rocm_aiter_moe_enabled() and \
        envs.VLLM_ROCM_USE_AITER_FP8_CHANNEL_SCALED_MOE
Collaborator

Does this tkw1 enablement need to depend on is_rocm_aiter_moe_enabled()?

Contributor Author

In this enablement we follow the block_scaled_moe case, using VLLM_ROCM_USE_AITER_MOE as the master switch for enabling MoE ops, to stay consistent with the other AITER kernels.
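In other words, the gating pattern looks roughly like the sketch below, where the generic MoE switch acts as a master switch that feature-specific checks build on (the helper body and the platform check are assumed for illustration, not quoted from the diff):

```python
import vllm.envs as envs
from vllm.platforms import current_platform


def is_rocm_aiter_moe_enabled() -> bool:
    # Master switch: the global AITER flag and the MoE flag must both be set.
    return current_platform.is_rocm() \
        and envs.VLLM_ROCM_USE_AITER \
        and envs.VLLM_ROCM_USE_AITER_MOE
```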

Comment on lines 43 to 48
if activation_str == "silu":
    activation = ActivationType.Silu
elif activation_str == "gelu":
    activation = ActivationType.Gelu
else:
    activation = ActivationType.Silu
Collaborator

Can this be simplified to a one-liner?

Suggested change
if activation_str == "silu":
    activation = ActivationType.Silu
elif activation_str == "gelu":
    activation = ActivationType.Gelu
else:
    activation = ActivationType.Silu
activation = ActivationType.Gelu if activation_str == "gelu" else ActivationType.Silu

Contributor

Do we need an additional wrapper for the _tkw1 kernel, given that it's just a kernel call plus an activation type conversion? The activation type conversion could also be used by other branches / kernel calls.

Contributor Author

We are wrapping the kernel call because our upcoming PR enabling torch.compile for the AITER MoE kernels will use wrappers to register the AITER ops, so we thought to leave it here for now.
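To illustrate why a thin wrapper is useful later (a hedged sketch, not the registration code the follow-up PR will use; the op name, signature, and body are hypothetical):

```python
import torch


# Registering the wrapper as a custom op lets torch.compile treat the AITER
# kernel as an opaque call instead of tracing into it.
@torch.library.custom_op("rocm_aiter::fused_moe_tkw1", mutates_args=())
def fused_moe_tkw1(hidden_states: torch.Tensor,
                   w1: torch.Tensor,
                   w2: torch.Tensor,
                   topk_weights: torch.Tensor,
                   topk_ids: torch.Tensor) -> torch.Tensor:
    # The real wrapper would call aiter's asm_moe_tkw1 here; a pass-through
    # stands in so the sketch stays self-contained.
    return hidden_states.clone()


@fused_moe_tkw1.register_fake
def _(hidden_states, w1, w2, topk_weights, topk_ids):
    # Shape/dtype-only "meta" implementation used during tracing.
    return torch.empty_like(hidden_states)
```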

Comment on lines 148 to 150
# # All AITER Fused MoE kernels are expecting the following datatypes
# topk_weights = topk_weights.to(torch.float32)
# topk_ids = topk_ids.to(torch.int32)
Collaborator

Suggested change
# # All AITER Fused MoE kernels are expecting the following datatypes
# topk_weights = topk_weights.to(torch.float32)
# topk_ids = topk_ids.to(torch.int32)

@hongxiayang hongxiayang added the rocm Related to AMD ROCm label Apr 16, 2025
# topk_weights = topk_weights.to(torch.float32)
# topk_ids = topk_ids.to(torch.int32)

return rocm_aiter_asm_moe_tkw1(hidden_states,
Contributor

Let's assert apply_router_weight_on_input=True, or add an if-branch check, when calling the _tkw1 kernel. By the way, we should have some comments to illustrate the difference between the _tkw1 kernel and the other AITER kernels: the difference is whether topk_weights are applied on the output of the first GEMM or of the second GEMM.
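A minimal sketch of what this suggestion could look like (illustrative only, not the code merged in this PR; the helper name and placeholder return are assumptions):

```python
def _tkw1_moe(hidden_states, apply_router_weight_on_input: bool):
    # tkw1 applies the top-k router weights around the first GEMM (hence
    # apply_router_weight_on_input=True), whereas the other AITER fused-MoE
    # kernels apply them on the output of the second GEMM.
    assert apply_router_weight_on_input, (
        "asm_moe_tkw1 only supports apply_router_weight_on_input=True")
    # The real code would call rocm_aiter_asm_moe_tkw1(hidden_states, ...) here.
    return hidden_states
```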

        and layer.activation == "silu" and layer.expert_map is None):
    return CompressedTensorsW8A8Fp8MoECutlassMethod(quant_config)
elif quant_config._is_fp8_w8a8(weight_quant, input_quant):
    if is_rocm_aiter_channel_scaled_moe_enabled():
Contributor

tkw1 is not general support for FP8 fused-MoE channel / rowwise scaling; it only supports the case where apply_router_weight_on_input=True.

kliuae and others added 4 commits April 17, 2025 09:55
Signed-off-by: kliuae <[email protected]>
…E_AITER_FP8_BLOCK_SCALED_MOE and VLLM_ROCM_USE_AITER_FP8_TKW1_MOE

Co-authored-by: kliuae <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
…E_AITER_FP8_BLOCK_SCALED_MOE and VLLM_ROCM_USE_AITER_FP8_TKW1_MOE

Co-authored-by: kliuae <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>

mergify bot commented Apr 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kliuae.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 17, 2025
tjtanaa and others added 3 commits April 17, 2025 19:12
@mergify mergify bot removed the needs-rebase label Apr 18, 2025
)

if envs.VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE and use_fp8_w8a8:
    # TODO: verify this code path for DeepSeekV3
Collaborator

Can we verify before landing?

Contributor

Verified: Will remove the comment.

2025-04-18:10:35:16 INFO [loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=30000,gpu_memory_utilization=0.8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
gsm8k 3 flexible-extract 5 exact_match 0.9492 ± 0.006
strict-match 5 exact_match 0.9500 ± 0.006

Signed-off-by: tjtanaa <[email protected]>
Contributor

@SageMoore SageMoore left a comment

Looks reasonable. Just a few nits.

layer.w2_weight = torch.nn.Parameter(shuffled_w2,
                                     requires_grad=False)

if self.use_rocm_aiter_moe:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you merge these into one if statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Thanks for pointing this out.

is_rocm_aiter_moe_enabled)

# Property to determine if AITER is used
self.use_rocm_aiter_moe = is_rocm_aiter_moe_enabled()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do you need to store this in the class? It doesn't look like you are using it outside of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Updated this along with the merged if statement.
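Roughly, the two nits combine into something like the sketch below (hedged: the import path, shuffle_weights helper, and method name are assumed from the snippets above, not copied from the merged diff):

```python
import torch
from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import (  # path assumed
    is_rocm_aiter_moe_enabled, shuffle_weights)


def process_weights_after_loading(layer: torch.nn.Module) -> None:
    if is_rocm_aiter_moe_enabled():
        # Check the helper directly instead of caching it on the instance,
        # and shuffle both expert weight tensors under a single condition.
        shuffled_w13, shuffled_w2 = shuffle_weights(layer.w13_weight.data,
                                                    layer.w2_weight.data)
        layer.w13_weight = torch.nn.Parameter(shuffled_w13, requires_grad=False)
        layer.w2_weight = torch.nn.Parameter(shuffled_w2, requires_grad=False)
```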

@hongxiayang hongxiayang added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 21, 2025
@vllm-bot vllm-bot merged commit 5b794ca into vllm-project:main Apr 22, 2025
60 of 63 checks passed
frieda-huang pushed a commit to frieda-huang/vllm that referenced this pull request Apr 23, 2025
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Signed-off-by: Frieda (Jingying) Huang <[email protected]>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: vllmellm <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: vllmellm <[email protected]>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@tjtanaa tjtanaa deleted the llama4-fp8-aiter branch May 16, 2025 16:29