
Conversation

divakar-amd
Contributor

@divakar-amd divakar-amd commented Aug 12, 2025

Purpose

Replace `torch.bmm` with the aiter `batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant` kernel for MLA.
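The kernel name encodes its quantization scheme: activations (`a8`) are quantized with one scale per token group, weights (`w8`) with one scale per batched tensor. A rough, illustrative NumPy sketch of that scheme (the 8-bit step is simulated with symmetric int8 rounding rather than the FP8 format the kernel actually uses; shapes and group size are arbitrary, and a real fused kernel multiplies the 8-bit operands directly instead of dequantizing first):

```python
import numpy as np

def quant_per_token_group(a, group_size):
    """One scale per (token, channel-group), like the activation path."""
    B, M, K = a.shape
    g = a.reshape(B, M, K // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(g / scale), -127, 127)
    return q, scale  # grouped: [B, M, K/gs, gs] and [B, M, K/gs, 1]

def quant_per_batched_tensor(w):
    """One scale per batch matrix, like the weight path."""
    scale = np.abs(w).max(axis=(1, 2), keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127)
    return q, scale

def quant_bmm(a, w, group_size=128):
    qa, sa = quant_per_token_group(a, group_size)
    qw, sw = quant_per_batched_tensor(w)
    # Dequantize then multiply; a fused kernel would instead multiply
    # the 8-bit operands and fold the scales into the accumulator.
    da = (qa * sa).reshape(a.shape)
    dw = qw * sw
    return da @ dw

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16, 256)).astype(np.float32)
w = rng.standard_normal((8, 256, 64)).astype(np.float32)
ref = a @ w
out = quant_bmm(a, w)
rel_err = np.abs(out - ref).max() / np.abs(ref).max()
assert rel_err < 0.05  # quantized bmm tracks the fp32 reference closely
```

The per-token-group activation scales keep quantization error local to each group of channels, which is why the batched matmul stays close to the fp32 reference even with both operands at 8 bits.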

Test Plan

  • Correctness check
  • Perf check

Test Result

Correctness Result


Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    ' Christian Munoz and\nthis is my blog where I cover different\nIT topics'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' in charge of which branch of government? A. judicial B. legislative C.'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris. Paris is located along the Seine River in the north-central part of the'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    " a fascinating and rapidly evolving field. Here's a glimpse into some key areas shaping"
------------------------------------------------------------

Performance test on DeepSeek-R1 with full-cudagraph capture mode

REQUEST_RATES=(1 5 7 9)
TOTAL_SECONDS=20
TP=8
OUTPUT_LEN=128

DATASET_PATH="ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json"
PYTHON_BENCH_SCRIPT="benchmarks/benchmark_serving.py"

# Launch the server (in one shell):
VLLM_USE_V1=1 vllm serve $MODEL \
    --port 8004 \
    --tensor-parallel-size $TP \
    --max-num-seqs 256 \
    --no-enable-prefix-caching \
    --swap-space 16 \
    --disable-log-requests \
    --disable-uvicorn-access-log \
    --block-size 1 \
    -O '{"full_cuda_graph":true}'

# Run the benchmark once per request rate (in another shell):
for REQUEST_RATE in "${REQUEST_RATES[@]}"; do
    python3 $PYTHON_BENCH_SCRIPT \
        --model $MODEL \
        --percentile-metrics ttft,tpot,itl,e2el \
        --dataset-path $DATASET_PATH \
        --request-rate $REQUEST_RATE \
        --num-prompts $(($TOTAL_SECONDS * $REQUEST_RATE)) \
        --ignore-eos \
        --port 8004 \
        --sharegpt-output-len $OUTPUT_LEN
done
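The `--num-prompts $(($TOTAL_SECONDS * $REQUEST_RATE))` arithmetic sizes each run so that, at the target request rate, the workload spans roughly `TOTAL_SECONDS` of arrivals:

```python
TOTAL_SECONDS = 20
REQUEST_RATES = [1, 5, 7, 9]

# Number of prompts sent per benchmark run, one entry per request rate.
num_prompts = [TOTAL_SECONDS * rate for rate in REQUEST_RATES]
print(num_prompts)  # [20, 100, 140, 180]
```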

(Figure: resultPlot_basecudagraph-vs-aiterbmm_cudagraph_Median — median latency metrics, base full-cudagraph vs. aiter BMM full-cudagraph)

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added rocm Related to AMD ROCm v1 labels Aug 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Triton FP8 BMM kernel for MLA on ROCm to enhance performance. The changes include adding an environment variable to toggle this feature and updating the MLA attention implementation to leverage the new kernel for matrix multiplications. The implementation also involves quantizing weights to FP8 and includes a warmup loop for the Triton kernel. My review has identified two high-severity issues: the use of a hardcoded FP8 dtype instead of a platform-specific one, and a hardcoded warmup range for the Triton kernel that is insufficient for default configurations, potentially leading to performance degradation.
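The first issue the bot flags matters because FP8 is not one format: MI300-class (gfx942) ROCm devices use the `float8_e4m3fnuz` encoding (no negative zero, a different NaN convention), while NVIDIA GPUs use the OCP `float8_e4m3fn` format, so a hardcoded dtype breaks one platform or the other. A hypothetical helper illustrating platform-based selection (the function and parameter names are assumptions for illustration, not the PR's actual code; vLLM centralizes this choice in its platform layer):

```python
def pick_fp8_dtype(is_rocm: bool, is_fnuz_arch: bool = True) -> str:
    """Return the torch dtype name to use for FP8 tensors.

    MI300-class ROCm parts use the fnuz e4m3 variant; NVIDIA (and
    code paths targeting the OCP spec) use float8_e4m3fn.
    """
    if is_rocm and is_fnuz_arch:
        return "float8_e4m3fnuz"
    return "float8_e4m3fn"

print(pick_fp8_dtype(is_rocm=True))   # float8_e4m3fnuz
print(pick_fp8_dtype(is_rocm=False))  # float8_e4m3fn
```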

k50112113 and others added 6 commits August 12, 2025 20:21
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
mxz297 and others added 10 commits August 20, 2025 09:38
Signed-off-by: Xiaozhu <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: RUTHLESS-BOT <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Goin <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models structured-output speculative-decoding labels Aug 20, 2025
@mergify mergify bot added tpu Related to Google TPUs tool-calling labels Aug 20, 2025

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @divakar-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@divakar-amd
Contributor Author

Re-created another PR with some updates: #23264
