Conversation

PatrykSaffer

@PatrykSaffer PatrykSaffer commented Sep 26, 2025

Purpose

This PR fuses the RoPE application and the MLA KV-cache write into a single kernel, so the rotated keys are written directly into the cache instead of requiring a separate rotation kernel followed by a cache-write pass.
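As a rough illustration of the semantics being fused (a NumPy sketch, not the actual CUDA kernel; the tensor names `kv_c`/`k_pe` for the MLA latent and positional parts, and the slot-mapping layout, are assumptions for this example): the unfused path rotates the positional keys into a temporary buffer and then scatters into the cache in a second pass, while the fused version writes the rotated values into their cache slots in one pass.

```python
import numpy as np

def apply_rope(x, cos, sin):
    """Rotate interleaved pairs: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def unfused_rope_then_write(k_pe, kv_c, cos, sin, cache, slots):
    # Pass 1: rotate the positional keys into a temporary buffer.
    k_rot = apply_rope(k_pe, cos, sin)
    # Pass 2: scatter latent + rotated keys into the paged KV cache.
    d = kv_c.shape[-1]
    cache[slots, :d] = kv_c
    cache[slots, d:] = k_rot

def fused_rope_and_write(k_pe, kv_c, cos, sin, cache, slots):
    # One pass: rotate and write directly into the cache slots,
    # skipping the intermediate k_rot tensor (and, on GPU, a kernel launch).
    d = kv_c.shape[-1]
    cache[slots, :d] = kv_c
    cache[slots, d:] = apply_rope(k_pe, cos, sin)
```

Both functions produce identical cache contents; the fused form only removes the intermediate materialization.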

Test Plan

Added test_rotary_embedding_mla_cache_fused.py.

Test Result

All tests pass.

Benchmark

vllm bench serve --model deepseek-ai/DeepSeek-V3  --dataset-name sharegpt --sharegpt-output-len 100  --port 9020 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --backend vllm

Base

VLLM_ALL2ALL_BACKEND="deepep_low_latency"    vllm serve deepseek-ai/DeepSeek-V3 --trust-remote-code  --data-parallel-size 8 --tensor-parallel-size 1  --enable-expert-parallel --port 9020  --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  19.94     
Total input tokens:                      229418    
Total generated tokens:                  68058     
Request throughput (req/s):              50.14     
Output token throughput (tok/s):         3412.40   
Peak output token throughput (tok/s):    12207.00  
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          14915.33  
---------------Time to First Token----------------
Mean TTFT (ms):                          8796.68   
Median TTFT (ms):                        9997.31   
P99 TTFT (ms):                           14175.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          146.09    
Median TPOT (ms):                        132.08    
P99 TPOT (ms):                           667.27    
---------------Inter-token Latency----------------
Mean ITL (ms):                           118.09    
Median ITL (ms):                         57.81     
P99 ITL (ms):                            2997.47   
==================================================

This Commit

VLLM_ENABLE_FUSED_ROPE_MLA_KV_WRITE=1 VLLM_ALL2ALL_BACKEND="deepep_low_latency"    vllm serve deepseek-ai/DeepSeek-V3 --trust-remote-code  --data-parallel-size 8 --tensor-parallel-size 1  --enable-expert-parallel --port 9020  --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  19.26     
Total input tokens:                      229418    
Total generated tokens:                  68084     
Request throughput (req/s):              51.92     
Output token throughput (tok/s):         3535.06   
Peak output token throughput (tok/s):    12472.00  
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          15446.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          8167.11   
Median TTFT (ms):                        8091.82   
P99 TTFT (ms):                           13437.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          151.83    
Median TPOT (ms):                        111.57    
P99 TPOT (ms):                           923.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           117.96    
Median ITL (ms):                         58.65     
P99 ITL (ms):                            3411.29   
==================================================

Accuracy

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Base

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.98|±  |0.0141|
|     |       |strict-match    |     5|exact_match|↑  | 0.98|±  |0.0141|

This Commit

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.98|±  |0.0141|
|     |       |strict-match    |     5|exact_match|↑  | 0.98|±  |0.0141|

Signed-off-by: Patryk Saffer <[email protected]>
@mergify mergify bot added ci/build v1 tpu Related to Google TPUs labels Sep 26, 2025
@PatrykSaffer PatrykSaffer changed the title fuse rope Fuse RoPE and MLA KV-cache write Sep 29, 2025

mergify bot commented Sep 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PatrykSaffer.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 29, 2025
Patryk999 and others added 2 commits September 29, 2025 15:56
Signed-off-by: Patryk Saffer <[email protected]>
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for this contribution! Fusing these ops is definitely our goal. However, the current integration in this PR is too intrusive to the model definitions and forward code, so we should instead perform the fusion via a torch.compile pass. This work is described in #24678 and currently in progress (#25103, #25954). Is it OK if this PR just adds the fused kernel, and we integrate it with the passes in follow-up PRs?
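For context on what such a compile pass does: it pattern-matches the adjacent ops in the captured graph and rewrites them into a single fused call, so model code never has to change. A toy sketch of that rewrite over a flat list of op names (purely illustrative; vLLM's actual pass infrastructure operates on torch.fx graphs, as tracked in the linked issues):

```python
def fuse_rope_kv_write(ops):
    """Replace adjacent ("rope", "kv_cache_write") pairs with one fused op,
    mimicking what a pattern-matching compile pass does on the real graph."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "rope" and ops[i + 1] == "kv_cache_write":
            fused.append("fused_rope_kv_write")
            i += 2  # consume both matched ops
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

The benefit of the pass-based approach is exactly what the comment above argues for: the fused kernel stays an implementation detail of compilation rather than leaking into every model's forward code.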


mergify bot commented Oct 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PatrykSaffer.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 1, 2025