Conversation

PatrykSaffer

@PatrykSaffer PatrykSaffer commented Sep 26, 2025

Purpose

This PR fuses the RoPE application and the MLA KV-cache write into a single kernel, so the rotated keys are written directly into the cache instead of requiring a separate rotation kernel followed by a cache-write pass.
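As a rough illustration of the semantics being fused (a NumPy sketch, not the actual CUDA kernel; the tensor names `kv_c`/`k_pe` for the MLA latent and positional parts, and the slot-mapping layout, are assumptions for this example): the unfused path rotates the positional keys into a temporary buffer and then scatters into the cache in a second pass, while the fused version writes the rotated values into their cache slots in one pass.

```python
import numpy as np

def apply_rope(x, cos, sin):
    """Rotate interleaved pairs: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def unfused_rope_then_write(k_pe, kv_c, cos, sin, cache, slots):
    # Pass 1: rotate the positional keys into a temporary buffer.
    k_rot = apply_rope(k_pe, cos, sin)
    # Pass 2: scatter latent + rotated keys into the paged KV cache.
    d = kv_c.shape[-1]
    cache[slots, :d] = kv_c
    cache[slots, d:] = k_rot

def fused_rope_and_write(k_pe, kv_c, cos, sin, cache, slots):
    # One pass: rotate and write directly into the cache slots,
    # skipping the intermediate k_rot tensor (and, on GPU, a kernel launch).
    d = kv_c.shape[-1]
    cache[slots, :d] = kv_c
    cache[slots, d:] = apply_rope(k_pe, cos, sin)
```

Both functions produce identical cache contents; the fused form only removes the intermediate materialization.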

Test Plan

Added test_rotary_embedding_mla_cache_fused.py.

Test Result

All tests pass.

Benchmark

vllm bench serve --model deepseek-ai/DeepSeek-V3  --dataset-name sharegpt --sharegpt-output-len 100  --port 9020 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --backend vllm

Base

VLLM_ALL2ALL_BACKEND="deepep_low_latency"    vllm serve deepseek-ai/DeepSeek-V3 --trust-remote-code  --data-parallel-size 8 --tensor-parallel-size 1  --enable-expert-parallel --port 9020  --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  19.94     
Total input tokens:                      229418    
Total generated tokens:                  68058     
Request throughput (req/s):              50.14     
Output token throughput (tok/s):         3412.40   
Peak output token throughput (tok/s):    12207.00  
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          14915.33  
---------------Time to First Token----------------
Mean TTFT (ms):                          8796.68   
Median TTFT (ms):                        9997.31   
P99 TTFT (ms):                           14175.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          146.09    
Median TPOT (ms):                        132.08    
P99 TPOT (ms):                           667.27    
---------------Inter-token Latency----------------
Mean ITL (ms):                           118.09    
Median ITL (ms):                         57.81     
P99 ITL (ms):                            2997.47   
==================================================

This Commit

VLLM_ENABLE_FUSED_ROPE_MLA_KV_WRITE=1 VLLM_ALL2ALL_BACKEND="deepep_low_latency"    vllm serve deepseek-ai/DeepSeek-V3 --trust-remote-code  --data-parallel-size 8 --tensor-parallel-size 1  --enable-expert-parallel --port 9020  --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  19.26     
Total input tokens:                      229418    
Total generated tokens:                  68084     
Request throughput (req/s):              51.92     
Output token throughput (tok/s):         3535.06   
Peak output token throughput (tok/s):    12472.00  
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          15446.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          8167.11   
Median TTFT (ms):                        8091.82   
P99 TTFT (ms):                           13437.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          151.83    
Median TPOT (ms):                        111.57    
P99 TPOT (ms):                           923.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           117.96    
Median ITL (ms):                         58.65     
P99 ITL (ms):                            3411.29   
==================================================

Accuracy

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Base

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.98|±  |0.0141|
|     |       |strict-match    |     5|exact_match|↑  | 0.98|±  |0.0141|

This Commit

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.98|±  |0.0141|
|     |       |strict-match    |     5|exact_match|↑  | 0.98|±  |0.0141|

Signed-off-by: Patryk Saffer <[email protected]>
@mergify mergify bot added ci/build v1 tpu Related to Google TPUs labels Sep 26, 2025
@PatrykSaffer PatrykSaffer changed the title fuse rope Fuse RoPE and MLA KV-cache write Sep 29, 2025

mergify bot commented Sep 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PatrykSaffer.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 29, 2025
Patryk999 and others added 2 commits September 29, 2025 15:56
Signed-off-by: Patryk Saffer <[email protected]>
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for this contribution! Fusing these ops is definitely our goal. However, the current integration in this PR is too intrusive to the model definitions and forward code, so we should instead perform the fusion via a torch.compile pass. This work is described in #24678 and currently in progress (#25103, #25954). Is it OK if this PR just adds the fused kernel, and we integrate it with the passes in follow-up PRs?
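For context on what such a compile pass does: it pattern-matches the adjacent ops in the captured graph and rewrites them into a single fused call, so model code never has to change. A toy sketch of that rewrite over a flat list of op names (purely illustrative; vLLM's actual pass infrastructure operates on torch.fx graphs, as tracked in the linked issues):

```python
def fuse_rope_kv_write(ops):
    """Replace adjacent ("rope", "kv_cache_write") pairs with one fused op,
    mimicking what a pattern-matching compile pass does on the real graph."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "rope" and ops[i + 1] == "kv_cache_write":
            fused.append("fused_rope_kv_write")
            i += 2  # consume both matched ops
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

The benefit of the pass-based approach is exactly what the comment above argues for: the fused kernel stays an implementation detail of compilation rather than leaking into every model's forward code.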


mergify bot commented Oct 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PatrykSaffer.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 1, 2025