
Conversation

@tdoublep (Member) commented Sep 11, 2025

Purpose

This PR changes the way that the mamba2 kernels split the batch into "chunks". The change ensures that (a) no chunk ever contains more than one sequence, and (b) all intermediate states are computed at the chunk boundaries within each sequence.

This change is useful for three reasons:

  1. It dramatically simplifies the kernels due to (a).
  2. It enables a much easier implementation of prefix caching for mamba, due to (b).
  3. It can improve performance, even without prefix caching, because we can entirely skip the final call to the "varlen" kernel that is used to align the final states for each sequence.

The downside is that it introduces some "virtual" padding inside the chunks. We don't actually pad anything in GPU memory; we just potentially need to use a larger grid when launching kernels and may do some redundant compute. However, this padding is bounded to at most one chunk per sequence, and my initial experiments suggest the overhead is small. In fact, we see a significant speedup overall because we skip the call to the final "varlen" kernel. We follow a very similar approach for handling varlen batches in the Triton attention kernels, so this kind of technique is not without precedent.

TODO:


A simple example for two sequences A and B is shown below. A0 and B0 represent the chunks that were prefilled at the previous step, and A1 and B1 are the new chunks we want to prefill in this iteration.

[Figure: chunk layout for sequences A and B, showing how the new tokens A1 and B1 are split into chunk-aligned pieces (e.g. A1.a, A1.c), with grey regions indicating the "virtual" padding within chunks.]

The idea is that, for sequence A, we first take enough tokens from the new part (A1) to ensure that, when taken together with the precomputed part (A0), the state is chunk-aligned. Then we fill chunks with new tokens (from A1) until we run out, at which point we pad to the chunk boundary. Then we repeat for B.
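
Here is a minimal sketch of that partitioning logic for a single sequence (the function name and structure are hypothetical, not the actual vLLM code):

```python
# Hypothetical sketch of the chunk-aligned partitioning described above.
# Illustrative only; not the actual vLLM kernel metadata code.

def partition_new_tokens(num_precomputed: int, num_new: int, chunk_size: int) -> list[int]:
    """Split one sequence's new tokens into per-chunk counts so that every
    chunk boundary lands on the sequence's own chunk grid."""
    chunks = []
    # First, take enough new tokens to complete the partially filled chunk
    # left over from the precomputed part (A0), if any.
    remainder = num_precomputed % chunk_size
    if remainder != 0:
        first = min(chunk_size - remainder, num_new)
        chunks.append(first)
        num_new -= first
    # Then fill whole chunks with new tokens; the final chunk may be partial,
    # in which case its trailing slots are "virtually" padded (masked) at launch.
    while num_new > 0:
        take = min(chunk_size, num_new)
        chunks.append(take)
        num_new -= take
    return chunks

# Example: chunk_size=256, sequence A has 100 precomputed tokens and 600 new ones.
# -> [156, 256, 188]: the first chunk aligns the state to a 256-token boundary,
#    and the last chunk computes 188 real tokens and masks the remaining 68 slots.
print(partition_new_tokens(100, 600, 256))
```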

Test Plan

See correctness + benchmarking below.

Test Result

See correctness + benchmarking below.



@mergify mergify bot added the v1 label Sep 11, 2025
@tdoublep (Member Author) commented:

Server:

vllm serve ibm-granite/granite-4.0-tiny-preview --enforce-eager

Client:

lm_eval --model local-completions --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500     \
    --model_args model=ibm-granite/granite-4.0-tiny-preview,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False

Results (main):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.608|±  |0.0219|
|     |       |strict-match    |     5|exact_match|↑  |0.584|±  |0.0221|

Results (tpa-aligned-mamba):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.616|±  |0.0218|
|     |       |strict-match    |     5|exact_match|↑  |0.590|±  |0.0220|


@tdoublep (Member Author) commented:

Server:

vllm serve ibm-granite/granite-4.0-tiny-preview

Benchmark:

vllm bench serve \
        --model ibm-granite/granite-4.0-tiny-preview \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --ignore_eos

Branch main (second run):

============ Serving Benchmark Result ============
Successful requests:                     983       
Benchmark duration (s):                  32.64     
Total input tokens:                      235252    
Total generated tokens:                  222931    
Request throughput (req/s):              30.12     
Output token throughput (tok/s):         6830.50   
Total Token throughput (tok/s):          14038.50  
---------------Time to First Token----------------
Mean TTFT (ms):                          5419.54   
Median TTFT (ms):                        5404.06   
P99 TTFT (ms):                           9049.08   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          111.52    
Median TPOT (ms):                        72.42     
P99 TPOT (ms):                           303.04    
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.58     
Median ITL (ms):                         35.74     
P99 ITL (ms):                            245.94    
==================================================

Branch tpa-aligned-mamba (second run):

============ Serving Benchmark Result ============
Successful requests:                     983       
Benchmark duration (s):                  32.34     
Total input tokens:                      233074    
Total generated tokens:                  223781    
Request throughput (req/s):              30.39     
Output token throughput (tok/s):         6918.85   
Total Token throughput (tok/s):          14125.03  
---------------Time to First Token----------------
Mean TTFT (ms):                          4103.87   
Median TTFT (ms):                        4083.23   
P99 TTFT (ms):                           7084.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          105.28    
Median TPOT (ms):                        83.67     
P99 TPOT (ms):                           248.41    
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.46     
Median ITL (ms):                         36.00     
P99 ITL (ms):                            320.39    
==================================================

@tdoublep added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 26, 2025
@tdoublep (Member Author) commented:

More benchmarking data, this time for NVIDIA-Nemotron-Nano-12B-v2.

Server

vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2 

Client

vllm bench serve \
    --model nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --ignore_eos

Results from main:

============ Serving Benchmark Result ============
Successful requests:                     983       
Benchmark duration (s):                  54.60     
Total input tokens:                      218758    
Total generated tokens:                  201157    
Request throughput (req/s):              18.00     
Output token throughput (tok/s):         3684.31   
Peak output token throughput (tok/s):    7049.00   
Peak concurrent requests:                983.00    
Total Token throughput (tok/s):          7691.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          13463.72  
Median TTFT (ms):                        9844.10   
P99 TTFT (ms):                           35302.82  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          145.05    
Median TPOT (ms):                        109.78    
P99 TPOT (ms):                           698.06    
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.24     
Median ITL (ms):                         70.37     
P99 ITL (ms):                            467.66    
==================================================

Results from tpa-aligned-mamba:

============ Serving Benchmark Result ============
Successful requests:                     983       
Benchmark duration (s):                  47.08     
Total input tokens:                      219876    
Total generated tokens:                  200385    
Request throughput (req/s):              20.88     
Output token throughput (tok/s):         4256.43   
Peak output token throughput (tok/s):    6920.00   
Peak concurrent requests:                983.00    
Total Token throughput (tok/s):          8926.88   
---------------Time to First Token----------------
Mean TTFT (ms):                          9829.15   
Median TTFT (ms):                        6547.68   
P99 TTFT (ms):                           27941.63  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          113.15    
Median TPOT (ms):                        92.88     
P99 TPOT (ms):                           444.97    
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.63     
Median ITL (ms):                         66.86     
P99 ITL (ms):                            399.50    
==================================================

@tomeras91 (Contributor) left a comment:

Looks great overall! Really simplifies the code.

I added a few nit comments about the need for comments and documenting expected shapes.

I also think it's worth adding a general comment somewhere that in this implementation we're assuming each chunk has only a single sequence, since this is a significant change from the original implementation.

@tlrmchlsmth tlrmchlsmth self-assigned this Sep 29, 2025
@tomeras91 (Contributor) left a comment:

Thanks @tdoublep. LGTM now. The descriptions and shapes really help.

@tlrmchlsmth (Member) left a comment:

PR looks great at first pass. Love to see more red than green.

@tlrmchlsmth (Member) commented Sep 29, 2025:

In the figure in the PR description, why does A1.a fall at the beginning of the chunk rather than the end? I thought A0 should be ahead of it rather than behind

@tdoublep (Member Author) commented Sep 29, 2025:

> In the figure in the PR description, why does A1.a fall at the beginning of the chunk rather than the end? I thought A0 should be ahead of it rather than behind

@tlrmchlsmth A0 isn't actually added to the chunk, it has already been prefilled and doesn't need to be computed again. We just need to partition A1 in such a way that len(A0)+len(A1.a)=chunk_size so that the intermediate states we get at the output of the first chunk correspond to the actual chunk boundaries within the sequence. That's why the part of the chunk after A1.a is grey to indicate that it gets padded (not actually padding in memory, only compute).
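
To make the alignment arithmetic concrete, here is a toy example (the numbers and names are illustrative only, not taken from the runs above):

```python
# Illustrative only: len(A1.a) is chosen so that A0 plus A1.a ends exactly on a
# chunk boundary of the sequence. The trailing len(A0) slots of that launched
# chunk carry no real tokens and are masked out (the "grey" region in the figure).
chunk_size = 256
len_A0 = 96                             # tokens prefilled in a previous step (assumed < chunk_size here)
len_A1_a = chunk_size - len_A0          # 160 new tokens go into the first chunk
assert len_A0 + len_A1_a == chunk_size  # state after A1.a is chunk-aligned
masked_slots = chunk_size - len_A1_a    # == len_A0 slots masked out in that chunk
```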

@tlrmchlsmth (Member) commented:

Do the padded regions get loaded at all?

> In the figure in the PR description, why does A1.a fall at the beginning of the chunk rather than the end? I thought A0 should be ahead of it rather than behind

> @tlrmchlsmth A0 isn't actually added to the chunk, it has already been prefilled and doesn't need to be computed again. We just need to partition A1 in such a way that len(A0)+len(A1.a)=chunk_size so that the intermediate states we get at the output of the first chunk correspond to the actual chunk boundaries within the sequence. That's why the part of the chunk after A1.a is grey to indicate that it gets padded (not actually padding in memory, only compute).

Makes sense. So then the A0-sized padded region could overlap with another chunk, or it could fall off the end of the KV cache tensor, right? Do we mask off the loads of the padded region as well?

@tdoublep (Member Author) commented Sep 29, 2025:

> Do the padded regions get loaded at all?

No, padding is maybe the wrong word. There isn't any actual padding of tensors in memory here.

Masking would probably be a better word. If we have 5 chunks like in the above example, we would launch a Triton kernel with a grid size of (5, ...): in the first chunk we mask out the last len(A0) slots, in the second chunk we mask out nothing, in the third chunk we mask out chunk_size - len(A1.c) slots, and so on.

We are basically trading off a bit of extra compute in order to get intermediate states exactly where we want them within each sequence. In practice it isn't really a trade-off: it strips out so much complexity that it's a net win.
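
As a rough illustration of what that per-chunk masking amounts to (a hypothetical sketch in plain NumPy, not the actual Triton kernel; the counts are made-up numbers):

```python
import numpy as np

# Hypothetical sketch of per-chunk token masks for a 5-chunk launch like the example above.
chunk_size = 8
# Number of real tokens in each launched chunk; the remaining slots are masked out.
valid_tokens_per_chunk = np.array([5, 8, 3, 8, 6])

slot = np.arange(chunk_size)  # token slot index within a chunk
# mask[c, i] is True iff slot i of chunk c holds a real token. Masked slots are
# skipped on load/store; they only cost a little redundant work in the launched grid.
mask = slot[None, :] < valid_tokens_per_chunk[:, None]
print(mask.astype(int))
```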

@tdoublep (Member Author) commented:

> So then the A0-sized padded region could overlap with another chunk, or it could fall off the end of the KV cache tensor, right?

Yes. If we didn't introduce the padding/masking, it would lead to (a) having multiple sequences within the same chunk and (b) needing the whole mapping between "logical" and "physical" chunks to track where everything is.

> Do we mask off the loads of the padded region as well?

Yes, we mask off the loads exactly (example: https://github.com/tdoublep/vllm/blob/tpa-aligned-mamba/vllm/model_executor/layers/mamba/ops/ssd_chunk_scan.py#L231)

@tdoublep merged commit fea3e47 into vllm-project:main Sep 29, 2025
52 checks passed