
Conversation

@s3woz (Contributor) commented on Sep 26, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model

Purpose

This PR implements Automatic Prefix Caching (APC) for Mamba2 hybrid models.
Logic before this PR:

  • The Mamba2 implementation uses a single cache block for the "current state", updated in place.

This PR introduces APC logic for Mamba2 by storing states at input block boundaries, and resuming the computations from them when the cache is hit. The chart below shows timing results from vllm bench latency --model ibm-granite/granite-4.0-tiny-preview --num-iters 10 --num-iters-warmup 2 for three cases:

  1. vLLM main
  2. This PR with APC off (--no-enable-prefix-caching)
  3. This PR with APC on (--enable-prefix-caching)
[Chart: vllm bench latency timing for the three cases, across prefill lengths and decode lengths (e.g. decode 1, decode 1024)]

As prefill length increases, the APC-on mode provides clear benefits for prefill-dominated cases (e.g. the decode 1 series). For decode-heavy cases (e.g. decode 1024), performance is suboptimal due to how vLLM currently builds the metadata for KVCache groups. Our internal evaluations show that the decode-speed overhead can be eliminated if the additional metadata produced in the APC-on mode is cached across the groups. A general implementation of such functionality is pending in #22788.
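
For intuition about the boundary-aligned caching described above, here is a minimal standalone sketch (not the vLLM implementation; the helper name and block size are illustrative): with states stored only at block boundaries, a cache hit lets prefill resume from the largest boundary it fully covers.

# Conceptual sketch only; not vLLM code.
def resume_point(num_cached_tokens: int, block_size: int) -> int:
    # Mamba2 states are stored at block boundaries, so computation can only
    # resume from the last boundary fully covered by the cache hit.
    return (num_cached_tokens // block_size) * block_size

# e.g. with block_size = 1024, a 2500-token prefix hit resumes prefill at token 2048.
assert resume_point(2500, 1024) == 2048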

Technical considerations:

  • Current vLLM logic: the Cache Manager assumes that page sizes are equal across the different attention implementations. block_size is determined as the smallest attention block size for which the attention page size >= the Mamba2 page size, and Mamba2 state blocks are padded to match.
  • This PR introduces an additional condition: due to Mamba2 kernel specifics, intermediate states can be obtained efficiently during prefill only at mamba_chunk_size boundaries (typically 256 tokens). Thus, block_size is assumed to be a multiple of 256 (see the sketch after this list).
  • To obtain the intermediate Mamba states at mamba_chunk_size boundaries while keeping the Mamba kernels fast, the kernels need to process sequences in a chunk-aligned manner, with chunk boundaries aligned to the absolute sequence length. The kernels are modified accordingly in this PR, pulling in changes from [Kernel] Chunk-aligned mamba2 #24683.
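
A minimal sketch of the block-size constraint above (the helper and byte values are illustrative assumptions, not vLLM's actual selection code):

def pick_block_size(attn_bytes_per_token: int,
                    mamba_page_bytes: int,
                    mamba_chunk_size: int = 256) -> int:
    # Smallest multiple of mamba_chunk_size whose attention page
    # (block_size * per-token KV bytes) covers the Mamba2 state page.
    block_size = mamba_chunk_size
    while block_size * attn_bytes_per_token < mamba_page_bytes:
        block_size += mamba_chunk_size
    return block_size

# e.g. 2 KiB of attention KV per token vs. a 400 KiB Mamba2 state page -> 256
print(pick_block_size(2 * 1024, 400 * 1024))  # 256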

Pending enhancements:

  1. Ensure that Mamba Metadata is properly cached when [Attention] Cache attention metadata builds across hybrid KV-cache groups #22788 is merged.
  2. Potential early memory freeing: for running requests, free all blocks that are no longer needed (adjust remove_skipped_blocks in class MambaManager, and max_memory_usage_bytes).
  3. Currently padding is applied only to mamba cache pages. Depending on scenarios, it might be more memory-efficient to pad attention cache pages instead.
  4. Different strategies could be implemented to choose which states to store. Currently all states are stored: as requests arrive, a number of blocks proportional to the sequence length is allocated, holding every state at a block boundary (where cache hits may occur) plus the current state. Potential future enhancements include sparser strategies (reducing memory usage and I/O), such as a "store only the last state" strategy: allocate just two blocks, the last block-aligned intermediate state (to allow cache hits) and the current state. A rough sketch of both strategies follows this list.
  5. Support speculative decoding.
  6. Remove the overhead of Mamba2 state caching from the PyTorch level by moving it into the Triton kernels.
  7. Extend test cases from E2E tests to dedicated "cache logic" tests.
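
A rough sketch of the allocation strategies mentioned in item 4 (cdiv and the helpers are illustrative, not the MambaManager code):

def cdiv(a: int, b: int) -> int:
    return -(-a // b)  # ceiling division

def blocks_store_all_states(seq_len: int, block_size: int) -> int:
    # One block per block-boundary state (potential cache-hit point),
    # plus one block for the current in-progress state.
    return cdiv(seq_len, block_size) + 1

def blocks_store_last_state_only(seq_len: int, block_size: int) -> int:
    # Only two blocks: the last block-aligned state and the current state.
    return 2

print(blocks_store_all_states(5000, 1024))       # 6
print(blocks_store_last_state_only(5000, 1024))  # 2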

@tdoublep @bohnstingl

Test Plan

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory
import time
MODEL = "ibm-granite/granite-4.0-tiny-preview"
PROMPT_MULTIPLE = 310
sampling_params = SamplingParams(temperature=0.0)
prefix = ( # examples/offline_inference/prefix_caching.py
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. ")
prefix2 = ("Based on these information, fulfill "
            "the following paragraph: ")
prompt = PROMPT_MULTIPLE * prefix + prefix2 + "Hello, my name is"
print('Prompt length:', len(prompt))
for APC in [False, True]:
    engine = LLM(model=MODEL, enable_prefix_caching=APC, 
        gpu_memory_utilization=0.4, disable_log_stats=False)
    for i in range(3):
        if i == 0:
            print('Warm-up')
        if i == 1:
            print('Measuring')
            start_time = time.time()
        outputs = engine.generate(prompt, sampling_params)
        print('APC:', APC, i, f"Generated text: {outputs[0].outputs[0].text!r}")
        for m in engine.llm_engine.get_metrics():
            if 'vllm:prefix_cache_hits' in m.name:
                print(m.name, m.value)
    print('APC:', APC, "loop took --- %s seconds ---" % (time.time() - start_time))
    del engine
    cleanup_dist_env_and_memory()

Test Result

Warm-up
Adding requests: 100%|---------| 1/1 [00:00<00:00,  9.89it/s]
Processed prompts: 100%|------| 1/1 [00:08<00:00,  8.13s/it, est. speed input: 4540.29 toks/s, output: 1.97 toks/s]
APC: False 0 Generated text: ' NAME_1. I am an expert school principal, skilled in effectively managing'
vllm:prefix_cache_hits 0
Measuring
Adding requests: 100%|---------| 1/1 [00:00<00:00, 12.45it/s]
Processed prompts: 100%|----| 1/1 [00:00<00:00,  1.79it/s, est. speed input: 66076.01 toks/s, output: 28.64 toks/s]
APC: False 1 Generated text: ' NAME_1. I am an expert school principal, skilled in effectively managing'
vllm:prefix_cache_hits 0
Adding requests: 100%|------| 1/1 [00:00<00:00, 11.90it/s]
Processed prompts: 100%|----| 1/1 [00:00<00:00,  1.78it/s, est. speed input: 65945.06 toks/s, output: 28.59 toks/s]
APC: False 2 Generated text: ' NAME_1. I am an expert school principal, skilled in effectively managing'
vllm:prefix_cache_hits 0
APC: False loop took --- 1.2919602394104004 seconds ---

Warm-up
Adding requests: 100%|-------------| 1/1 [00:00<00:00,  9.78it/s]
Processed prompts: 100%|----------| 1/1 [00:08<00:00,  8.19s/it, est. speed input: 4505.37 toks/s, output: 1.95 toks/s]
APC: True 0 Generated text: ' NAME_1. I am the candidate for the position of a Math teacher.'
vllm:prefix_cache_hits 0
Measuring
Adding requests: 100%|-------------| 1/1 [00:00<00:00, 11.40it/s]
Processed prompts: 100%|----------| 1/1 [00:00<00:00,  4.72it/s, est. speed input: 174700.34 toks/s, output: 75.73 toks/s]
APC: True 1 Generated text: ' NAME_1, and I am an expert school principal, skilled in effectively'
vllm:prefix_cache_hits 36864
Adding requests: 100%|-------------| 1/1 [00:00<00:00, 11.25it/s]
Processed prompts: 100%|----------| 1/1 [00:00<00:00,  5.86it/s, est. speed input: 217302.08 toks/s, output: 94.19 toks/s]
APC: True 2 Generated text: ' NAME_1, and I am an expert school principal, skilled in effectively'
vllm:prefix_cache_hits 73728
APC: True loop took --- 0.5674521923065186 seconds ---

s3woz and others added 30 commits August 29, 2025 12:35
@tdoublep (Member) left a comment:
Thanks for the big effort and perseverance. I think this is now in a shape where we can merge it. It should already give good speedups for prefill-dominated latency benchmarks.

@s3woz @bohnstingl -- could you please create a new Issue to track the remaining work items?

  • Implement policy for freeing mamba blocks to fix performance in throughput benchmarks
  • Relax constraint that mamba block size must be multiple of chunk size
  • Give user flexibility to set mamba caching granularity
  • Support mamba prefix caching and spec decode
  • Fuse logic for SSM state writing into kernels
  • Test TP>1 behaviour
  • Cache meta-data builds across KV cache groups (#22788)
  • Additional cleanup in causal_conv1d kernels (e.g., strip out unused logic)
  • Enable prefix caching for Mamba1
  • Enable prefix caching for ShortConv
  • Enable prefix caching for LinearAttention
  • Enable prefix caching for GDN

There is quite a bit of interest from the community in helping with these follow-ups, so it would be good to merge this and start parallelizing the work.


mask = (idx_tokens_conv < state_len)[:, None] & \
(idx_feats < dim)[None, :]
tl.debug_barrier() # NOTE: use this due to bug in Triton compiler
Member:
is there an issue we could link to here?

Member:
This is also just duplicated from the equivalent code on main: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/mamba/ops/causal_conv1d.py#L185

I strongly suspect we can remove that, but prefer to do a big cleanup of this kernel as a follow-up.

Contributor:
I removed the debugging statements and haven't seen any negative side effects yet. However, I haven't changed it in this PR; this could be something for a follow-up PR.

stride_istate_dim = 0
stride_istate_token = 0
num_cache_lines = 0
BLOCK_M = 8
Member:
where does this number come from?

Comment on lines 1059 to 1060
initial_state_idx: (batch,), dtype int32
The pointer into cache_indices, which signifies the cache block containing the initial state.
Member:
Is this right?

Suggested change:
- initial_state_idx: (batch,), dtype int32
-     The pointer into cache_indices, which signifies the cache block containing the initial state.
+ initial_state_idx: (batch,), dtype int32
+     The pointer into initial_states, which signifies the cache block containing the initial state.

Contributor:
Thank you for the catch. I think the description and the naming were a bit off. The tensor initial_state_idx indexes into conv_state_indices and thereby points to the location of the initial states. I updated the description there. Please let me know if it makes more sense now.

@tlrmchlsmth (Member) left a comment:
Looks pretty clean -- a lot less invasive than I thought it would be!

KERNEL_WIDTH=width,
SILU_ACTIVATION=activation in ["silu", "swish"],
IS_VARLEN=query_start_loc is not None,
IS_CONTINUOUS_BATCHING=conv_state_indices is not None,
Member:
Do we assume this is always true now?

Member:
Yes. We've massively cleaned up the mamba2 kernels to remove this unused logic, but the causal_conv1d kernel could use another pass through it imo. We can do it as a follow-up.

Member:
(e.g., stuff like IS_VARLEN can also be stripped out)

Contributor:
As @tdoublep mentioned, the kernels have already been cleaned up quite a lot, but they are still not perfect, especially the conv1d kernel. They can be simplified quite a bit, I believe.

@tdoublep merged commit ea507c3 into vllm-project:main on Oct 4, 2025
57 checks passed
@tdoublep (Member) commented on Oct 4, 2025:
Tracking issue for follow-ups: #26201

@heheda12345 (Collaborator) left a comment:
Sorry for my late review. I've added some small comments. Can you address them in a future PR?

# Additional cache-related varaiables:
mamba_block_size = self.kv_cache_spec.block_size
seq_lens_pending = (
torch.roll(common_attn_metadata.query_start_loc, -1, -1) -
Collaborator:
I think we usually do query_start_loc[1:] - query_start_loc[:-1], like

query_lens = query_start_loc[1:] - query_start_loc[:-1]
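
For reference, both formulations give the same per-sequence query lengths; a tiny standalone check (illustrative values only):

import torch

qsl = torch.tensor([0, 5, 9, 16])                # query_start_loc for 3 sequences
roll_way = (torch.roll(qsl, -1, -1) - qsl)[:-1]  # formulation used in this PR
diff_way = qsl[1:] - qsl[:-1]                    # the more common idiom
assert torch.equal(roll_way, diff_way)           # tensor([5, 4, 7]) either way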

Member:
We actually don't even need to compute query_lens

# current_first == current_last if no block crossing occurs, and
# only one state will be stored
# 0th based indexing leads to "-1" -> e.g. 16 computed -> state[15]:
current_last_idx = cdiv(context_lens + seq_lens_pending,
Collaborator:
Suggested change:
- current_last_idx = cdiv(context_lens + seq_lens_pending,
+ current_last_idx = cdiv(common_attn_metadata.seq_lens,

Is this simplification correct?

Member:
Yes

Comment on lines +213 to +214
last_state_idx = \
last_state_idx.clamp(min=0)
Collaborator:
don't need line break here?

state_indices_tensor: torch.Tensor # shape: [batch,]
current_last_idx: torch.Tensor
current_first_idx_p: torch.Tensor
last_state_idx: torch.Tensor
Collaborator:
Can you mark the shape of these tensors? And I think it's better to add some comments. I guess:
last_state_idx -> the chunk id of the last computed token
current_first_idx -> the chunk id of the first scheduled token
current_last_idx -> the chunk id of the last scheduled token
And I prefer the following names (as we are using variable names like num_computed_tokens_cpu, num_scheduled_tokens):
chunk_id_last_computed_token
chunk_id_first_scheduled_token
chunk_id_last_scheduled_token

@tdoublep (Member), Oct 4, 2025:
I like this suggestion - will do (although they are block indexes rather than chunk indices)
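
A hedged sketch of what the rename could look like (field names are illustrative only, using block indices per the note above; not the merged code):

from dataclasses import dataclass
import torch

@dataclass
class Mamba2MetadataSketch:  # illustrative subset of the metadata fields
    state_indices_tensor: torch.Tensor             # [batch]
    block_idx_last_computed_token: torch.Tensor    # [batch]
    block_idx_first_scheduled_token: torch.Tensor  # [batch]
    block_idx_last_scheduled_token: torch.Tensor   # [batch]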

seq_lens_pending = (
torch.roll(common_attn_metadata.query_start_loc, -1, -1) -
common_attn_metadata.query_start_loc)[:-1]
context_lens = common_attn_metadata.seq_lens - \
Collaborator:
is it common_attn_metadata.num_computed_tokens_cpu?

Member:
Yes (except we want it on device)

common_attn_metadata.query_start_loc)[:-1]
context_lens = common_attn_metadata.seq_lens - \
seq_lens_pending
last_computed_offset = \
Collaborator:
do you need line break here?

Member:
No

Comment on lines +697 to +713
# First aligned chunk would typically be:
# mamba_block_size = 1024, chunk_size = 256
# 1024 // 256 - 1 --> chunks[3]
# But when last chunk wasn't block aligned:
# - last_computed_offset_p[seq_idx] // chunk_size
# e.g. 1000 // 256 -> 3 completed --> store chunk[0]
# e.g. 513 // 256 -> 2 completed --> store chunk[1] (skip 1)
# e.g. 256 // 256 -> 1 completed --> store chunk[2] (skip 2)
# e.g. 10 // 256 -> 0 completed --> store chunk[3] (skip 3)
chunk_stride = mamba_block_size // chunk_size
first_aligned_chunk = \
torch.concat([torch.zeros(1, \
dtype=last_chunk_indices_p.dtype, \
device=last_chunk_indices_p.device), \
last_chunk_indices_p + 1])[seq_idx] \
+ chunk_stride - 1 \
- last_computed_offset_p[seq_idx] // chunk_size
Collaborator:
Actually I can't understand this part (and also the comments in 702-705).

Member:
I've re-written the code in a way that I believe is easier to understand what is happening. We plan to fuse this code into the kernels at a later stage but I would like to make sure the code on main right now is easy to follow.

# e.g. 10 // 256 -> 0 completed --> store chunk[3] (skip 3)
chunk_stride = mamba_block_size // chunk_size
first_aligned_chunk = \
torch.concat([torch.zeros(1, \
Collaborator:
why do you need an additional torch.zeros?

Member:
Re-written this

# hit_length = len(hit_blocks_other_attn[0])
# * self.other_block_size
# so we insert dummy blocks at the beginning:
if i > 0:
Collaborator:
I think we don't need if i>0

@tdoublep (Member) commented on Oct 4, 2025:
@heheda12345 Opened this PR to address your review comments: #26222

@ZJY0516 (Contributor) commented on Oct 5, 2025:
This PR makes tests/kernels/mamba/test_causal_conv1d.py::test_causal_conv1d_update fail. Fix in #26250.

@ZJY0516 mentioned this pull request on Oct 5, 2025.
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request on Oct 6, 2025
karan pushed a commit to karan/vllm that referenced this pull request on Oct 6, 2025
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request on Oct 7, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 10, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request on Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request on Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Oct 24, 2025
Labels: ready, v1