Open
Labels
feature request
Description
🚀 The feature, motivation and pitch
This issue tracks follow-up work for Mamba2 Automatic Prefix Caching. Below is a non-exhaustive list of work items identified so far:
- Address comments left over from review ([V1] [Hybrid] Some additional clean-up in Mamba2 prefix caching #26222)
- Implement policy for freeing mamba blocks to fix performance in throughput benchmarks
- Relax the constraint that the mamba block size must be a multiple of the chunk size
- Give user flexibility to set mamba caching granularity ([Hybrid] Add mamba_block_size to Engine Args #27289)
- Support mamba prefix caching together with speculative decoding
- Fuse logic for SSM state writing into kernels ([V1] Mamba2 SSD kernel integration #26235)
- Test TP>1 behaviour
- Cache metadata builds across KV cache groups ([Attention] Cache attention metadata builds across hybrid KV-cache groups #22788)
- Additional cleanup in causal_conv1d kernels (e.g., strip out unused logic)
- Enable prefix caching for Mamba1 ([V1] [Hybrid] Mamba1 Automatic Prefix Caching #26377)
- Enable prefix caching for ShortConv
- Enable prefix caching for LinearAttention
- Enable prefix caching for GDN ([V1][Hybrid] GatedDeltaNet Automatic Prefix Caching #26807)
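To make the block-size item above concrete, here is a minimal sketch of the constraint as stated in this list: the mamba block size must be a multiple of the chunk size. The function names and the example sizes are hypothetical and chosen for illustration only; this is not the check used in the codebase.

```python
def is_valid_mamba_block_size(mamba_block_size: int, chunk_size: int) -> bool:
    """Return True if the block size is a positive multiple of the chunk size.

    Hypothetical helper: it only mirrors the constraint described in this issue.
    """
    if mamba_block_size <= 0 or chunk_size <= 0:
        raise ValueError("mamba_block_size and chunk_size must be positive")
    return mamba_block_size % chunk_size == 0


def round_up_to_chunk_multiple(mamba_block_size: int, chunk_size: int) -> int:
    """Round a requested block size up to the nearest multiple of chunk_size.

    Hypothetical helper: one possible way to satisfy the constraint while it
    is still in place.
    """
    if mamba_block_size <= 0 or chunk_size <= 0:
        raise ValueError("mamba_block_size and chunk_size must be positive")
    # Ceiling division using integer arithmetic only.
    num_chunks = -(-mamba_block_size // chunk_size)
    return num_chunks * chunk_size


if __name__ == "__main__":
    chunk_size = 256  # assumed value, for illustration only
    for requested in (256, 400, 512):
        ok = is_valid_mamba_block_size(requested, chunk_size)
        rounded = round_up_to_chunk_multiple(requested, chunk_size)
        print(f"requested={requested} valid={ok} rounded_up={rounded}")
```

Relaxing the constraint would presumably remove the need for this kind of rounding, so that the caching granularity could be chosen independently of the chunk size.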
Alternatives
No response
Additional context
No response