[V1][Hybrid] GatedDeltaNet Automatic Prefix Caching #26807
Conversation
Signed-off-by: simondanielsson <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: simondanielsson <[email protected]>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <[email protected]> Signed-off-by: FENP <[email protected]> Signed-off-by: Jaya Yuan <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
0e64636 to 1d3afe0
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
…make sure prefill block-history indexing captures decode chunks Signed-off-by: simondanielsson <[email protected]>
…ng GDN_RECOMPUTE_SUPPRESS_LEVEL Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
Signed-off-by: simondanielsson <[email protected]>
@codex review
Codex Review: Didn't find any major issues. Another round soon, please!
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
tdoublep left a comment
Thank you for working on this! I have some initial comments and questions.
I can help to benchmark this on H100
GDN_MODELS = ["tiny-random/qwen3-next-moe"]
Is there any specific reason to split it off into GDN models?
):
    raise ValueError(
        "GDN prefix caching requires the mamba block size to be a "
        "multiple of the kernel chunk size."
Maybe include self.chunk_size in the error message to help guide the user to set it correctly?
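For example, the message could surface the offending values, roughly along these lines (a sketch only; `mamba_block_size` is a placeholder for whatever variable holds the configured block size at that point):

```python
# Sketch of the suggested error message; names other than self.chunk_size
# are placeholders, not taken from the PR.
raise ValueError(
    "GDN prefix caching requires the mamba block size to be a "
    f"multiple of the kernel chunk size (mamba_block_size={mamba_block_size}, "
    f"chunk_size={self.chunk_size})."
)
```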
# Decode-side APC metadata
state_indices_tensor_d: torch.Tensor | None = None
state_indices_tensor_p: torch.Tensor | None = None
Move this tensor with _p to the section below?
self.state_indices_tensor_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs, self._max_cached_blocks),
    dtype=torch.int32,
    device=device,
)
I don't think we need these buffers for tensors that relate to prefill, because we don't use full CUDA graphs for batches that contain prefills.
max_num_prefill_chunks = (
    cdiv(vllm_config.model_config.max_model_len, self.chunk_size)
    * self.decode_cudagraph_max_bs
)
Not sure why the number of prefill chunks could be related to the maximum decode-only batch size
self.cu_chunk_seqlen_p_buf = torch.empty(
    (max_num_prefill_chunks + 1,),
    dtype=torch.int32,
    device=device,
)
self.last_chunk_indices_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs,),
    dtype=torch.int32,
    device=device,
)
self.num_computed_tokens_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs,),
    dtype=torch.int32,
    device=device,
)
self.block_idx_first_scheduled_token_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs,),
    dtype=torch.int32,
    device=device,
)
self.block_idx_last_computed_token_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs,),
    dtype=torch.int32,
    device=device,
)
self.block_idx_last_scheduled_token_p_buf = torch.empty(
    (self.decode_cudagraph_max_bs,),
    dtype=torch.int32,
    device=device,
)
Same question for all of these prefill tensors - why do we need to use static buffers?
    torch.int32
)

if spec_sequence_masks is not None:
If I understand correctly, this PR does not attempt to support APC + spec dec. Could we simplify this logic by just raising if spec decode is enabled?
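If that simplification is adopted, the guard could look roughly like the sketch below (illustrative only; the exact config attributes and where the check should live are assumptions):

```python
# Hypothetical early-out: refuse APC together with speculative decoding
# instead of threading spec-sequence masks through the APC metadata path.
if (
    vllm_config.speculative_config is not None
    and vllm_config.cache_config.enable_prefix_caching
):
    raise NotImplementedError(
        "GDN automatic prefix caching is not supported together with "
        "speculative decoding; disable one of the two."
    )
```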
num_computed_tokens_cpu_non_spec = m.num_computed_tokens_cpu

if num_decodes > 0:
    state_indices_tensor_d = non_spec_block_table[:num_decodes].contiguous()
Why do we need the .contiguous() on these?
cu_chunk_seqlen: list[int] = []
seq_idx_list: list[int] = []
last_chunk_indices_list: list[int] = []
seqlen_pos = 0
Where do these actually get used by the model?
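For context, these lists appear to describe how each prefill is carved into kernel-sized chunks before being handed to the GDN kernel. A rough, purely illustrative way such metadata can be built (not the PR's actual code):

```python
def build_chunk_metadata(seq_lens: list[int], chunk_size: int):
    """Illustrative only: cumulative chunk lengths, the owning sequence index
    of each chunk, and the index of the last chunk of every sequence."""
    cu_chunk_seqlen: list[int] = [0]
    seq_idx_list: list[int] = []
    last_chunk_indices_list: list[int] = []
    for i, seq_len in enumerate(seq_lens):
        num_chunks = -(-seq_len // chunk_size)  # ceil division
        for c in range(num_chunks):
            chunk_len = min(chunk_size, seq_len - c * chunk_size)
            cu_chunk_seqlen.append(cu_chunk_seqlen[-1] + chunk_len)
            seq_idx_list.append(i)
        last_chunk_indices_list.append(len(seq_idx_list) - 1)
    return cu_chunk_seqlen, seq_idx_list, last_chunk_indices_list
```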
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Part of #26201.
Adds Automatic Prefix Caching (APC) for GDN, following the approach used for Mamba2 APC as introduced in #25752.
Specifically:
- Enables `Qwen3NextGatedDeltaNet` to recycle cached states during decode by copying the last computed block into the newly scheduled slot, and during prefill to replay the returned chunk history into persistent SSM cache blocks so later tokens can hit the prefix cache. An illustrative sketch of this flow is included after the benchmark plot below.

Latency benchmark (APC ("default") vs no-APC ("default-noapc")):

TODOs:
- `GDN_RECOMPUTE_SUPPRESS_LEVEL=4`.

Outstanding tasks, not captured here:
Test Plan
Note: this runs only with the tiny `tiny-random/qwen3-next-moe` model, as I only have an L4 with 20 GB VRAM. It would be great if someone could also try with Qwen3-Next-80B-A3B.
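A minimal way to exercise this path locally could look like the following; the engine arguments are standard vLLM flags, but the exact invocation is an illustration rather than the PR's test code:

```python
from vllm import LLM, SamplingParams

# Tiny random GDN/MoE checkpoint so it fits on a small GPU; prefix caching on.
llm = LLM(
    model="tiny-random/qwen3-next-moe",
    enable_prefix_caching=True,
    enforce_eager=True,  # set to False to also exercise the CUDA graph path
)

prompt = "The quick brown fox jumps over the lazy dog. " * 8
params = SamplingParams(max_tokens=32)

# Send the same prompt twice: the second request should hit the prefix cache.
print(llm.generate([prompt], params)[0].outputs[0].text)
print(llm.generate([prompt], params)[0].outputs[0].text)
```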
Test Result
Note: gibberish output due to random model.
No cudagraphs (`enforce_eager=True`):
With cudagraphs (`enforce_eager=False`):
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.