[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT
#23041
Conversation
Code Review
This pull request refactors the speculative decoding logic to make draft token proposal non-blocking, which should improve time-to-first-token (TTFT). The core idea is to decouple returning the sampled tokens from proposing draft tokens for the next step. This is achieved by caching the proposed draft tokens in the GPUModelRunner and processing them at the beginning of the next engine step. The changes appear logically sound and well implemented toward this goal. However, I've identified a critical issue where the change in return type for propose_draft_token_ids is not handled for the 'medusa' speculative decoding method, which will lead to a runtime error.
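To make the caching pattern concrete, here is a minimal, self-contained sketch. The class and method names (e.g. ModelRunnerSketch, take_draft_token_ids) are illustrative placeholders, not necessarily the exact identifiers used in vLLM:

```python
from typing import Optional

import numpy as np


class ModelRunnerSketch:
    """Toy stand-in for GPUModelRunner, showing the draft-token caching."""

    def __init__(self) -> None:
        # Drafts proposed at the end of step N, consumed at the start of step N + 1.
        self._draft_token_ids: Optional[list[np.ndarray]] = None

    def execute_model(self, num_reqs: int) -> list[int]:
        # Stand-in for running the target model and sampling one token per request.
        sampled_token_ids = list(range(num_reqs))
        # Propose drafts for the next step, but cache them instead of returning
        # them together with the sampled tokens, so the sampled tokens can go
        # back to the scheduler immediately.
        self._draft_token_ids = [
            np.array([t + 1, t + 2], dtype=np.int32) for t in sampled_token_ids
        ]
        return sampled_token_ids

    def take_draft_token_ids(self) -> Optional[list[np.ndarray]]:
        # Called by the engine at the beginning of the next step.
        drafts, self._draft_token_ids = self._draft_token_ids, None
        return drafts


runner = ModelRunnerSketch()
print(runner.execute_model(num_reqs=2))  # sampled tokens, returned without waiting
print(runner.take_draft_token_ids())     # cached drafts, picked up next step
```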
```python
draft_token_ids = self.drafter.propose(
    target_hidden_states=hidden_states,
    sampling_metadata=sampling_metadata,
)
```
The return type of propose_draft_token_ids has been changed to Union[list[np.ndarray], torch.Tensor]. However, when using the "medusa" speculative decoding method, self.drafter.propose (from MedusaProposer) returns a list[list[int]]. This violates the new type hint and will cause a runtime AttributeError in scheduler.update_draft_token_ids when it tries to call .tolist() on a list[int] object.
To fix this, the output from the medusa proposer should be converted to a list[np.ndarray] to be consistent with the new return type.
Suggested change:

```diff
-        draft_token_ids = self.drafter.propose(
-            target_hidden_states=hidden_states,
-            sampling_metadata=sampling_metadata,
-        )
+        draft_token_ids_list = self.drafter.propose(
+            target_hidden_states=hidden_states,
+            sampling_metadata=sampling_metadata)
+        draft_token_ids = [np.array(t, dtype=np.int32) for t in draft_token_ids_list]
```
Fixed the type annotation in Medusa and added a TODO on the optimization. cc @skylee-01
Thanks for the work.
@skylee-01 Can you please write a PR that implements this optimization? Basically, we want the medusa proposer to return a GPU tensor of shape [num_reqs, num_spec_tokens] instead of list[list[int]].
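For illustration, a rough sketch of the shape change being requested. The variable names are placeholders, not the actual MedusaProposer code:

```python
import torch

num_reqs, num_spec_tokens = 4, 3
device = "cuda" if torch.cuda.is_available() else "cpu"

# Today (conceptually): drafts as a Python list of lists, built on the host.
draft_token_ids_list = [[1, 2, 3] for _ in range(num_reqs)]

# Requested: a single tensor of shape [num_reqs, num_spec_tokens] on the
# proposer's device, so no per-request Python lists need to be materialized.
draft_token_ids = torch.tensor(
    draft_token_ids_list, dtype=torch.int32, device=device
)
assert draft_token_ids.shape == (num_reqs, num_spec_tokens)
```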
Thanks @WoosukKwon, this looks great
Currently, using spec decoding increases TTFT from `target_model_prefill_time` to `target_model_prefill_time + draft_model_prefill_time`, because the first token is returned together with the draft token ids.

This PR removes this false dependency by restructuring the `step` method, so that the sampled tokens can be returned to the scheduler without waiting for the draft tokens.

This optimization reduces TTFT by 0-30% for long-prefill or high-QPS cases, while TPOT and overall throughput remain unchanged.
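A rough sketch of the restructured step ordering described above. The `engine_step`, `scheduler`, and `runner` names and method signatures are illustrative (only `update_draft_token_ids` is mentioned in the review above), not the exact vLLM API:

```python
def engine_step(scheduler, runner):
    # 1. Pick up the drafts cached by the previous step (None on the first step)
    #    and hand them to the scheduler.
    drafts = runner.take_draft_token_ids()
    if drafts is not None:
        scheduler.update_draft_token_ids(drafts)

    # 2. Schedule and run the target model. The sampled tokens come back
    #    without waiting on the draft model, so the first token of a prefill
    #    is not delayed by draft_model_prefill_time.
    scheduler_output = scheduler.schedule()
    model_output = runner.execute_model(scheduler_output)

    # 3. Return outputs to the user. The drafts proposed during this step stay
    #    cached inside the runner until step 1 of the next iteration.
    return scheduler.update_from_output(scheduler_output, model_output)
```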