
Conversation

@WoosukKwon (Collaborator) commented Aug 17, 2025

Currently, using speculative decoding increases TTFT from target_model_prefill_time to target_model_prefill_time + draft_model_prefill_time, because the first token is returned together with the draft token ids.

This PR removes that artificial dependency by restructuring the step method, so that the sampled tokens can be returned to the scheduler without waiting for the draft tokens.

This optimization reduces TTFT by 0-30% for long-prefill or high-QPS cases, while TPOT and overall throughput remain unchanged.
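
For illustration, a minimal sketch of the restructured flow; helper names such as take_draft_token_ids are hypothetical, and the actual vLLM step method differs in detail:

def step(scheduler, model_runner):
    scheduler_output = scheduler.schedule()

    # Run the target model and sample the next token for each request.
    model_output = model_runner.execute_model(scheduler_output)

    # 1. Hand the sampled tokens back to the scheduler immediately, so
    #    the first token is streamed without waiting on the drafter.
    engine_outputs = scheduler.update_from_output(scheduler_output,
                                                  model_output)

    # 2. Only afterwards collect the draft tokens proposed for the next
    #    step; they are delivered separately instead of being bundled
    #    with the sampled tokens. (take_draft_token_ids is hypothetical.)
    draft_token_ids = model_runner.take_draft_token_ids()
    if draft_token_ids is not None:
        scheduler.update_draft_token_ids(draft_token_ids)

    return engine_outputs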

Signed-off-by: Woosuk Kwon <[email protected]>
@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify bot added the v1 and tpu (Related to Google TPUs) labels Aug 17, 2025
@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request refactors the speculative decoding logic to make draft-token proposal non-blocking, which should improve time-to-first-token (TTFT). The core idea is to decouple returning the sampled tokens from proposing draft tokens for the next step: the proposed draft tokens are cached in the GPUModelRunner and processed at the beginning of the next engine step. The changes appear logically sound and well implemented toward this goal. However, I've identified a critical issue: the changed return type of propose_draft_token_ids is not handled for the 'medusa' speculative decoding method, which will lead to a runtime error.

Comment on lines +1838 to 1841
draft_token_ids = self.drafter.propose(
target_hidden_states=hidden_states,
sampling_metadata=sampling_metadata,
)

critical

The return type of propose_draft_token_ids has been changed to Union[list[np.ndarray], torch.Tensor]. However, when using the "medusa" speculative decoding method, self.drafter.propose (from MedusaProposer) returns a list[list[int]]. This violates the new type hint and will cause a runtime AttributeError in scheduler.update_draft_token_ids when it tries to call .tolist() on a list[int] object.

To fix this, the output from the medusa proposer should be converted to a list[np.ndarray] to be consistent with the new return type.

Suggested change
draft_token_ids = self.drafter.propose(
target_hidden_states=hidden_states,
sampling_metadata=sampling_metadata,
)
draft_token_ids_list = self.drafter.propose(
    target_hidden_states=hidden_states,
    sampling_metadata=sampling_metadata,
)
# Convert each request's proposals to an int32 ndarray so the result
# matches the new Union[list[np.ndarray], torch.Tensor] return type.
draft_token_ids = [
    np.array(t, dtype=np.int32) for t in draft_token_ids_list
]

@WoosukKwon (Collaborator, Author) replied:

Fixed the type annotation in Medusa and added a TODO on the optimization. cc @skylee-01

Contributor:

Thanks for the work.

@WoosukKwon (Collaborator, Author) replied:

@skylee-01 Can you please write a PR that implements this optimization? Basically, we want the medusa proposer to return a GPU tensor of shape [num_reqs, num_spec_tokens] instead of list[list[int]].
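
For reference, a rough sketch of the requested shape change; head_logits (one [num_reqs, vocab_size] logits tensor per Medusa head) is an assumption about the proposer's internals, not the actual MedusaProposer code:

import torch

def propose(head_logits: list[torch.Tensor]) -> torch.Tensor:
    # Greedy-pick one token per head ([num_reqs] each) and stack along
    # dim 1 to get [num_reqs, num_spec_tokens]. The result stays on the
    # GPU, avoiding the per-step device-to-host copy that building a
    # list[list[int]] requires.
    return torch.stack(
        [logits.argmax(dim=-1) for logits in head_logits], dim=1)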

@WoosukKwon added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 17, 2025
Signed-off-by: Woosuk Kwon <[email protected]>
@WoosukKwon (Collaborator, Author) commented:

cc @njhill @robertgshaw2-redhat

@njhill (Member) left a comment

Thanks @WoosukKwon, this looks great

@WoosukKwon merged commit c9b38be into main on Aug 19, 2025
35 of 42 checks passed
@WoosukKwon deleted the woosuk/spec-ttft branch on August 19, 2025 at 00:20
adobrzyn added a commit to vllm-project/vllm-gaudi that referenced this pull request Aug 19, 2025
princepride pushed a commit to princepride/vllm that referenced this pull request Aug 20, 2025
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
cyang49 pushed a commit to cyang49/vllm that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
slokesha pushed a commit to slokesha/vllm-gaudi that referenced this pull request Aug 27, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed) · speculative-decoding · tpu (Related to Google TPUs) · v1
