[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT
#23041
Conversation
Code Review
This pull request refactors the speculative decoding logic to make draft token proposal non-blocking, which should improve time-to-first-token (TTFT). The core idea is to decouple returning the sampled tokens from proposing draft tokens for the next step. This is achieved by caching the proposed draft tokens in the GPUModelRunner and processing them at the beginning of the next engine step. The changes appear logically sound and well implemented toward this goal. However, I've identified a critical issue where the change in return type for propose_draft_token_ids is not handled for the 'medusa' speculative decoding method, which will lead to a runtime error.
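To make the caching pattern concrete, here is a minimal, self-contained sketch. The class and method names (e.g. ModelRunnerSketch, take_draft_token_ids) are illustrative placeholders, not necessarily the exact identifiers used in vLLM:

```python
from typing import Optional

import numpy as np


class ModelRunnerSketch:
    """Toy stand-in for GPUModelRunner, showing the draft-token caching."""

    def __init__(self) -> None:
        # Drafts proposed at the end of step N, consumed at the start of step N + 1.
        self._draft_token_ids: Optional[list[np.ndarray]] = None

    def execute_model(self, num_reqs: int) -> list[int]:
        # Stand-in for running the target model and sampling one token per request.
        sampled_token_ids = list(range(num_reqs))
        # Propose drafts for the next step, but cache them instead of returning
        # them together with the sampled tokens, so the sampled tokens can go
        # back to the scheduler immediately.
        self._draft_token_ids = [
            np.array([t + 1, t + 2], dtype=np.int32) for t in sampled_token_ids
        ]
        return sampled_token_ids

    def take_draft_token_ids(self) -> Optional[list[np.ndarray]]:
        # Called by the engine at the beginning of the next step.
        drafts, self._draft_token_ids = self._draft_token_ids, None
        return drafts


runner = ModelRunnerSketch()
print(runner.execute_model(num_reqs=2))  # sampled tokens, returned without waiting
print(runner.take_draft_token_ids())     # cached drafts, picked up next step
```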
```python
draft_token_ids = self.drafter.propose(
    target_hidden_states=hidden_states,
    sampling_metadata=sampling_metadata,
)
```
The return type of propose_draft_token_ids has been changed to Union[list[np.ndarray], torch.Tensor]. However, when using the "medusa" speculative decoding method, self.drafter.propose (from MedusaProposer) returns a list[list[int]]. This violates the new type hint and will cause a runtime AttributeError in scheduler.update_draft_token_ids when it tries to call .tolist() on a list[int] object.
To fix this, the output from the medusa proposer should be converted to a list[np.ndarray] to be consistent with the new return type.
Suggested change:

```diff
-        draft_token_ids = self.drafter.propose(
-            target_hidden_states=hidden_states,
-            sampling_metadata=sampling_metadata,
-        )
+        draft_token_ids_list = self.drafter.propose(
+            target_hidden_states=hidden_states,
+            sampling_metadata=sampling_metadata)
+        draft_token_ids = [np.array(t, dtype=np.int32) for t in draft_token_ids_list]
```
Fixed the type annotation in Medusa and added a TODO on the optimization. cc @skylee-01
Thanks for the work.
@skylee-01 Can you please write a PR that implements this optimization? Basically, we want the medusa proposer to return a GPU tensor of shape [num_reqs, num_spec_tokens] instead of list[list[int]].
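For illustration, a rough sketch of the shape change being requested. The variable names are placeholders, not the actual MedusaProposer code:

```python
import torch

num_reqs, num_spec_tokens = 4, 3
device = "cuda" if torch.cuda.is_available() else "cpu"

# Today (conceptually): drafts as a Python list of lists, built on the host.
draft_token_ids_list = [[1, 2, 3] for _ in range(num_reqs)]

# Requested: a single tensor of shape [num_reqs, num_spec_tokens] on the
# proposer's device, so no per-request Python lists need to be materialized.
draft_token_ids = torch.tensor(
    draft_token_ids_list, dtype=torch.int32, device=device
)
assert draft_token_ids.shape == (num_reqs, num_spec_tokens)
```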
Thanks @WoosukKwon, this looks great
Currently, using spec decoding increases TTFT from `target_model_prefill_time` to `target_model_prefill_time + draft_model_prefill_time`, because the first token is returned together with the draft token ids.

This PR removes this false dependency by restructuring the `step` method, so that the sampled tokens can be returned to the scheduler without waiting for the draft tokens.

This optimization reduces TTFT by 0-30% for long-prefill or high-QPS cases, while TPOT and overall throughput remain unchanged.
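A rough sketch of the restructured step ordering described above. The `engine_step`, `scheduler`, and `runner` names and method signatures are illustrative (only `update_draft_token_ids` is mentioned in the review above), not the exact vLLM API:

```python
def engine_step(scheduler, runner):
    # 1. Pick up the drafts cached by the previous step (None on the first step)
    #    and hand them to the scheduler.
    drafts = runner.take_draft_token_ids()
    if drafts is not None:
        scheduler.update_draft_token_ids(drafts)

    # 2. Schedule and run the target model. The sampled tokens come back
    #    without waiting on the draft model, so the first token of a prefill
    #    is not delayed by draft_model_prefill_time.
    scheduler_output = scheduler.schedule()
    model_output = runner.execute_model(scheduler_output)

    # 3. Return outputs to the user. The drafts proposed during this step stay
    #    cached inside the runner until step 1 of the next iteration.
    return scheduler.update_from_output(scheduler_output, model_output)
```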