
[Feature]: Per-sequence speculative decoding #17984

@yesredpig

Description


🚀 The feature, motivation and pitch

1. Problem

Currently, increasing the batch size in vLLM's speculative decoding inference causes inefficiency.
When using a LLaMA 1B SSM (draft) model with a LLaMA 70B target model, a performance reversal occurs at batch size 32.
In addition, when num_speculative_tokens (SL; speculative length) is large, the inefficiency grows even more as the batch size increases (Fig. 2).

vLLM was also aware of the need for optimization for this. (https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit?tab=t.0#heading=h.kk7dq05lc6q8)

2. Previous work in vLLM 

To handle increasing batch sizes in SD, vLLM has introduced the following: Batch Expansion (#3103) and the MQA (Multi-Query Attention) Scorer.

"Batch expansion" expands the batch by the factor of k (num_speculative_tokens). Each original sequence + one proposal token become a separate sequence in the expanded batch for the target model's scoring pass. Because of the "expansion" it has drawn backs as it increases memory usage and attention calculation by factor K.
To overcome the drawbacks, MQAScorer utilizes specialized MQA kernels (when available) to score all k proposal tokens for each sequence without expanding the batch size explicitly.
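To make the shape difference concrete, here is a minimal sketch (toy sizes and plain tensors, not vLLM's actual data structures) of how many scoring rows the target model sees under each approach:

```python
import torch

# Toy setup: B sequences in the batch, k speculative (proposal) tokens per
# sequence from the draft model, plus 1 bonus token scored by the target.
B, k, hidden = 4, 5, 8

# --- Batch expansion (sketch) -----------------------------------------------
# Every proposal position becomes its own single-token scoring row, so the
# target model's scoring pass sees roughly B * (k + 1) rows.
expanded_queries = torch.zeros(B * (k + 1), 1, hidden)
print("batch expansion:", tuple(expanded_queries.shape))   # (24, 1, 8)

# --- MQA-style scoring (sketch) ---------------------------------------------
# The batch stays at B; each sequence scores its k + 1 tokens in a single
# multi-query attention pass against the shared KV cache, with no expansion.
mqa_queries = torch.zeros(B, k + 1, hidden)
print("MQA scorer:", tuple(mqa_queries.shape))              # (4, 6, 8)
```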

Both Batch Expansion and the MQA Scorer face performance degradation with dynamic shapes under CUDA Graphs: if k (num_speculative_tokens) changes every iteration or differs across sequences, the captured CUDA graph cannot handle the changing shapes.

Ultimately, vLLM's current batch processing can only handle a static SL, i.e. a k (num_speculative_tokens) that is fixed before inference starts.

3. Dynamic SL 

Even if a piece-wise CUDA graph is applied to MQA, the inefficiency of every sequence in a batch sharing one fixed SL value is not resolved. Moreover, models such as EAGLE-2 that use tree attention need a large SL (e.g. 32), so as the batch grows, the MQA padding space grows with it.
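To illustrate the point, here is a toy calculation (hypothetical per-sequence lengths, not measured data) of how much of the MQA query space turns into padding when every sequence is padded to a static SL:

```python
import torch

# Hypothetical "useful" speculative lengths for the sequences of one batch.
per_seq_sl = torch.tensor([2, 4, 1, 7, 3, 2, 5, 1])
static_sl = 32                      # e.g. a tree-attention setup needing SL ~32

useful = per_seq_sl.sum().item()
padded = static_sl * len(per_seq_sl)
print(f"useful tokens: {useful}, scored with padding: {padded}, "
      f"waste: {1 - useful / padded:.1%}")
# -> useful tokens: 25, scored with padding: 256, waste: 90.2%
# The padded space (and hence the waste) grows linearly with the batch size.
```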

The following experiment sets max SL to 60 and checks how many proposals are accepted before the first rejection. The results show that OracleSL covers a wide range, from 1 to 60, and that it performs 117 fewer inference steps than running with the best static SL value of 4. Since the target model is only 8B, there was no significant difference in speed, but with a larger target model the gap between OracleSL and StaticSL will be large.

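For reference, a minimal sketch of how OracleSL can be measured for one sequence and one step under greedy acceptance (the token ids are made up; vLLM's actual speculative sampling uses a probabilistic accept/reject rule):

```python
import torch

def oracle_sl(draft_tokens: torch.Tensor, target_tokens: torch.Tensor) -> int:
    """Number of draft proposals accepted before the first rejection,
    i.e. the length of the matching prefix under greedy verification.

    Both tensors hold (max_sl,) token ids for one sequence; in the
    experiment above, max_sl = 60.
    """
    mismatch = (draft_tokens != target_tokens).nonzero()
    return int(mismatch[0]) if mismatch.numel() else draft_tokens.numel()

# Toy example: the 4th proposal is rejected, so OracleSL for this step is 3.
draft  = torch.tensor([11, 42, 7, 99, 5])
target = torch.tensor([11, 42, 7, 13, 5])
print(oracle_sl(draft, target))    # 3
```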

Previous papers have shown that it is effective to use a different SL for each sequence and a different SL for each iteration:

  • BASS
  • DISCO
  • SPRINTER
  • AdaEDL

4. CONCLUSION 

In conclusion, per-sequence decoding is necessary to apply a dynamic SL to each iteration and each sequence in the batch, and to resolve the inefficiency caused by increasing batch sizes.

My team is implementing per-sequence decoding in the FlashAttention-2 kernel. However, we are currently developing against vLLM 0.8.4, so it would be nice if the schedule for SD updates in vLLM V1 could be kept in sync.
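As a rough illustration of what a per-sequence SL interface could look like at the kernel boundary, the sketch below packs variable-length query chunks into the cumulative-length (cu_seqlens) layout that FlashAttention-2 style varlen kernels consume; the lengths and variable names are hypothetical, and this is not our actual implementation:

```python
import torch

# Hypothetical speculative lengths chosen per sequence for this iteration
# (+1 for the bonus token scored together with the proposals).
per_seq_sl = torch.tensor([3, 1, 6, 2])
query_lens = per_seq_sl + 1

# Pack all query tokens of the batch into one flat dimension and describe the
# per-sequence boundaries with cumulative lengths -- the layout consumed by
# varlen attention entry points such as flash_attn_varlen_func, so no sequence
# needs to be padded to a global static SL.
cu_seqlens_q = torch.zeros(len(query_lens) + 1, dtype=torch.int32)
cu_seqlens_q[1:] = torch.cumsum(query_lens, dim=0)
max_seqlen_q = int(query_lens.max())

hidden = 128
q = torch.zeros(int(cu_seqlens_q[-1]), hidden)   # (sum of query lens, hidden)

print(cu_seqlens_q.tolist())       # [0, 4, 6, 13, 16]
print(q.shape[0], max_seqlen_q)    # 16 7
```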

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
