
Commit b8aaf26

sroy745 authored and Alvant committed
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM (vllm-project#7962)
Signed-off-by: Alvant <[email protected]>
1 parent 0787988 commit b8aaf26

File tree

2 files changed: +59 -0 lines changed


docs/source/models/spec_decode.rst

Lines changed: 40 additions & 0 deletions
@@ -161,6 +161,46 @@ A variety of speculative models of this type are available on HF hub:
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_

Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
speculative decoding, breaking them down into three key areas:

1. **Theoretical Losslessness**
- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
cause slight variations in output distributions, as discussed
in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_ (a sketch of the
acceptance rule follows this list).

2. **Algorithmic Losslessness**
- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

- **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_

- **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_.

3. **vLLM Logprob Stability**
- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.

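To make the acceptance rule concrete, here is a minimal, illustrative sketch of the speculative (rejection) sampling step from the paper cited in point 1. It is not vLLM's ``RejectionSampler`` implementation; the names ``target_probs``, ``draft_probs``, and ``draft_token`` are hypothetical and refer to a single proposed token.

.. code-block:: python

    # Illustrative sketch: accept a draft token with probability min(1, p(x)/q(x));
    # on rejection, resample from the residual distribution norm(max(0, p - q)).
    # This rule is what makes speculative sampling match the target distribution.
    import torch

    def accept_or_resample(target_probs: torch.Tensor,  # p over the vocabulary (target model)
                           draft_probs: torch.Tensor,   # q over the vocabulary (draft model)
                           draft_token: int) -> int:
        p, q = target_probs[draft_token], draft_probs[draft_token]
        if torch.rand(()) <= torch.clamp(p / q, max=1.0):
            return draft_token                          # accepted
        residual = torch.clamp(target_probs - draft_probs, min=0.0)
        residual = residual / residual.sum()            # renormalize the leftover probability mass
        return int(torch.multinomial(residual, 1))      # resample from the residual
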
**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.

- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
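
To see the greedy-equality property in practice, here is a hedged sketch comparing greedy (temperature 0) generation with and without speculative decoding using the offline ``LLM`` API. The model names and speculative settings are illustrative placeholders, and argument names may vary slightly across vLLM versions.

.. code-block:: python

    # Illustrative comparison: with a lossless implementation, greedy decoding
    # should produce the same text whether or not a draft model is used.
    from vllm import LLM, SamplingParams

    prompts = ["The future of AI is"]
    greedy = SamplingParams(temperature=0.0, max_tokens=32)

    baseline = LLM(model="facebook/opt-6.7b")
    out_a = baseline.generate(prompts, greedy)[0].outputs[0].text

    # In practice you may want to run the two configurations in separate
    # processes so GPU memory is freed between runs.
    with_spec = LLM(model="facebook/opt-6.7b",
                    speculative_model="facebook/opt-125m",
                    num_speculative_tokens=5)
    out_b = with_spec.generate(prompts, greedy)[0].outputs[0].text

    # Expected to match per the greedy sampling equality tests; the small
    # numerical effects described above are the caveat.
    print(out_a == out_b)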

Resources for vLLM contributors
-------------------------------

docs/source/serving/faq.rst

Lines changed: 19 additions & 0 deletions
@@ -10,3 +10,22 @@ A: Assuming that you're referring to using OpenAI compatible server to serve mul
Q: Which model should I use for offline inference embedding?

A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
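
A hedged sketch of offline embedding with that model, using the ``LLM.encode`` API, is shown below; the exact output fields may differ between vLLM versions.

.. code-block:: python

    # Illustrative offline embedding: load an embedding model and encode prompts.
    from vllm import LLM

    llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
    outputs = llm.encode(["Hello, my name is"])
    print(len(outputs[0].outputs.embedding))  # embedding dimension, e.g. 4096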

----------------------------------------

Q: Can the output of a prompt vary across runs in vLLM?

A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.

In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.
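
The following standalone snippet (not vLLM code) is a hedged illustration of the batch/slice effect described in the PyTorch note above: the same row can produce slightly different results depending on the batch shape it is computed in.

.. code-block:: python

    # Illustrative only: reduced-precision matmuls may give slightly different
    # results for the same row when computed as part of different batch shapes.
    import torch

    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(8, 1024, dtype=torch.bfloat16, device=device)
    w = torch.randn(1024, 1024, dtype=torch.bfloat16, device=device)

    full_batch = a @ w          # row 0 computed as part of a batch of 8
    single_row = a[:1] @ w      # row 0 computed on its own
    print(torch.equal(full_batch[:1], single_row))  # may print False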

**Mitigation Strategies**

- For improved stability and reduced variance, use ``float32``. Note that this will require more memory.
- If using ``bfloat16``, switching to ``float16`` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur (see the example after this list).
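
A brief, hedged example of combining these mitigations with the offline ``LLM`` API; the model name is a placeholder, and argument names may vary slightly across vLLM versions.

.. code-block:: python

    # Illustrative mitigations: higher-precision dtype plus a per-request seed.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m", dtype="float32")   # more memory, steadier numerics
    params = SamplingParams(temperature=0.8, seed=1234)     # seeded sampling for temperature > 0
    outputs = llm.generate(["The future of AI is"], params)
    print(outputs[0].outputs[0].text)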
