In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
speculative decoding, breaking them down into three key areas:

1. **Theoretical Losslessness**

   - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
     cause slight variations in output distributions, as discussed
     in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_. A simplified sketch of this
     sampling rule is shown after this list.

2. **Algorithmic Losslessness**

   - vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

     - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
       distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_.

     - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
       without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
       provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
       verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_.

3. **vLLM Logprob Stability**

   - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
     same request across runs. For more details, see the FAQ section
     titled *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
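
As referenced in the first point above, the following is a minimal, illustrative sketch of the accept/correct rule that speculative sampling relies on. It is **not** vLLM's actual rejection sampler (see the test code linked above for the real implementation); the function and variable names are placeholders for this example only.

.. code-block:: python

    import torch

    def speculative_accept(draft_token: int,
                           p_target: torch.Tensor,
                           q_draft: torch.Tensor) -> int:
        """Accept or replace one draft token so the result follows ``p_target``.

        ``p_target`` and ``q_draft`` are 1-D probability vectors over the vocabulary.
        """
        # Accept the draft token with probability min(1, p(x) / q(x)).
        accept_prob = torch.clamp(p_target[draft_token] / q_draft[draft_token], max=1.0)
        if torch.rand(1) < accept_prob:
            return draft_token
        # On rejection, resample from the normalized residual max(0, p - q).
        # This correction step is what makes the overall scheme sample exactly
        # from the target distribution, up to floating-point precision.
        residual = torch.clamp(p_target - q_draft, min=0.0)
        return int(torch.multinomial(residual / residual.sum(), num_samples=1))

Under this rule, the output token stream follows the target model's distribution alone (up to the floating-point caveats above); the draft model only affects speed.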

**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.

- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
  due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.

``docs/source/serving/faq.rst``

Q: Which model to use for offline inference embedding?

A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
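
As a quick sketch (assuming a vLLM build with embedding support; the exact API surface may differ between versions):

.. code-block:: python

    from vllm import LLM

    # Load an embedding model, not a generation model.
    llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)

    # ``encode`` returns one embedding per prompt.
    outputs = llm.encode(["Hello, my name is", "The capital of France is"])
    for output in outputs:
        print(len(output.outputs.embedding))  # embedding dimensionality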

----------------------------------------

Q: Can the output of a prompt vary across runs in vLLM?

A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.

In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.
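
The batched-computation effect can be illustrated outside vLLM with a small PyTorch snippet (a sketch only, not vLLM code; whether the difference is non-zero depends on hardware, dtype, and backend):

.. code-block:: python

    import torch

    torch.manual_seed(0)
    batch1 = torch.randn(32, 64, 128)
    batch2 = torch.randn(32, 128, 64)

    # The same mathematical result computed two ways: one batched matmul
    # versus per-sample matmuls. The two paths may use different reduction
    # orders and can therefore differ by a few ULPs.
    batched = torch.bmm(batch1, batch2)
    looped = torch.stack([batch1[i] @ batch2[i] for i in range(batch1.shape[0])])

    print((batched - looped).abs().max())  # often non-zero, especially on GPU or in low precision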

**Mitigation Strategies**

- For improved stability and reduced variance, use ``float32`` (see the sketch after this list). Note that this will require more memory.
- If using ``bfloat16``, switching to ``float16`` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
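
A rough sketch of these settings with the offline ``LLM`` API (illustrative only; argument names may vary across vLLM versions):

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Run the model in float32 for more stable numerics (at the cost of memory).
    llm = LLM(model="facebook/opt-125m", dtype="float32")

    # A per-request seed makes sampling with temperature > 0 more repeatable,
    # though precision-related differences can still occur.
    params = SamplingParams(temperature=0.8, seed=42)
    outputs = llm.generate(["The future of AI is"], params)
    print(outputs[0].outputs[0].text)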