[Spec Decode][Hybrid] Add ngram-eagle SD method #24344
Conversation
Code Review

This pull request introduces a new speculative decoding method, ngram-eagle, which combines n-gram based proposals with EAGLE proposals. The changes include adding the new method to configuration options, updating the speculative decoding logic to handle the combined approach, and modifying the example script to support it. The implementation correctly initializes both n-gram and EAGLE proposers when ngram-eagle is selected and combines their outputs. My review found one critical issue in the configuration validation logic that should be addressed.
This is great enablement! When can we merge this into vLLM v1 mainline?
Wonder if your implementation works with EAGLE 3 in vLLM v1, and whether the performance gain you established will hold at higher concurrency levels? Many thanks!
Thanks @Neo9061. I am waiting for reviews from @WoosukKwon and @LiuXiaoxuanPKU.
EAGLE-3 is left for a future PR. It will be straightforward, and I leave it to the OSS community.
SD in general does not hold up well at very high concurrency; that limitation is not specific to this method.
# combine ngram and eagle drafts
# prefer ngram drafts when available
# choose eagle drafts when ngram drafts are empty
for bid in range(len(draft_token_ids_ngram)):
Does this mean you always use both ngram and eagle to generate speculation proposals? Isn't it more efficient to generate eagle proposals only when there are no valid ngram proposals?
In a multi-batch setting, if even one sequence is running EAGLE, the cost is almost the same as all requests running it. The current implementation is the simpler one; future improvements can further optimize the low-batch setting.
Hi @ekagra-ranjan, similar question: in your code, for each request you sequentially first use n-gram to generate draft_token_ids_ngram and then use eagle to generate draft_token_ids_eagle. Can you share insights into why such a hybrid approach can be faster than EAGLE alone? I think the speedup from the hybrid approach comes from skipping the EAGLE auto-regressive drafting when we already have some results from n-gram.
@Neo9061 while EAGLE drafting cost is certainly non-trivial, I think the main benefit of this method is that it can verify both long-but-rare speculated sequences (ngram) and short-but-accurate speculated sequences (eagle) together. This way, a deployment gets the benefits of either ngram or eagle, whichever has better prediction accuracy on each token. I think of it less as an EAGLE-taxed ngram deployment and more as an ngram-augmented EAGLE deployment that gets the widespread speedup of EAGLE as a baseline, and in some cases gets to leverage ngram for much higher AL.
Although it does look like a free win to skip EAGLE drafting when all requests in the batch get an ngram hit. This might not be likely for BS >> 1, but for low-latency single-request serving it might actually pay off noticeably.
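A minimal sketch of that skip, with hypothetical proposer objects and a hypothetical propose(batch) API returning one draft list per request (not the PR's actual code):

```python
from typing import Any

def propose_hybrid(batch: Any, ngram_proposer: Any,
                   eagle_proposer: Any) -> list[list[int]]:
    # run the cheap ngram proposer first for the whole batch
    ngram_drafts = ngram_proposer.propose(batch)  # hypothetical API
    # pay for EAGLE's auto-regressive drafting only if at least one
    # request got no ngram hit; otherwise every request has a draft
    if any(len(d) == 0 for d in ngram_drafts):
        eagle_drafts = eagle_proposer.propose(batch)  # hypothetical API
        return [n if n else e for n, e in zip(ngram_drafts, eagle_drafts)]
    return ngram_drafts
```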
I see, thanks! @benchislett
Another question, probably for @ekagra-ranjan: if we want to use a potentially longer drafting length from n-gram, will making the number of speculative tokens for the n-gram part higher than 5 give you lower TPOT? I saw that in your benchmarking above you use the same drafting length as EAGLE.
I am sure you are aware of the Suffix Decoding PR. Compared to n-gram, suffix decoding without cache (an equivalent comparison, i.e. not making the global tree cache the results from previous requests) can search over both the prompt and the intermediately generated tokens, whereas n-gram can only search over the prompt part (vllm/vllm/v1/spec_decode/ngram_proposer.py, line 178 in f1fc210: context_token_ids = token_ids_cpu[idx, :num_tokens]).
Wonder if there is any plan to integrate suffix decoding with eagle?
Please help fix the merge conflict, ty
if self.num_speculative_tokens_per_method is not None:
    if isinstance(self.num_speculative_tokens_per_method, str):
        self.num_speculative_tokens_per_method = json.loads(
            self.num_speculative_tokens_per_method)
    assert isinstance(self.num_speculative_tokens_per_method, dict), (
        "num_speculative_tokens_per_method must be a dict or a json "
        "string that can be converted to a dict.")
    assert all(
        isinstance(v, int) and v > 0
        for v in self.num_speculative_tokens_per_method.values()), (
            "All values in num_speculative_tokens_per_method must be "
            "positive integers.")
    max_num_speculative_tokens = max(
        self.num_speculative_tokens_per_method.values())
    if self.num_speculative_tokens is None:
        self.num_speculative_tokens = max_num_speculative_tokens
    else:
        assert self.num_speculative_tokens <= \
            max_num_speculative_tokens, (
                "num_speculative_tokens should be None or must be "
                "less than or equal to the max value in "
                "num_speculative_tokens_per_method.")
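For context, a usage sketch of the field this validates, via the offline API (the speculative_config dict mirrors the ngram-eagle server command later in this description; treat the exact accepted shape as an assumption, especially after the CLI-JSON simplification discussed below):

```python
from vllm import LLM

# sketch: ngram drafts up to 5 tokens, eagle up to 3; the validation
# above then sets num_speculative_tokens to max(5, 3) = 5
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram-eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens_per_method": {"ngram": 5, "eagle": 3},
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)
```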
Why bother with str? The CLI can parse JSON.
Thanks! I didn't know about that. Fixed.
if self.use_ngram() and not self.disable_padded_drafter_batch:
    logger.warning(
        "padded_drafter_batch has to be disabled with ngram. "
        "Setting disable_padded_drafter_batch to True.")
    self.disable_padded_drafter_batch = True
@benchislett - jfyi, padded_drafter_batch has been disabled by default for ngram and ngram-eagle.
I don't think it has to be disabled, but it's likely a good decision to do so.
self.propose([[]] * 1024, [""] * 1024, np.zeros(1024, dtype=np.int32),
             np.zeros((1024, self.max_model_len), dtype=np.int32),
             set())
logger.info(
Is this intended to be left in?
# use ifs and not elifs to allow multiple
# draft models to be initialized
nit: clarity. Suggested change:
- # use ifs and not elifs to allow multiple
- # draft models to be initialized
+ # allow multiple draft methods to be used together
use_padded_batch_for_eagle = self.speculative_config and \
    self.speculative_config.use_eagle() and \
-   not self.speculative_config.disable_padded_drafter_batch
+   not self.speculative_config.disable_padded_drafter_batch and \
Have you considered keeping padded_drafter_batch for eagle drafting and then doing ngram separately? If both methods are going to be used anyway, this seems possible. Do you think there's a benefit to keeping padded_drafter_batch in this case?
Addresses: #18633
Adds a new SD approach combining the best of Ngram and EAGLE. Besides work on the algorithm, this needed additional work on data and metrics, as explained below.
Algorithm
The RFC discusses the motivation and proposed algorithm in more detail. The major change is that we can now run multiple drafters in a single step. This PR only allows Ngram and Eagle to run simultaneously. This can be generalized to other combinations, but at the moment I think only ngram (a weight-free approach) paired with an approach that needs trained weights would be of value. This PR does not add Ngram + Eagle3; that is left for future work if someone is interested.
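A minimal sketch of the per-request combination rule (illustrative names, not the exact vLLM internals; the real implementation also handles per-method draft lengths):

```python
def combine_drafts(ngram_drafts: list[list[int]],
                   eagle_drafts: list[list[int]]) -> list[list[int]]:
    """Prefer the ngram draft for a request when it is non-empty,
    otherwise fall back to the eagle draft for that request."""
    combined = []
    for bid in range(len(ngram_drafts)):
        if ngram_drafts[bid]:
            combined.append(ngram_drafts[bid])
        else:
            combined.append(eagle_drafts[bid])
    return combined

# e.g. a request whose continuation repeats the prompt gets a long ngram
# draft; a request producing novel text falls back to its eagle draft
assert combine_drafts([[1, 2, 3], []], [[7], [8, 9]]) == [[1, 2, 3], [8, 9]]
```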
Dataset
Use Blazedit with a max normalized edit distance of 0.25, which means at most 25% of the output will differ from the input. The min edit distance in this dataset is 0.1. Check the PR out for more detail on the need for this dataset: #23605
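A rough sketch of the normalized edit distance the filter refers to, as I read it (the dataset's exact tooling may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def norm_edit_distance(src: str, dst: str) -> float:
    # normalized by the longer string, so 0.25 ~ "at most 25% changed"
    return levenshtein(src, dst) / max(len(src), len(dst), 1)
```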
Metric
The AL reported in offline_inference/spec_decode.py takes into account only the forward passes in which a draft was made for a sequence. Assume a sequence generates 10 tokens, where drafts of K=2 were made in 2 forward passes and fully accepted (3 tokens per pass, including the bonus token), and the remaining 4 tokens were generated in 4 forward passes without SD. The current formula for AL will measure it as 3, i.e., K+1. However, my intuition is that AL should measure how many tokens are accepted per forward pass, normalized across the whole sequence generation. The sequence-normalized AL would be num of total tokens generated / num of forward passes = 10 / (2+4) = 10/6 = 1.66. This sequence-normalized AL is more realistic in the sense that it gives the expected speedup from SD assuming zero overhead and inefficiency from the SD method. This correction is important to make: without it, ngram has a much higher AL than Eagle on datasets like Instruct Coder, but TPOT doesn't reflect that. The current AL computation is more like computing precision, whereas to gauge which method is better just by looking at AL (assuming zero overhead), we need the sequence-normalized AL.
This PR estimates the sequence-normalized AL by finding how many total tokens are generated, how many of them were generated by SD, how many times a draft was made, and how many tokens were generated without SD. More specifically:
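A sketch of that estimate with illustrative variable names (not the exact PR code):

```python
def seq_normalized_al(num_generated_tokens: int,
                      num_accepted_draft_tokens: int,
                      num_drafts: int) -> float:
    # each drafted forward pass emits (accepted draft tokens + 1 bonus
    # token), so tokens attributable to SD = accepted + num_drafts
    num_tokens_from_sd = num_accepted_draft_tokens + num_drafts
    num_tokens_generated_without_sd = num_generated_tokens - num_tokens_from_sd
    # one forward pass per draft, plus one per token generated without SD
    num_forward_passes = num_drafts + num_tokens_generated_without_sd
    return num_generated_tokens / num_forward_passes

# the example from the Metric section: 10 tokens total, 2 drafted passes
# each accepting K=2 tokens (+1 bonus), 4 plain passes -> 10/6 ~= 1.66
assert abs(seq_normalized_al(10, 4, 2) - 10 / 6) < 1e-9
```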
In some cases, num_tokens_generated_without_sd is negative. This is because of a boundary condition where we have fewer than K tokens left to predict, we predict K tokens, and all K are accepted, but the final output keeps fewer than K of them. This error is bounded by the fraction ((K-1) * num of samples) / (num of samples * output len per sample) = (K-1) / (output len per sample). Not all samples will run into this boundary condition, so this is an upper bound; for K=5 and output len 256 it is ~1.5%. Empirically, it was found to be <1% in the benchmarks below. Therefore, the impact of this estimation on the final results is negligible.
Benchmarks
Offline Inference (AL)
Blazedit max edit norm distance: 0.25
method: ngram-eagle
cmd:
python3 examples/offline_inference/spec_decode.py --method ngram-eagle --num-speculative-tokens-per-method "{\"ngram\": 5, \"eagle\": 3}" --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.25 --no-oversample --print-output
output
higher precision ngram-eagle by increasing --prompt_lookup_min to 5
cmd:
python3 examples/offline_inference/spec_decode.py --method ngram-eagle --num-speculative-tokens-per-method "{\"ngram\": 5, \"eagle\": 3}" --prompt_lookup_max 5 --prompt_lookup_min 5 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.25 --no-oversample
output
method: eagle
cmd:
python3 examples/offline_inference/spec_decode.py --method eagle --num_spec_tokens 3 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.25 --no-oversample --print-output
output
method: ngram
cmd:
python3 examples/offline_inference/spec_decode.py --method ngram --num_spec_tokens 5 --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.25 --no-oversample --print-output
output
MTBench
method: ngram-eagle
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram-eagle --num-speculative-tokens-per-method "{\"ngram\": 5, \"eagle\": 3}" --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output
output
higher precision ngram-eagle by increasing --prompt_lookup_min to 5
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram-eagle --num-speculative-tokens-per-method "{\"ngram\": 5, \"eagle\": 3}" --prompt_lookup_max 5 --prompt_lookup_min 5 --tp 1 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output
output
method: eagle
cmd:
python3 examples/offline_inference/spec_decode.py --method eagle --num_spec_tokens 3 --tp 1 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output
output
method: ngram
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram --num_spec_tokens 5 --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output
output
Instruct Coder
method: ngram-eagle
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram-eagle --num-speculative-tokens-per-method "{\"ngram\": 5, \"eagle\": 3}" --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path likaixin/InstructCoder --num-prompts 1000 --print-output
output
method: eagle
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method eagle --num_spec_tokens 3 --tp 1 --dataset-name hf --dataset-path likaixin/InstructCoder --num-prompts 1000 --print-output
output
method: ngram
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram --num_spec_tokens 5 --prompt_lookup_max 5 --prompt_lookup_min 2 --tp 1 --dataset-name hf --dataset-path likaixin/InstructCoder --num-prompts 1000 --print-output
output
Online Inference median TPOT (ms)
TPOT (ms)
client cmd
MTBench:
```
vllm bench serve --port 9001 --save-result --save-detailed \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint-type openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 1 \
  --result-dir "./log/EAGLE-1"
```
instruct coder
Blazedit
vanilla
server cmd:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001
eagle
server cmd:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \ --disable-log-requests --port 9001 \ --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'
ngram
server cmd:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \ --disable-log-requests --port 9001 \ --speculative_config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
ngram-eagle
server cmd:
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \ --disable-log-requests --port 9001 \ --speculative_config '{"method": "ngram-eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens_per_method": "{\"ngram\": 5, \"eagle\": 3}", "prompt_lookup_max": 5, "prompt_lookup_min": 5}'
Analysis
Impact on AL
The sequence-normalized AL of ngram-eagle is much higher for the editing task, since ngram proposes long, accurate drafts on the spans where the output matches the input, while eagle covers the divergent spans where ngram misses. Therefore, overall the AL is much better for ngram-eagle.
Overhead: ngram-eagle has similar overhead to eagle, since the drafter has to run. Its overhead is higher than ngram, as ngram doesn't need to run any drafter auto-regressively.
Performance analysis on Datasets
Empirically, the sequence-normalized AL on Blazedit when the edit distance norm is between [0.1, 0.25], as reported above in the Benchmarks section:
Theoretical Analysis
This is in line with my theoretical calculation of the AL of ngram-eagle as we change the edit distance norm.
Precision AL means the AL measured only when a draft was made. vLLM computes AL in this manner, which is incomplete information; hence in this PR I introduced a new metric in offline inference called sequence-normalized AL, which represents the AL across a whole sequence and gives the expected speedup from SD assuming zero draft overhead and implementation inefficiency. More detail on the metric is in the Metric section above.
When the input is the same as the output, ngram-eagle follows ngram's AL, and when it diverges it follows eagle's AL:
ngram-eagle AL = 1000 / ((edit_norm*1000/eagle_AL) + ((1-edit_norm)*1000/ngram_AL))
ngram AL = 1000 / ((edit_norm*1000/1) + ((1-edit_norm)*1000/ngram_AL))
The numerator is 1000 tokens and the denominator is the number of steps needed to produce them. The assumed 1000 tokens doesn't matter, since it cancels out.
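Plugging in purely illustrative ALs (hypothetical numbers, just to show the shape of the formula):

```python
def blended_al(edit_norm: float, diverge_al: float, ngram_al: float,
               n: int = 1000) -> float:
    # edit_norm fraction of the n tokens diverges from the input and is
    # drafted at diverge_al tokens/step; the rest matches the input and
    # is drafted at ngram_al tokens/step
    steps = edit_norm * n / diverge_al + (1 - edit_norm) * n / ngram_al
    return n / steps

# hypothetical: eagle_AL = 2.5, ngram_AL = 4, edit_norm = 0.25
print(blended_al(0.25, diverge_al=2.5, ngram_al=4.0))  # ngram-eagle: ~3.48
print(blended_al(0.25, diverge_al=1.0, ngram_al=4.0))  # ngram alone: ~2.29
```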
As we can see, ngram-eagle's AL is strictly better than both ngram and eagle.
Final end-to-end results
y-axis: TPOT (ms), lower is better
ngram-eagle is consistently among the fastest across the different types of datasets.
TODO