[V1][Spec Decode] Eagle Model loading #16035
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
I'm sure this has been discussed elsewhere, but why introduce a V1-specific model? (e.g. it looked like @luyuzhe111 expected …)
# We need to set the vllm_config here to register attention
# layers in the forward context.
Do we need to call `load_model()` from `__init__()` so that this function runs and the attn layers are registered?
Could you elaborate a bit on which `load_model` you are talking about?
The one in this file, `spec_decode/eagle.py`, here: https://github.com/vllm-project/vllm/pull/16035/files/59ee450306d3d719f78ad60c77ba9b739bc5cb11#diff-a4809a837fbf535a8f0999b11087a53ec1c53948b50c0a1fe64396bc86de9461R184
# We need to set the vllm_config here to register attention
# layers in the forward context.
with set_default_torch_dtype(
        draft_model_config.dtype), set_current_vllm_config(
- I didn't get the part which says that setting the current vllm config will lead to registering the attn layers. Can you please share more? My understanding is that we did not change any `self.vllm_config` in this function, and the attn layers are registered when the model is initialized, which saves the attn prefix in `static_forward_context`, which is then used during `bind_kv_cache()`.
- So if `load_model()` is called in `__init__` in this file, would it not register the attn func without the need for `set_current_vllm_config(self.vllm_config)`?
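For readers following this thread, here is a minimal, self-contained sketch of the registration mechanism under discussion (names like `VllmConfig`, `Attention`, and `build_draft_model` are illustrative stand-ins, not vLLM's actual classes): layers register themselves into whichever config is "current" at construction time, which is why the draft model has to be built inside `set_current_vllm_config(...)`.

```python
# Illustrative sketch of the registration pattern discussed above.
# Names are hypothetical; vLLM's real implementation lives in vllm/config.py
# and the attention layer code.
import contextlib

class VllmConfig:
    def __init__(self):
        # maps layer prefix -> layer; later consumed by bind_kv_cache()
        self.static_forward_context = {}

_current_config = None

@contextlib.contextmanager
def set_current_vllm_config(config):
    global _current_config
    prev, _current_config = _current_config, config
    try:
        yield
    finally:
        _current_config = prev

class Attention:
    def __init__(self, prefix: str):
        # Each attention layer registers itself in the *current* config.
        if _current_config is not None:
            _current_config.static_forward_context[prefix] = self

def build_draft_model(config: VllmConfig):
    # Only layers constructed inside set_current_vllm_config(config)
    # end up in config.static_forward_context.
    with set_current_vllm_config(config):
        return [Attention(f"model.layers.{i}.attn") for i in range(2)]

cfg = VllmConfig()
build_draft_model(cfg)
print(list(cfg.static_forward_context))  # the registered attn prefixes
```

In other words, per the quoted code comment, calling `load_model()` from `__init__()` would only register the draft model's attention layers if the model is constructed while the intended config is current, which is what the `set_current_vllm_config(self.vllm_config)` context manager arranges.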
Which `load_model` are you referring to? This one?
The one in spec_decode/eagle.py here: https://github.com/vllm-project/vllm/pull/16035/files/59ee450306d3d719f78ad60c77ba9b739bc5cb11#diff-a4809a837fbf535a8f0999b11087a53ec1c53948b50c0a1fe64396bc86de9461R184
I have broken my question above into two parts, along with my understanding, so that it is easier for you to explain what I am missing. Looking forward to your response.
Hi folks, thanks for all the comments so far @markmc @ekagra-ranjan. I am double-checking the correctness and have not started addressing the comments yet; I will start fixing them tomorrow. Some updates:
Please start review and check correctness. cc @luyuzhe111 @WoosukKwon.
@LiuXiaoxuanPKU Good results! I was wondering how we are able to run EAGLE given that Tasks 2 and 3 in #15901 are WIP? What are the implications/assumptions of these with respect to the results shared in this PR?
Thanks for asking:
@LiuXiaoxuanPKU - are the results on gsm8k? If possible, can you run them on MTBench so that we can compare the results with SGL and identify gaps? https://docs.google.com/document/d/18ETJLsnxR88Qq3VDk5Mq-Hb7vuE9o3VNZ-hhz-OqAXk/edit?usp=sharing
@LiuXiaoxuanPKU Hi Lily, when running with …
This pull request has merge conflicts that must be resolved before it can be merged.
I benchmarked this PR on MTBench. I wired up the SD metrics from the Scheduler into EngineCoreOutputs so that we can test correctness via the Accept Length. Currently, the SD metrics in the Scheduler get reinitialized every engine step and are not aggregated, so I fixed that. Here is a dummy PR with the benchmarking script and the changes I made on top of this PR to get the results below. Please let me know if something is incorrect in my setup. I can also raise a PR with the SD metric if that makes sense, with some extra steps. Here is the cmd used:
k=2 is 36% faster and k=4 is 28% faster than vanilla. Compared to the SGL bench:
The vLLM AL formula is here.

Trends:
I feel the formulas for AL are similar, so that shouldn't be the cause of the difference in AL, but we are getting a lower AL for k=4. Please let me know your thoughts or if I missed something.
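For concreteness, here is a minimal sketch of the acceptance-length bookkeeping being compared, assuming AL is defined as the mean number of tokens emitted per target-model step (one token from the target model plus however many draft tokens were accepted); the class and field names are illustrative, not vLLM's actual spec-decode metrics:

```python
# Illustrative acceptance-length (AL) aggregation across engine steps.
# Field names are hypothetical; vLLM's real spec-decode metrics differ.
from dataclasses import dataclass

@dataclass
class SpecDecodeStats:
    num_draft_tokens: int = 0     # total draft tokens proposed
    num_accepted_tokens: int = 0  # total draft tokens accepted
    num_target_steps: int = 0     # target-model verification steps

    def update(self, accepted_per_step: list[int], k: int) -> None:
        # Aggregate across engine steps instead of reinitializing each step.
        self.num_draft_tokens += k * len(accepted_per_step)
        self.num_accepted_tokens += sum(accepted_per_step)
        self.num_target_steps += len(accepted_per_step)

    @property
    def acceptance_length(self) -> float:
        # One target token per step plus the accepted draft tokens.
        return 1 + self.num_accepted_tokens / self.num_target_steps

stats = SpecDecodeStats()
stats.update(accepted_per_step=[2, 0, 1, 2], k=2)  # toy per-step counts
print(f"AL = {stats.acceptance_length:.2f}")       # AL = 2.25
```

If both vLLM and SGL compute AL roughly this way, the formula itself should not explain a gap at k=4; the difference would have to come from what each framework counts as accepted.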
@ekagra-ranjan This PR itself is not enough to support EAGLE correctly. We need to handle the draft model's KV cache properly.
@ekagra-ranjan Hi Ekagra, thanks for sharing the benchmarking results! I can confirm the acceptance length you collected for vLLM should be accurate, as I aggregate acceptance length from request-level, per-step acceptance counts in PR #16367. I suspect you might have used EAGLE-2 in SGLang, which would explain the larger acceptance length for k = 4. Can you double-check that? I will include the acceptance length comparison of vLLM against the EAGLE-1 repo soon. Regardless, I think we should merge this PR soon to unblock further development. Without this PR, we can't even debug : )
@luyuzhe111 - thank you for your response. In SGL, I am using chain-based drafts, where …
Hi @ekagra-ranjan, if it's indeed chain drafts, then I don't think the acceptance length in SGL makes sense. Basically, the result is saying that the first two draft positions have 0.72 tokens accepted, and the next two draft positions also have 0.68 tokens accepted. Even in the best-case scenario (mean number of tokens accepted at each step = [1, 0.37, 0.35, 0.34, 0.33], assuming a minimal acceptance-rate drop between draft steps), it's impossible to have 0.68 tokens accepted at the third and fourth positions combined.
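A quick arithmetic check of this argument, using the best-case per-step numbers quoted in the comment above (the chain-draft assumption being that a token at draft position i can only be accepted if all earlier positions were accepted, so per-position acceptance cannot increase with depth):

```python
# Sanity check of the chain-draft argument above (toy arithmetic only).
per_step_best_case = [1, 0.37, 0.35, 0.34, 0.33]  # numbers from the comment
first_two = sum(per_step_best_case[1:3])   # draft positions 1 and 2
next_two = sum(per_step_best_case[3:5])    # draft positions 3 and 4
print(f"{first_two:.2f}")  # 0.72 -> matches the k=2 result
print(f"{next_two:.2f}")   # 0.67 -> strictly below the 0.68 reported for k=4
```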
For reference, these are the acceptance length comparisons between the EAGLE repo, vLLM v0, and vLLM v1, on MT Bench, assuming single-draft. Acceptance length is computed using #16367.

When max number of generated tokens = 256:
When max number of generated tokens = 512:
Observations:
Hope the numbers here from the EAGLE repo can serve as a reference for future development efforts. cc @LiuXiaoxuanPKU @WoosukKwon
Hi @luyuzhe111, great work! Nice to see such early benchmarking!
IIUC, this PR is intended to allocate cache slots for the draft model, which should not affect the acceptance rate? If the implementation is good, I assume the gap is more related to the sampling/rejection method, and there are indeed some gaps, like https://github.com/vllm-project/vllm/blob/main/vllm/v1/spec_decode/eagle.py#L219. Could you maybe double-check whether the default sampling parameters are consistent with those reported in the original EAGLE?
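For context on why the acceptance mechanism (rather than cache-slot allocation) is the more likely place for an AL gap, here is the greedy acceptance rule in sketch form (illustrative only; vLLM's rejection sampler is more general and also handles stochastic sampling):

```python
# Greedy speculative-decoding acceptance (illustrative, not vLLM's code):
# a draft token is accepted iff it matches the target model's argmax at the
# same position, and acceptance stops at the first mismatch (chain drafting).
def count_accepted(draft_tokens: list[int], target_argmax: list[int]) -> int:
    accepted = 0
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            break
        accepted += 1
    return accepted

print(count_accepted([11, 42, 7], [11, 42, 9]))  # -> 2 (mismatch at position 3)
```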
I have the same understanding. Can someone please share why #16370 would improve correctness?
LGTM! Thanks for the PR!
@wwl2755 @ekagra-ranjan Hi Wenlong and Ekagra, thanks for the comments. I actually don't understand #16370 too well and was hoping to dive deeper. Maybe we can keep the discussion under that PR now that this PR is merged? Regarding the hypothesis on the acceptance mechanism, I don't think the sampling parameters are the issue, since I used greedy sampling for both the EAGLE repo and vLLM.
@luyuzhe111 Thanks for sharing your observation! The issue was that SGL uses …. Using 1 draft token, SGL gets 1.72 AL.
Task 1 of #15901
Some limitations:
How to run this PR:
python examples/offline_inference/eagle.py
Ignore the metrics used/printed in that file.
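For readers without the repo checked out, a rough sketch of what such an offline run looks like (the `speculative_config` keys and model names below are assumptions, not a copy of `examples/offline_inference/eagle.py`; check that script for the exact arguments):

```python
# Hypothetical sketch of offline EAGLE speculative decoding with vLLM V1.
# The speculative_config keys and model names are assumptions; see
# examples/offline_inference/eagle.py for the exact usage.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # target model (assumed)
    speculative_config={
        "method": "eagle",                            # draft method
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE draft head (assumed)
        "num_speculative_tokens": 2,                  # k draft tokens per step
    },
)

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=64)
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```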