[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B #19033
Conversation
def current_memory_usage(self) -> float:
    # Return the memory usage in bytes.
    from vllm.platforms import current_platform

    gc.collect()
@WoosukKwon can share whether gc.collect() and torch.cuda.empty_cache() are fine here. Maybe there is some reason why they were not already added before.
I believe this was added because we delete some torch tensor after allocation. In case we decide it's better to avoid these new GC calls, an alternative approach would be to first load the draft model weights from the checkpoint, determine whether the draft vocab embedding is needed, and then pass this information to the draft model's instantiation so it can skip allocating the draft vocab embedding and achieve the same objective.
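For illustration, a minimal sketch of that alternative flow. The helper draft_checkpoint_has_embedding, the needs_embed_tokens flag, and the commented-out constructor argument are hypothetical names, not existing vLLM APIs:

```python
# Sketch: inspect the draft checkpoint first, then tell the draft model
# whether it needs its own vocab embedding, so nothing has to be freed later.
from safetensors import safe_open

def draft_checkpoint_has_embedding(checkpoint_path: str) -> bool:
    """Return True if the draft checkpoint ships its own embed_tokens weight."""
    with safe_open(checkpoint_path, framework="pt") as f:
        return any(name.endswith("embed_tokens.weight") for name in f.keys())

# At draft-model construction time, the flag decides whether to allocate
# the draft vocab embedding at all.
needs_embed_tokens = draft_checkpoint_has_embedding("eagle_head.safetensors")
# draft_model = EagleLlamaForCausalLM(config, init_embed_tokens=needs_embed_tokens)
```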
I think the current approach makes sense, as enforcing GC and clearing the torch cache seem like natural choices to improve the accuracy of the memory profiler.
If we foresee any issues with calling GC/cleanup in this way, then I'm on board with doing it the other way.
I agreed with @ekagra-ranjan, though I didn’t see a clear problem. Let’s keep this in mind and revisit if any issue arises.
I feel like this kind of bug keeps repeating, probably because 1) we lack tests on EAGLE and 2) I'm not a great reviewer for EAGLE's weight loading (I don't have enough background).
I can contribute additional testing for EAGLE in the next couple of days.
@benchislett Do you have any comparison of the impact of the reduced hidden size on acceptance length versus the speedup, as was done for the Llama 3.3 70B EAGLE-3 head?
No, this was an architecture decision made at training time by the EAGLE3 authors. I do not recall any comments from the authors regarding the ablation of this parameter.
@WoosukKwon This PR should be ready, PTAL. I've updated the weight-loading test to cover the edge cases more exhaustively, and also added some more thorough acceptance-rate tests in #19104.
Hi @benchislett, thanks a lot for this PR! Super useful. Do you see an OTPS improvement of the officially released EAGLE-3 70B head over EAGLE-2 with your code changes? Previously we were able to see an improvement in acceptance rates but could not see an OTPS gain; that was with vLLM before your fix, using the EAGLE-3 70B head. Many thanks!
Sorry to follow up: I pulled and installed your PR and saw an error when vLLM loads Llama 3.3 70B from the official release. I wonder if you see the same issue, or whether I configured something wrongly on my end. Here are the details: Thank you very much!
@Neo9061 I worry that this might be an inherent limitation of the architecture used for the EAGLE3 heads. If it seems correct to you when running with max_seq_len==2048, I am inclined to think that this is the cause. Regarding the speedup, I have not done comparison benchmarks with EAGLE1, but the EAGLE3 head does seem quite performant. If you identify a degradation compared to EAGLE1, please share the details and we will work to improve the performance. Thank you for the feedback and early testing.
@Neo9061 In the last benchmark, we saw that EAGLE-3 on vLLM has better AL than EAGLE-1, but the output TPS is worse in online serving. Offline bench: #16937 (comment). Can you please share your exact setup for comparison?
This was also before torch.compile support was added to EAGLE/EAGLE3, I think. I expect that those numbers would look different now.
@ekagra-ranjan @benchislett Sure, we will share the data soon. Basically, in online inference for 70B, using the official EAGLE-2 head and EAGLE-3 head, we saw the acceptance rate improve by 29% and the head size shrink by 50%, but the OTPS gain is only 26%. We are using llmperf + vLLM to benchmark. @singh-git10 will provide a GitHub issue soon.
@ekagra-ranjan @benchislett I have created a GitHub issue here. Please let me know if any other details are needed to reproduce the results.
A previous PR allowed the EAGLE vocab embeddings to be omitted from initialization during weight loading, operating under the assumption that they are always overridden by the embeddings from the target model. However, the EAGLE3 head for Llama 3.3 70B has a different hidden size from the target model, and thus a distinct vocab embedding.
Currently, Llama 70B is unusable on main for this reason: the vocab embedding must be initialized in that case. I updated the model code to always declare the vocab embedding and to load it when it is present in the checkpoint. Then, the EAGLE load_weights frees the old embedding and replaces it with the target model's embeddings only when they have the same shape.
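A rough sketch of that shape-gated sharing, assuming the usual Llama-style model.embed_tokens attribute path; this is illustrative and omits the real loader plumbing:

```python
import torch.nn as nn

def maybe_share_embed_tokens(draft_model: nn.Module,
                             target_embed_tokens: nn.Embedding) -> None:
    """Share the target model's vocab embedding with the draft head,
    but only when the shapes match. Heads with a different hidden size
    keep the embedding loaded from their own checkpoint."""
    draft_embed = draft_model.model.embed_tokens  # assumed attribute path
    if draft_embed.weight.shape == target_embed_tokens.weight.shape:
        # Drop the separately allocated draft embedding and reuse the target's.
        del draft_model.model.embed_tokens
        draft_model.model.embed_tokens = target_embed_tokens
    # Otherwise: keep the draft's own (differently shaped) embedding.
```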
Notes
Added gc.collect and torch.cuda.empty_cache to the memory profiling so that the deallocated vocab embedding will not affect memory profiling. I confirmed locally that this works as intended, showing 1 GB less at the memory profiling stage.
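For illustration only, a standalone sketch of why those two calls make the measurement more accurate (this is not the actual vLLM profiler code; the function name is made up):

```python
import gc
import torch

def approx_gpu_memory_in_use() -> int:
    """Bytes of GPU memory currently in use on the default device.

    Without the gc.collect()/empty_cache() calls, memory held by a
    deleted tensor (e.g. the discarded draft vocab embedding) may still
    be counted, because the caching allocator keeps the blocks around.
    """
    gc.collect()               # release Python references to deleted tensors
    torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes - free_bytes
```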