[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model #17326
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
@ekagra-ranjan Thanks for the PR!
One issue with the PR is that it assumes PP=1. Can you please handle PP > 1 as well (at least for llama)?
This pull request has merge conflicts that must be resolved before it can be merged.
@ekagra-ranjan Could you please update the PR? If handling PP is tricky, you can simply check the …
```python
skip_prefixes=(["lm_head."]
               if self.config.tie_word_embeddings else None),
```
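For context, the fragment above would sit inside a load_weights method roughly like the sketch below; this assumes vLLM's AutoWeightsLoader helper and is not the exact diff under review.

```python
from vllm.model_executor.models.utils import AutoWeightsLoader

def load_weights(self, weights):
    # Method on the draft model class (sketch). Skip lm_head.* only when the
    # word embeddings are tied: a tied lm_head is recovered from embed_tokens,
    # so no separate lm_head weight needs to be loaded.
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=(["lm_head."]
                       if self.config.tie_word_embeddings else None),
    )
    return loader.load_weights(weights)
```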
The EAGLE model definition doesn't have an lm_head, nor any lm_head weights that would need to be removed.
@ekagra-ranjan Do you mean EAGLE1 doesn't have the LM head? I'm wondering because some EAGLE3 weights do include the LM head.
EAGLE1 reuses the lm_head of the target model for each spec step, whereas EAGLE3 does not. For example, yuhuili/EAGLE-LLaMA3-Instruct-8B has these weights:
Number of weights: 10
Key: layers.0.self_attn.q_proj.weight, Shape: torch.Size([4096, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.k_proj.weight, Shape: torch.Size([1024, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.v_proj.weight, Shape: torch.Size([1024, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.o_proj.weight, Shape: torch.Size([4096, 4096]), Dtype: torch.float16
Key: layers.0.mlp.gate_proj.weight, Shape: torch.Size([14336, 4096]), Dtype: torch.float16
Key: layers.0.mlp.up_proj.weight, Shape: torch.Size([14336, 4096]), Dtype: torch.float16
Key: layers.0.mlp.down_proj.weight, Shape: torch.Size([4096, 14336]), Dtype: torch.float16
Key: layers.0.post_attention_layernorm.weight, Shape: torch.Size([4096]), Dtype: torch.float16
Key: embed_tokens.weight, Shape: torch.Size([128256, 4096]), Dtype: torch.float16
Key: fc.weight, Shape: torch.Size([4096, 8192]), Dtype: torch.float16
EAGLE1 sets the target model's lm_head as the draft's lm_head here.
EAGLE3's lm_head is not the same as the target model's; this is noted in #16937 (comment) as well.
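To make the sharing pattern concrete, here is a rough sketch; the attribute names (draft_model.model.embed_tokens, lm_head) are illustrative assumptions, not the exact PR code.

```python
import torch.nn as nn

def share_target_weights(draft_model: nn.Module, target_model: nn.Module) -> None:
    """Illustrative sketch of the weight sharing discussed above."""
    # EAGLE1 reuses the target model's lm_head at every speculative step,
    # which is why the draft checkpoint listed above ships no lm_head weights.
    draft_model.lm_head = target_model.lm_head
    # This PR applies the same idea to the input embedding: the draft points
    # at the target's embed_tokens instead of keeping a duplicate copy.
    # For Llama 3 (128256 x 4096, fp16) that duplicate is ~1 GB.
    draft_model.model.embed_tokens = target_model.model.embed_tokens
```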
@WoosukKwon Done! For PP > 1, the target model's embedding would be on rank 0 whereas the drafter runs on the last rank, so the drafter's embedding cannot be shared with the target's. In that case, the current code expects the embedding weights to be present in the draft checkpoint during weight loading when using PP, and raises an exception if they are not.
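A sketch of that PP-aware behavior follows; the function and flag names are illustrative, and get_pp_group is vLLM's pipeline-parallel group accessor.

```python
from vllm.distributed import get_pp_group

def maybe_share_embed_tokens(draft_model, target_model,
                             draft_has_embed: bool) -> None:
    """Illustrative sketch of the PP handling described above."""
    if get_pp_group().world_size == 1:
        # Single pipeline stage: the target's embed_tokens lives on the same
        # rank as the drafter, so the tensor can simply be aliased.
        draft_model.model.embed_tokens = target_model.model.embed_tokens
    elif not draft_has_embed:
        # With PP > 1 the target's embedding is on rank 0 while the drafter
        # runs on the last rank, so sharing is not possible; the draft
        # checkpoint must then provide its own embed_tokens weights.
        raise ValueError("Draft checkpoint must contain embed_tokens "
                         "weights when pipeline parallelism is enabled.")
```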
@ekagra-ranjan Left some minor comments. Please check them out.
Co-authored-by: Woosuk Kwon <[email protected]>
@ekagra-ranjan Please fix the lint errors.
@WoosukKwon - Done!
[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model (vllm-project#17326) Co-authored-by: root <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>
@ekagra-ranjan @WoosukKwon I believe the scenario where the EAGLE-3 draft model has different embedding weights than the target model is not properly handled in the current implementation. This specifically applies to the EAGLE-3 head for the Llama 3.3 70B model (yuhuili/EAGLE3-LLaMA3.3-Instruct-70B).
@singh-git10 - it's being addressed here: #19033
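One illustrative way such a guard could look (a sketch, not the actual fix in #19033): only alias the target's embedding when the draft checkpoint does not bring its own.

```python
def should_share_embed_tokens(draft_weight_names: set[str]) -> bool:
    """Sketch: share the target's embed_tokens only if the draft checkpoint
    carries no embedding of its own (EAGLE-3 heads such as
    yuhuili/EAGLE3-LLaMA3.3-Instruct-70B may ship different embeddings)."""
    return not any(name.endswith("embed_tokens.weight")
                   for name in draft_weight_names)
```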
This PR: