[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model #17326
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
@ekagra-ranjan Thanks for the PR!
One issue with the PR is that it assumes PP=1. Can you please handle PP > 1 as well (at least for llama)?
This pull request has merge conflicts that must be resolved before it can be merged.
@ekagra-ranjan Could you please update the PR? If handling PP is tricky, you can simply check the …
```python
skip_prefixes=(["lm_head."]
               if self.config.tie_word_embeddings else None),
```
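For context, the fragment above would sit inside a load_weights method roughly like the sketch below; this assumes vLLM's AutoWeightsLoader helper and is not the exact diff under review.

```python
from vllm.model_executor.models.utils import AutoWeightsLoader

def load_weights(self, weights):
    # Method on the draft model class (sketch). Skip lm_head.* only when the
    # word embeddings are tied: a tied lm_head is recovered from embed_tokens,
    # so no separate lm_head weight needs to be loaded.
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=(["lm_head."]
                       if self.config.tie_word_embeddings else None),
    )
    return loader.load_weights(weights)
```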
The EAGLE model definition doesn't have an lm_head, nor any lm_head weights that would need to be removed.
@ekagra-ranjan Do you mean EAGLE1 doesn't have the LM head? I'm wondering because some EAGLE3 weights do include the LM head.
EAGLE1 reuses the lm_head of the target model for each spec step, whereas EAGLE3 does not. For example, yuhuili/EAGLE-LLaMA3-Instruct-8B has these weights:
Number of weights: 10
Key: layers.0.self_attn.q_proj.weight, Shape: torch.Size([4096, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.k_proj.weight, Shape: torch.Size([1024, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.v_proj.weight, Shape: torch.Size([1024, 4096]), Dtype: torch.float16
Key: layers.0.self_attn.o_proj.weight, Shape: torch.Size([4096, 4096]), Dtype: torch.float16
Key: layers.0.mlp.gate_proj.weight, Shape: torch.Size([14336, 4096]), Dtype: torch.float16
Key: layers.0.mlp.up_proj.weight, Shape: torch.Size([14336, 4096]), Dtype: torch.float16
Key: layers.0.mlp.down_proj.weight, Shape: torch.Size([4096, 14336]), Dtype: torch.float16
Key: layers.0.post_attention_layernorm.weight, Shape: torch.Size([4096]), Dtype: torch.float16
Key: embed_tokens.weight, Shape: torch.Size([128256, 4096]), Dtype: torch.float16
Key: fc.weight, Shape: torch.Size([4096, 8192]), Dtype: torch.float16
EAGLE1 sets the target model's lm_head as the draft's lm_head here.
EAGLE3's lm_head is not the same as the target model's; this is noted in #16937 (comment) as well.
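To make the sharing pattern concrete, here is a rough sketch; the attribute names (draft_model.model.embed_tokens, lm_head) are illustrative assumptions, not the exact PR code.

```python
import torch.nn as nn

def share_target_weights(draft_model: nn.Module, target_model: nn.Module) -> None:
    """Illustrative sketch of the weight sharing discussed above."""
    # EAGLE1 reuses the target model's lm_head at every speculative step,
    # which is why the draft checkpoint listed above ships no lm_head weights.
    draft_model.lm_head = target_model.lm_head
    # This PR applies the same idea to the input embedding: the draft points
    # at the target's embed_tokens instead of keeping a duplicate copy.
    # For Llama 3 (128256 x 4096, fp16) that duplicate is ~1 GB.
    draft_model.model.embed_tokens = target_model.model.embed_tokens
```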
@WoosukKwon Done! For PP > 1, the target model's embedding would be on rank 0 whereas the drafter runs on the last rank, so the drafter's embedding cannot be shared with the target's. In that case, the current code expects the embedding weights to be present in the draft checkpoint during weight loading when using PP, and raises an exception if they are not.
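A sketch of that PP-aware behavior follows; the function and flag names are illustrative, and get_pp_group is vLLM's pipeline-parallel group accessor.

```python
from vllm.distributed import get_pp_group

def maybe_share_embed_tokens(draft_model, target_model,
                             draft_has_embed: bool) -> None:
    """Illustrative sketch of the PP handling described above."""
    if get_pp_group().world_size == 1:
        # Single pipeline stage: the target's embed_tokens lives on the same
        # rank as the drafter, so the tensor can simply be aliased.
        draft_model.model.embed_tokens = target_model.model.embed_tokens
    elif not draft_has_embed:
        # With PP > 1 the target's embedding is on rank 0 while the drafter
        # runs on the last rank, so sharing is not possible; the draft
        # checkpoint must then provide its own embed_tokens weights.
        raise ValueError("Draft checkpoint must contain embed_tokens "
                         "weights when pipeline parallelism is enabled.")
```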
@ekagra-ranjan Left some minor comments. Please check them out.
Co-authored-by: Woosuk Kwon <[email protected]>
@ekagra-ranjan Please fix the lint errors.
@WoosukKwon - Done!
[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model (vllm-project#17326) Co-authored-by: root <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>
@ekagra-ranjan @WoosukKwon I believe the scenario where the EAGLE-3 draft model has different embedding weights than the target model is not properly handled in the current implementation. This specifically applies to the EAGLE-3 head for the Llama 3.3 70B model (yuhuili/EAGLE3-LLaMA3.3-Instruct-70B).
@singh-git10 - it's being addressed here: #19033
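One illustrative way such a guard could look (a sketch, not the actual fix in #19033): only alias the target's embedding when the draft checkpoint does not bring its own.

```python
def should_share_embed_tokens(draft_weight_names: set[str]) -> bool:
    """Sketch: share the target's embed_tokens only if the draft checkpoint
    carries no embedding of its own (EAGLE-3 heads such as
    yuhuili/EAGLE3-LLaMA3.3-Instruct-70B may ship different embeddings)."""
    return not any(name.endswith("embed_tokens.weight")
                   for name in draft_weight_names)
```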
This PR: