Rename eagle cache dir #19027
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
The way the eagle head caching works in vLLM today is:
- There is a base model. The base model gets a hash, which is used as its cache dir.
- The eagle head has its own model, which is pre-determined by the hash of the base model. The eagle head needs its own cache dir.

This PR updates the name of that hash dir to `{base_model}-{eagle_method}` for readability.
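For illustration, here is a minimal sketch of the naming scheme described above, assuming hypothetical helper names, hash inputs, and cache root; it is not vLLM's actual implementation:

```python
# Hypothetical sketch of the cache-dir naming described above; the helper
# names, hash inputs, and cache root are assumptions, not vLLM's real code.
import hashlib
from pathlib import Path

def base_model_cache_name(model_name: str, config_repr: str) -> str:
    # The base model gets a hash, which is used as its cache dir name
    # (the "{base_model}" part of the template).
    return hashlib.sha256(f"{model_name}:{config_repr}".encode()).hexdigest()[:16]

def eagle_cache_dir(cache_root: Path, base_model: str, eagle_method: str) -> Path:
    # The eagle head gets its own cache dir, named "{base_model}-{eagle_method}"
    # so that it is readable and tied to the base model it belongs to.
    return cache_root / f"{base_model}-{eagle_method}"

# Example (illustrative values only):
# root = Path.home() / ".cache" / "vllm"
# base = base_model_cache_name("meta-llama/Llama-3-8B-Instruct", "<config repr>")
# print(eagle_cache_dir(root, base, "eagle"))
```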
Test Plan:
- Ran `python vllm/examples/offline_inference/eagle.py` and checked the cache directory name.
Signed-off-by: rzou <[email protected]>
```python
# calls in a single model, please open an issue and let's discuss.
speculative_config = self.vllm_config.speculative_config
if (speculative_config is not None and speculative_config.use_eagle()):
    if compilation_counter.num_graphs_seen == 1:
```
If we have multiple layers or a graph break, how do we handle this?
This PR improves on the previous state; it doesn't change anything about either case.

> multiple layers

`support_torch_compile` gets applied on models with multiple layers. Example:
vllm/vllm/model_executor/models/gemma3.py, lines 345 to 346 at ca2f6b9:

```python
@support_torch_compile
class Gemma3Model(nn.Module):
```
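As a hedged illustration of that point (the decorator below is a made-up stand-in, not vLLM's `support_torch_compile`): decorating the whole `nn.Module` means all of its layers get traced into a single compiled graph.

```python
# Hypothetical stand-in for a class decorator like support_torch_compile;
# this is not vLLM's implementation, just an illustration that decorating
# the whole module compiles all of its layers as one graph.
import torch
import torch.nn as nn

def compile_whole_model(cls):
    # Replace forward with a compiled version; the whole model is one graph.
    cls.forward = torch.compile(cls.forward, fullgraph=True)
    return cls

@compile_whole_model
class TinyModel(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        # The loop over layers is unrolled into the single traced graph.
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

# model = TinyModel()
# print(model(torch.randn(2, 8)).shape)  # all 4 layers compiled together
```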
> graph break

My understanding is that there are no graph breaks in vLLM; `fullgraph` is set to True by default:
vllm/vllm/compilation/wrapper.py, line 46 at ca2f6b9:

```python
fullgraph=envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE,
```
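To make the `fullgraph` point concrete, here is a minimal sketch using plain PyTorch (not vLLM code; `f` is a made-up function): with `fullgraph=True`, `torch.compile` raises on a graph break instead of silently splitting the function into multiple graphs.

```python
# Minimal illustration, assuming plain PyTorch (not vLLM code): with
# fullgraph=True, a graph break raises an error instead of silently
# producing multiple compiled graphs.
import torch

def f(x):
    # Data-dependent Python branching on a tensor value causes a graph break.
    if x.sum() > 0:
        return x * 2
    return x - 1

compiled = torch.compile(f, fullgraph=True)
try:
    compiled(torch.randn(4))
except Exception as err:
    # torch.compile rejects the graph break under fullgraph=True.
    print("graph break rejected:", type(err).__name__)
```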
I added #19064 to address this problem; please take a look. The problem with this PR is that it cannot generalize to vision encoders in the future. I expect we might have the following compilation in the end:
not needed anymore