[V0 deprecation] Remove V0 CPU #20437
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @bigPYJ1151, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request initiates the deprecation of V0 CPU components by extensively refactoring the CPU attention backend to align with the V1 architecture. It introduces a new, more flexible attention metadata structure and abstracts paged attention operations, while updating CI tests to reflect the ongoing transition and new V1 CPU capabilities.
Highlights
- CPU Attention Backend Refactor: The CPU attention backend (`vllm/v1/attention/backends/cpu_attn.py`) has been significantly refactored to align with the V1 architecture, leveraging PyTorch's `scaled_dot_product_attention` for core attention computations.
- Enhanced Attention Metadata: A new and expanded `TorchSDPAMetadata` class has been introduced to support diverse attention types (decoder, encoder, cross-attention) and chunked prefill, providing a more robust metadata structure for V1 attention.
- Paged Attention Abstraction: Paged attention operations are now abstracted into `_PagedAttention` and `_IPEXPagedAttention` classes, enabling dynamic selection of IPEX-optimized paths when available, improving flexibility and performance (see the sketch after this list).
- CI Test Adjustments: Continuous Integration (CI) tests for CPU have been updated to disable V0-specific attention kernel tests and certain model tests, while adjusting others to reflect V1 CPU capabilities and limitations (e.g., float32 support, sliding window).
- V0 CPU Deprecation Enforcement: Explicit `pytest.skip` conditions have been added or modified in various test files to ensure that V0 CPU paths are no longer tested for certain functionalities (e.g., reward models, some quantization tests), reinforcing the transition to V1.
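The following is a minimal, hedged sketch of the dynamic-selection pattern mentioned in the Paged Attention Abstraction bullet. The class names `_PagedAttention` and `_IPEXPagedAttention` come from the PR summary, but the method shown, its signature, and the cache layout are illustrative assumptions, not the actual vLLM implementation.

```python
# Illustrative sketch only -- not the vLLM code. It shows the general idea of
# selecting an IPEX-optimized paged-attention helper when
# intel_extension_for_pytorch is importable, and falling back to plain PyTorch
# otherwise. Method and layout details are assumptions.
import importlib.util

import torch


class _PagedAttention:
    """Plain-PyTorch fallback for paged-attention helper ops."""

    @staticmethod
    def split_kv_cache(kv_cache: torch.Tensor, num_kv_heads: int,
                       head_size: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Assumed layout: [2, num_blocks, block_size, num_kv_heads * head_size].
        key_cache, value_cache = kv_cache.unbind(0)
        return key_cache, value_cache


class _IPEXPagedAttention(_PagedAttention):
    """Variant that would dispatch to IPEX kernels where available."""
    # Real IPEX kernel calls omitted; this subclass only marks the fast path.


def _get_paged_attn_impl() -> type[_PagedAttention]:
    # Prefer the IPEX-backed helper only if the extension is installed.
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        return _IPEXPagedAttention
    return _PagedAttention


if __name__ == "__main__":
    impl = _get_paged_attn_impl()
    print("Selected paged-attention helper:", impl.__name__)
```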
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request aims to transition CPU support from V0 to V1, incorporating Intel Extension for PyTorch (IPEX) for performance optimizations. The changes involve updating test configurations to reflect this shift, enabling float32 support for V1 CPU, and refactoring the CPU attention backend to leverage IPEX. While the overall direction is clear, some test configurations contain logical inconsistencies, and there are minor code redundancies and an outdated error message that should be addressed.
The error message `KV sharing is not supported in V0.` seems incorrect for a file located in `vllm/v1/attention/backends/`. It should probably state that KV sharing is not supported by this specific V1 CPU backend, or generally not supported by this attention implementation, rather than referencing V0.
Suggested change:

    - if kv_sharing_target_layer_name is not None:
    -     raise NotImplementedError("KV sharing is not supported in V0.")
    + if kv_sharing_target_layer_name is not None:
    +     raise NotImplementedError("KV sharing is not supported by TorchSDPABackendImpl.")
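For context, here is a minimal sketch of where such a guard typically sits: in the backend implementation's constructor. The class and parameter names echo the suggestion above; the surrounding signature is an assumption for illustration, not the actual vLLM code.

```python
# Hedged sketch: a constructor-level guard with the clearer error message.
# Only kv_sharing_target_layer_name is taken from the suggestion above; the
# rest of the signature is assumed for illustration.
from typing import Optional


class TorchSDPABackendImpl:
    def __init__(self,
                 kv_sharing_target_layer_name: Optional[str] = None) -> None:
        if kv_sharing_target_layer_name is not None:
            # Name the backend that lacks the feature instead of "V0".
            raise NotImplementedError(
                "KV sharing is not supported by TorchSDPABackendImpl.")
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name


# Usage: constructing the backend with KV sharing requested fails loudly.
try:
    TorchSDPABackendImpl(kv_sharing_target_layer_name="model.layers.0.attn")
except NotImplementedError as exc:
    print(exc)
```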
The expression `prefill_meta.prefill_metadata.chunked_prefill` appears to be a redundant access. It should likely be `prefill_meta.chunked_prefill`.
Suggested change:

    - if not prefill_meta.prefill_metadata.chunked_prefill:  # type: ignore
          assert attn_metadata.seq_lens is not None
    + if not prefill_meta.chunked_prefill:  # type: ignore
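To make the redundancy concrete, here is a hedged toy model of the access pattern. The real `TorchSDPAMetadata` is more involved; the property below is an assumption used only to show that the extra `.prefill_metadata` hop adds nothing once `prefill_meta` already is the prefill-side object.

```python
# Toy model (an assumption, not vLLM's class) of the access pattern above.
from dataclasses import dataclass


@dataclass
class TorchSDPAMetadata:
    chunked_prefill: bool = False

    @property
    def prefill_metadata(self) -> "TorchSDPAMetadata":
        # Returns the prefill view of this metadata (here simply self).
        return self


attn_metadata = TorchSDPAMetadata(chunked_prefill=True)
prefill_meta = attn_metadata.prefill_metadata

# Both spellings read the same flag, so the shorter one is preferred:
assert prefill_meta.chunked_prefill
assert prefill_meta.prefill_metadata.chunked_prefill  # redundant extra hop
```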
Signed-off-by: jiang1.li <[email protected]>
Force-pushed from 5bd88ca to ca8fac5.
Signed-off-by: jiang1.li <[email protected]>
Hi @WoosukKwon, this PR removed all CPU attention code in V0 and made all CPU tests use V1. All fast checks and the CPU tests passed. Please take a look and merge it into #20412, thanks!
Merged commit ec0ff9f into vllm-project:woosuk/remove-v0-part1.
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Test Plan
Test Result
(Optional) Documentation Update