
Conversation

@bigPYJ1151 (Member) commented on Jul 3, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • Migrate all CPU attention code to V1
  • Make all CPU tests use V1

Test Plan

Test Result

(Optional) Documentation Update

@github-actions bot commented on Jul 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment

Summary of Changes

Hello @bigPYJ1151, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request initiates the deprecation of V0 CPU components by extensively refactoring the CPU attention backend to align with the V1 architecture. It introduces a new, more flexible attention metadata structure and abstracts paged attention operations, while updating CI tests to reflect the ongoing transition and new V1 CPU capabilities.

Highlights

  • CPU Attention Backend Refactor: The CPU attention backend (vllm/v1/attention/backends/cpu_attn.py) has been significantly refactored to align with the V1 architecture, leveraging PyTorch's scaled_dot_product_attention for core attention computations.
  • Enhanced Attention Metadata: A new and expanded TorchSDPAMetadata class has been introduced to support diverse attention types (decoder, encoder, cross-attention) and chunked prefill, providing a more robust metadata structure for V1 attention.
  • Paged Attention Abstraction: Paged attention operations are now abstracted into _PagedAttention and _IPEXPagedAttention classes, enabling dynamic selection of IPEX-optimized paths when available and improving flexibility and performance (a simplified sketch of this selection pattern follows the list).
  • CI Test Adjustments: Continuous Integration (CI) tests for CPU have been updated to disable V0-specific attention kernel tests and certain model tests, while adjusting others to reflect V1 CPU capabilities and limitations (e.g., float32 support, sliding window).
  • V0 CPU Deprecation Enforcement: Explicit pytest.skip conditions have been added or modified in various test files to ensure that V0 CPU paths are no longer tested for certain functionalities (e.g., reward models, some quantization tests), reinforcing the transition to V1.
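
The selection pattern mentioned above can be pictured roughly as follows. This is a simplified, hypothetical sketch (the method and function names here are invented for illustration), not the PR's actual implementation in vllm/v1/attention/backends/cpu_attn.py:

```python
# Simplified sketch of the IPEX-vs-fallback selection described above
# (hypothetical helper names; the real classes are more involved).
import torch
import torch.nn.functional as F


class _PagedAttention:
    """Plain-PyTorch fallback for paged-attention helper ops."""

    @staticmethod
    def prefill_attention(query: torch.Tensor, key: torch.Tensor,
                          value: torch.Tensor,
                          attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Prefill-phase attention via PyTorch's fused SDPA kernel.
        return F.scaled_dot_product_attention(query, key, value,
                                              attn_mask=attn_mask)


class _IPEXPagedAttention(_PagedAttention):
    """Variant that would route to IPEX-optimized kernels where they exist."""


def select_paged_attention_impl() -> type[_PagedAttention]:
    # Prefer the IPEX path only when intel_extension_for_pytorch imports cleanly.
    try:
        import intel_extension_for_pytorch  # noqa: F401
        return _IPEXPagedAttention
    except ImportError:
        return _PagedAttention
```

Keeping the IPEX variant as a subclass lets a backend swap in optimized kernels per operation while falling back to the portable SDPA path everywhere else.
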
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request aims to transition CPU support from V0 to V1, incorporating Intel Extension for PyTorch (IPEX) for performance optimizations. The changes involve updating test configurations to reflect this shift, enabling float32 support for V1 CPU, and refactoring the CPU attention backend to leverage IPEX. While the overall direction is clear, some test configurations contain logical inconsistencies, and there are minor code redundancies and an outdated error message that should be addressed.

Comment on lines +42 to +47 (severity: high)

The BAAI/bge-base-en-v1.5 model is marked with pytest.mark.cpu_model but also pytest.mark.skip_v1. Given the comment "CPU only supports V1", this combination means the test will never actually run on CPU. This appears to be a logical contradiction in the test configuration.
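
As a hypothetical illustration of the conflict (not the actual test source), a parametrized case carrying both marks would look like the sketch below; because CPU only runs V1, the skip_v1 mark excludes the very case the cpu_model mark selects:

```python
# Hypothetical sketch of the conflicting marks described above.
import pytest

EMBEDDING_MODELS = [
    pytest.param(
        "BAAI/bge-base-en-v1.5",
        # cpu_model selects this case for CPU runs, but skip_v1 excludes it
        # from V1; since CPU only supports V1, the case never runs on CPU.
        marks=[pytest.mark.cpu_model, pytest.mark.skip_v1],
    ),
]


@pytest.mark.parametrize("model_name", EMBEDDING_MODELS)
def test_embedding_model(model_name: str) -> None:
    ...
```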

Comment on lines +454 to +455 (severity: medium)

The error message KV sharing is not supported in V0. seems incorrect for a file located in vllm/v1/attention/backends/. It should probably state that KV sharing is not supported by this specific V1 CPU backend, or generally not supported by this attention implementation, rather than referencing V0.

Suggested change:

      if kv_sharing_target_layer_name is not None:
  -       raise NotImplementedError("KV sharing is not supported in V0.")
  +       raise NotImplementedError("KV sharing is not supported by TorchSDPABackendImpl.")

Comment on lines +595 to +596 (severity: medium)

The expression prefill_meta.prefill_metadata.chunked_prefill appears to be a redundant access. It should likely be prefill_meta.chunked_prefill.

Suggested change:

  -   if not prefill_meta.prefill_metadata.chunked_prefill:  # type: ignore
  +   if not prefill_meta.chunked_prefill:  # type: ignore
          assert attn_metadata.seq_lens is not None

Comment on lines +606 to +607 (severity: medium)

The line output = torch.empty_like(query) is redundant here. It is already called on line 593 before the if statement, and its scope covers this else block.

                ipex_modules.PagedAttention.flash_attn_varlen_func(
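
To make the flagged redundancy concrete, here is a minimal, self-contained illustration of the control flow the reviewer describes; the function and argument names are hypothetical and stand in for the backend's actual SDPA and IPEX varlen paths:

```python
# Minimal illustration of the double allocation flagged above
# (hypothetical names; the real else-branch calls IPEX's varlen attention).
import torch


def forward(query: torch.Tensor, chunked_prefill: bool) -> torch.Tensor:
    output = torch.empty_like(query)   # allocated once, before the branch
    if not chunked_prefill:
        output.copy_(query)            # stand-in for the SDPA prefill path
    else:
        # The reviewed code repeated `output = torch.empty_like(query)` here,
        # which is redundant: the allocation above already covers this branch.
        output.copy_(query * 2)        # stand-in for the IPEX varlen path
    return output


out = forward(torch.randn(4, 8), chunked_prefill=True)
```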

Signed-off-by: jiang1.li <[email protected]>
@bigPYJ1151 force-pushed the remove_v0 branch 2 times, most recently from 5bd88ca to ca8fac5 on July 3, 2025 at 17:30
Signed-off-by: jiang1.li <[email protected]>
@bigPYJ1151 changed the title from "[WIP][V0 deprecation] Remove V0 CPU" to "[V0 deprecation] Remove V0 CPU" on Jul 4, 2025
@bigPYJ1151 (Member, Author) commented:

Hi @WoosukKwon, this PR removed all CPU attention code in V0 and made all CPU tests use V1. All fast checks and the CPU tests passed.

Please take a look and merge it to #20412, thanks!

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 4, 2025
@WoosukKwon merged commit ec0ff9f into vllm-project:woosuk/remove-v0-part1 on Jul 4, 2025
20 of 34 checks passed
The github-project-automation bot moved this from In Progress to Done in V0 Deprecation on Jul 4, 2025

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done


2 participants