[V0 deprecation] Remove V0 CPU #20437
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @bigPYJ1151, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request initiates the deprecation of V0 CPU components by extensively refactoring the CPU attention backend to align with the V1 architecture. It introduces a new, more flexible attention metadata structure and abstracts paged attention operations, while updating CI tests to reflect the ongoing transition and new V1 CPU capabilities.
Highlights
- CPU Attention Backend Refactor: The CPU attention backend (`vllm/v1/attention/backends/cpu_attn.py`) has been significantly refactored to align with the V1 architecture, leveraging PyTorch's `scaled_dot_product_attention` for core attention computations.
- Enhanced Attention Metadata: A new and expanded `TorchSDPAMetadata` class has been introduced to support diverse attention types (decoder, encoder, cross-attention) and chunked prefill, providing a more robust metadata structure for V1 attention.
- Paged Attention Abstraction: Paged attention operations are now abstracted into `_PagedAttention` and `_IPEXPagedAttention` classes, enabling dynamic selection of IPEX-optimized paths when available, improving flexibility and performance (see the sketch after this list).
- CI Test Adjustments: Continuous Integration (CI) tests for CPU have been updated to disable V0-specific attention kernel tests and certain model tests, while adjusting others to reflect V1 CPU capabilities and limitations (e.g., float32 support, sliding window).
- V0 CPU Deprecation Enforcement: Explicit `pytest.skip` conditions have been added or modified in various test files to ensure that V0 CPU paths are no longer tested for certain functionalities (e.g., reward models, some quantization tests), reinforcing the transition to V1.
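The following is a minimal, hedged sketch of the dynamic-selection pattern mentioned in the Paged Attention Abstraction bullet. The class names `_PagedAttention` and `_IPEXPagedAttention` come from the PR summary, but the method shown, its signature, and the cache layout are illustrative assumptions, not the actual vLLM implementation.

```python
# Illustrative sketch only -- not the vLLM code. It shows the general idea of
# selecting an IPEX-optimized paged-attention helper when
# intel_extension_for_pytorch is importable, and falling back to plain PyTorch
# otherwise. Method and layout details are assumptions.
import importlib.util

import torch


class _PagedAttention:
    """Plain-PyTorch fallback for paged-attention helper ops."""

    @staticmethod
    def split_kv_cache(kv_cache: torch.Tensor, num_kv_heads: int,
                       head_size: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Assumed layout: [2, num_blocks, block_size, num_kv_heads * head_size].
        key_cache, value_cache = kv_cache.unbind(0)
        return key_cache, value_cache


class _IPEXPagedAttention(_PagedAttention):
    """Variant that would dispatch to IPEX kernels where available."""
    # Real IPEX kernel calls omitted; this subclass only marks the fast path.


def _get_paged_attn_impl() -> type[_PagedAttention]:
    # Prefer the IPEX-backed helper only if the extension is installed.
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        return _IPEXPagedAttention
    return _PagedAttention


if __name__ == "__main__":
    impl = _get_paged_attn_impl()
    print("Selected paged-attention helper:", impl.__name__)
```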
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request aims to transition CPU support from V0 to V1, incorporating Intel Extension for PyTorch (IPEX) for performance optimizations. The changes involve updating test configurations to reflect this shift, enabling float32 support for V1 CPU, and refactoring the CPU attention backend to leverage IPEX. While the overall direction is clear, some test configurations contain logical inconsistencies, and there are minor code redundancies and an outdated error message that should be addressed.
The error message `KV sharing is not supported in V0.` seems incorrect for a file located in `vllm/v1/attention/backends/`. It should probably state that KV sharing is not supported by this specific V1 CPU backend, or generally not supported by this attention implementation, rather than referencing V0.
Suggested change:

    - if kv_sharing_target_layer_name is not None:
    -     raise NotImplementedError("KV sharing is not supported in V0.")
    + if kv_sharing_target_layer_name is not None:
    +     raise NotImplementedError("KV sharing is not supported by TorchSDPABackendImpl.")
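For context, here is a minimal sketch of where such a guard typically sits: in the backend implementation's constructor. The class and parameter names echo the suggestion above; the surrounding signature is an assumption for illustration, not the actual vLLM code.

```python
# Hedged sketch: a constructor-level guard with the clearer error message.
# Only kv_sharing_target_layer_name is taken from the suggestion above; the
# rest of the signature is assumed for illustration.
from typing import Optional


class TorchSDPABackendImpl:
    def __init__(self,
                 kv_sharing_target_layer_name: Optional[str] = None) -> None:
        if kv_sharing_target_layer_name is not None:
            # Name the backend that lacks the feature instead of "V0".
            raise NotImplementedError(
                "KV sharing is not supported by TorchSDPABackendImpl.")
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name


# Usage: constructing the backend with KV sharing requested fails loudly.
try:
    TorchSDPABackendImpl(kv_sharing_target_layer_name="model.layers.0.attn")
except NotImplementedError as exc:
    print(exc)
```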
The expression `prefill_meta.prefill_metadata.chunked_prefill` appears to be a redundant access. It should likely be `prefill_meta.chunked_prefill`.
Suggested change:

    - if not prefill_meta.prefill_metadata.chunked_prefill:  # type: ignore
          assert attn_metadata.seq_lens is not None
    + if not prefill_meta.chunked_prefill:  # type: ignore
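To make the redundancy concrete, here is a hedged toy model of the access pattern. The real `TorchSDPAMetadata` is more involved; the property below is an assumption used only to show that the extra `.prefill_metadata` hop adds nothing once `prefill_meta` already is the prefill-side object.

```python
# Toy model (an assumption, not vLLM's class) of the access pattern above.
from dataclasses import dataclass


@dataclass
class TorchSDPAMetadata:
    chunked_prefill: bool = False

    @property
    def prefill_metadata(self) -> "TorchSDPAMetadata":
        # Returns the prefill view of this metadata (here simply self).
        return self


attn_metadata = TorchSDPAMetadata(chunked_prefill=True)
prefill_meta = attn_metadata.prefill_metadata

# Both spellings read the same flag, so the shorter one is preferred:
assert prefill_meta.chunked_prefill
assert prefill_meta.prefill_metadata.chunked_prefill  # redundant extra hop
```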
Signed-off-by: jiang1.li <[email protected]>
Force-pushed from 5bd88ca to ca8fac5.
Signed-off-by: jiang1.li <[email protected]>
Hi @WoosukKwon, this PR removed all CPU attention code in V0 and made all CPU tests use V1. All fast checks and the CPU tests passed. Please take a look and merge it into #20412, thanks!
Merged commit ec0ff9f into vllm-project:woosuk/remove-v0-part1.
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Test Plan
Test Result
(Optional) Documentation Update