
Conversation

@orozery (Contributor) commented Jun 25, 2025

This PR introduces an OffloadingManager with a pluggable backend, which evicts blocks by LRU.
A CPU backend is included.

Part of the work described in RFC #19854
Depends on #19848
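
To make the design concrete, here is a minimal, hypothetical sketch of the core idea: the manager keeps blocks in an OrderedDict (whose order doubles as the LRU order), delegates the actual storage to a pluggable backend, and evicts the least recently used blocks when the backend runs out of space. All names below are illustrative assumptions, not the interfaces introduced by this PR.

# Illustrative sketch only -- names and signatures are assumptions,
# not the actual vLLM interfaces added in this PR.
from abc import ABC, abstractmethod
from collections import OrderedDict


class Backend(ABC):
    """Pluggable storage medium (CPU memory, local disk, ...)."""

    @abstractmethod
    def allocate(self, block_hash: int) -> bool:
        """Try to reserve space for a block; return False if the medium is full."""

    @abstractmethod
    def free(self, block_hash: int) -> None:
        """Release the space held by a block."""


class LRUOffloadingManagerSketch:
    def __init__(self, backend: Backend):
        self.backend = backend
        # Insertion order doubles as the LRU order (oldest first).
        self.blocks: OrderedDict[int, None] = OrderedDict()

    def touch(self, block_hash: int) -> None:
        # A hit makes the block most recently used.
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)

    def store(self, block_hash: int) -> bool:
        # Evict from the LRU end until the backend can hold the new block.
        while not self.backend.allocate(block_hash):
            if not self.blocks:
                return False  # nothing left to evict; the block does not fit
            victim, _ = self.blocks.popitem(last=False)
            self.backend.free(victim)
        self.blocks[block_hash] = None
        return True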

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @orozery, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational system for managing the offloading of KV (Key-Value) data in vLLM v1. It establishes a flexible, pluggable architecture with an abstract OffloadingManager and Backend interface, enabling different storage mediums. The initial implementation provides an LRU-based management strategy and a concrete CPU memory backend, which is crucial for efficient memory utilization and handling of large language models.

Highlights

  • New Offloading System Core: I've introduced an abstract OffloadingManager interface for managing KV data offloading, along with LoadStoreSpec for medium-specific data handling. This establishes the foundational architecture for flexible offloading.
  • LRU-based Management: I've implemented LRUOffloadingManager, a concrete offloading manager that applies a Least Recently Used (LRU) eviction policy. This manager works with a pluggable Backend interface to decide which blocks to evict when space is needed.
  • CPU Offloading Backend: I've added a concrete CPUBackend implementation for the offloading manager. This allows KV blocks to be efficiently offloaded to and managed within CPU memory, complete with block allocation and freeing mechanisms.
  • Storage Medium Specifications: I've defined specific LoadStoreSpec implementations for different storage mediums, including GPULoadStoreSpec and CPULoadStoreSpec, to standardize how block data is referenced across various storage types (see the sketch after this list).
  • Comprehensive Testing and CI Integration: I've included comprehensive unit tests for the LRUOffloadingManager and CPUBackend to verify correct functionality, including block storage, loading, lookup, and eviction logic. These tests have also been integrated into the .buildkite/test-pipeline.yaml to ensure continuous validation in the CI/CD process.
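
As a rough illustration of the LoadStoreSpec idea referenced above (field and method names are assumptions for the sketch, not necessarily the PR's actual definitions), a medium-specific spec is essentially a typed handle to where a block lives:

# Hypothetical sketch -- field and method names are assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass


class LoadStoreSpec(ABC):
    """Describes where block data lives on a particular storage medium."""

    @staticmethod
    @abstractmethod
    def medium() -> str:
        ...


@dataclass
class GPULoadStoreSpec(LoadStoreSpec):
    block_id: int  # index of the block inside the GPU KV cache

    @staticmethod
    def medium() -> str:
        return "GPU"


@dataclass
class CPULoadStoreSpec(LoadStoreSpec):
    block_id: int  # index of the block inside a CPU offloading buffer

    @staticmethod
    def medium() -> str:
        return "CPU"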

@gemini-code-assist (bot) left a comment

Code Review

This PR introduces an LRU-based CPU offloading management system. It includes an OffloadingManager with a pluggable backend and a CPU backend. The system evicts blocks based on the LRU algorithm. I have suggested some improvements to the test cases and the LRU manager.

@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@orozery orozery force-pushed the cpu-offloading-manager branch 3 times, most recently from a0cb125 to 11d5135 on July 8, 2025 10:19
@josephrocca josephrocca mentioned this pull request Jul 9, 2025
@orozery orozery force-pushed the cpu-offloading-manager branch from 11d5135 to d141661 on July 23, 2025 10:12

@KuntaiDu (Collaborator) commented:

Interested in this PR, and just wondering if the PR is currently runnable or not.

@orozery orozery force-pushed the cpu-offloading-manager branch 2 times, most recently from a346c1b to 5c7cf02 on August 5, 2025 06:24
@orozery orozery force-pushed the cpu-offloading-manager branch 5 times, most recently from 74d048a to dd72909 on August 14, 2025 12:38
@orozery orozery force-pushed the cpu-offloading-manager branch from dd72909 to 852be73 on September 4, 2025 08:17
@orozery orozery force-pushed the cpu-offloading-manager branch from 852be73 to e74cb03 on September 8, 2025 06:13

@ApostaC (Collaborator) commented Sep 9, 2025

Hey @orozery, I'm going to review this PR. IIUC, the core difference between this PR and #19848 is just cpu.py and lru_manager.py, right?

@orozery (Contributor, Author) commented Sep 9, 2025

> Hey @orozery, I'm going to review this PR. IIUC, the core difference between this PR and #19848 is just cpu.py and lru_manager.py, right?

Correct.
More generally, each PR introduces only a single commit, so you can limit the review to the last commit.
When I push changes I always rebase so there's only a single commit for a PR.

@ApostaC (Collaborator) left a comment

Otherwise LGTM!

        raise NotImplementedError


class Backend(ABC):

@ApostaC (Collaborator):

Do you think it would be better to put the Backend class into a separate file? I feel like, rather than belonging to the lru_manager, it looks like an independent module that has a well-defined set of functionality.

A proposal (sketched after this list):

  • Have a file called backend.py that has the definition of the Backend class.
  • The lru_manager will do from vllm.v1.offloading.backend import Backend
  • Have another folder called offload_backends/ under vllm/v1/offloading/, and put the CPU and potential future backend implementation there.
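
For clarity, a rough sketch of the proposed layout (file names here simply restate the suggestion above; the final structure is up to the author):

# Assumed layout for the proposal above:
#   vllm/v1/offloading/backend.py              - definition of the Backend ABC
#   vllm/v1/offloading/lru_manager.py          - the LRU manager
#   vllm/v1/offloading/offload_backends/cpu.py - CPUBackend and future backends
#
# lru_manager.py would then depend only on the abstract interface:
from vllm.v1.offloading.backend import Backend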

Comment on lines 131 to 133
if not self.blocks.get(block_hash):
    continue
self.blocks.move_to_end(block_hash)

@ApostaC (Collaborator):

Will this also work? Just to simplify the code a bit.

Suggested change:
-if not self.blocks.get(block_hash):
-    continue
-self.blocks.move_to_end(block_hash)
+if block_hash in self.blocks:
+    self.blocks.move_to_end(block_hash)

Comment on lines 110 to 32
def lookup(self, block_hashes: list[int]) -> int:
    hit_count = 0
    for block_hash in block_hashes:
        block = self.blocks.get(block_hash)
        if block is None or not block.is_ready:
            break
        hit_count += 1
    return hit_count

@ApostaC (Collaborator):

Will there be any race condition where a block is evicted after lookup counts it as a hit? This could potentially cause correctness issues in the scheduler.

@orozery (Contributor, Author):

For shared storage you're correct - there's a race.
But shared storage should actually not use LRU manager as the state is too big to keep in memory, and you would actually evict by some less exact mechanism than LRU (e.g. keeping each block for a fixed amount of time).
For CPU or local disk there is no race since they are fully owned by this specific scheduler instance.

@orozery orozery force-pushed the cpu-offloading-manager branch from e74cb03 to 3877a0b on September 10, 2025 08:03

@ApostaC (Collaborator) left a comment

LGTM!

@ApostaC ApostaC added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 11, 2025

@njhill (Member) left a comment

Thanks @orozery. I think my comments are all style/efficiency related.

from vllm.v1.offloading.abstract import LoadStoreSpec


class BlockStatus(ctypes.Structure):

@njhill (Member):

Curious about the use of a C struct for this in particular?

@orozery (Contributor, Author):

I want to reduce the memory footprint per block.
From what I've read, this allows a more compact representation.
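
As a rough illustration of the footprint argument (the fields below are made up for the example; the PR's actual BlockStatus layout may differ): a ctypes structure packs its fields into a few bytes, and an array of such structures shares one contiguous buffer, whereas plain Python objects each carry an object header plus a per-instance __dict__.

# Illustration only -- field names and types are assumptions.
import ctypes
import sys


class BlockStatusStruct(ctypes.Structure):
    _fields_ = [
        ("ref_cnt", ctypes.c_int32),
        ("is_ready", ctypes.c_bool),
    ]


class BlockStatusPlain:
    def __init__(self) -> None:
        self.ref_cnt = 0
        self.is_ready = False


# One contiguous buffer holding 1000 packed records (roughly 8 KB):
packed = (BlockStatusStruct * 1000)()
print(ctypes.sizeof(packed))

# 1000 separate Python objects; getsizeof counts only the object headers,
# and each per-instance __dict__ adds further overhead on top:
plain = [BlockStatusPlain() for _ in range(1000)]
print(sum(sys.getsizeof(obj) for obj in plain))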

        self.block_id = block_id

    def get_load_store_spec(self, block_hash: int) -> CPULoadStoreSpec:
        return CPULoadStoreSpec(self.block_id)

@njhill (Member):

I may not be seeing the whole picture, but do we need these wrapper types? We'll be creating many Python objects when there are a lot of blocks...

@orozery (Contributor, Author):

Changed to a single wrapper per list of blocks.
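
As a hypothetical sketch of what "a single wrapper per list of blocks" could look like (not the exact code that landed in the PR), the spec wraps one array of block IDs rather than one Python object per block:

# Hypothetical sketch -- names are assumptions, not the PR's final code.
import numpy as np


class CPULoadStoreSpec:
    """One spec object describing a whole batch of CPU blocks."""

    def __init__(self, block_ids: list[int]):
        # A single array for the whole batch instead of one object per block.
        self.block_ids = np.asarray(block_ids, dtype=np.int64)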

to_evict = []
if num_blocks_to_evict > 0:
    for block_hash, block in self.blocks.items():
        if block.ref_cnt == 0:

@njhill (Member):

Would it be worth maintaining a count of unused blocks? Then this check would be more efficient.

@orozery (Contributor, Author):

We need to evict blocks in LRU order.
Right now I'm using a single ordered dict that holds all blocks, both "used" and "unused" (ref_cnt == 0).
Most blocks should be unused, especially the least recently used ones, so this should be pretty good.
I don't see a good way to optimize it.
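
A rough sketch of the eviction scan being discussed (names are assumptions; the PR's actual method may differ): iterate the ordered dict from its least recently used end and collect only blocks whose ref_cnt is zero.

# Illustrative sketch of an LRU eviction scan that skips in-use blocks.
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Block:
    ref_cnt: int = 0  # number of in-flight transfers still using this block


def pick_blocks_to_evict(blocks: "OrderedDict[int, Block]",
                         num_blocks_to_evict: int) -> list[int]:
    """Collect up to num_blocks_to_evict unused block hashes, oldest first."""
    to_evict: list[int] = []
    if num_blocks_to_evict > 0:
        # Iteration starts at the least recently used entry, because hits
        # call move_to_end() to push blocks to the MRU end of the dict.
        for block_hash, block in blocks.items():
            if block.ref_cnt == 0:  # skip blocks still referenced by a transfer
                to_evict.append(block_hash)
                if len(to_evict) == num_blocks_to_evict:
                    break
    return to_evict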

@orozery orozery force-pushed the cpu-offloading-manager branch from 3877a0b to 220a35f on September 15, 2025 15:32
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025

@pytorch-bot (bot) commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@orozery orozery force-pushed the cpu-offloading-manager branch from 220a35f to 43b44ec on September 17, 2025 09:33
@njhill njhill changed the title v1: Introduce LRU-based CPU offloading management [KV offload][2/N] Introduce LRU-based CPU offloading management Sep 18, 2025
This commit introduces an OffloadingManager with a pluggable backend,
which evicts blocks by LRU.
A CPU backend is included.

Signed-off-by: Or Ozeri <[email protected]>
@orozery orozery force-pushed the cpu-offloading-manager branch from 43b44ec to 2f04f0b on September 18, 2025 21:05
@njhill njhill enabled auto-merge (squash) September 18, 2025 21:38
@njhill njhill merged commit 9d1c50a into vllm-project:main Sep 19, 2025
41 checks passed
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Sep 19, 2025
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

Labels: ci/build, ready, v1
