[KV offload][2/N] Introduce LRU-based CPU offloading management #20075

orozery · 2025-06-25T12:16:01Z

This PR introduces an OffloadingManager with a pluggable backend, which evicts blocks by LRU.
A CPU backend is included.

Part of the work described in RFC #19854
Depends on #19848

gemini-code-assist

Summary of Changes

Hello @orozery, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational system for managing the offloading of KV (Key-Value) data in vLLM v1. It establishes a flexible, pluggable architecture with an abstract OffloadingManager and Backend interface, enabling different storage mediums. The initial implementation provides an LRU-based management strategy and a concrete CPU memory backend, which is crucial for efficient memory utilization and handling of large language models.

Highlights

New Offloading System Core: I've introduced an abstract OffloadingManager interface for managing KV data offloading, along with LoadStoreSpec for medium-specific data handling. This establishes the foundational architecture for flexible offloading.
LRU-based Management: I've implemented LRUOffloadingManager, a concrete offloading manager that applies a Least Recently Used (LRU) eviction policy. This manager works with a pluggable Backend interface to decide which blocks to evict when space is needed.
CPU Offloading Backend: I've added a concrete CPUBackend implementation for the offloading manager. This allows KV blocks to be efficiently offloaded to and managed within CPU memory, complete with block allocation and freeing mechanisms.
Storage Medium Specifications: I've defined specific LoadStoreSpec implementations for different storage mediums, including GPULoadStoreSpec and CPULoadStoreSpec, to standardize how block data is referenced across various storage types.
Comprehensive Testing and CI Integration: I've included comprehensive unit tests for the LRUOffloadingManager and CPUBackend to verify correct functionality, including block storage, loading, lookup, and eviction logic. These tests have also been integrated into the .buildkite/test-pipeline.yaml to ensure continuous validation in the CI/CD process.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This PR introduces an LRU-based CPU offloading management system. It includes an OffloadingManager with a pluggable backend and a CPU backend. The system evicts blocks based on the LRU algorithm. I have suggested some improvements to the test cases and the LRU manager.

tests/v1/offloading/test_cpu_manager.py

vllm/v1/offloading/lru_manager.py

github-actions · 2025-06-25T13:07:18Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

vllm/v1/offloading/abstract.py

KuntaiDu · 2025-07-31T21:39:59Z

Interested in this PR, and just wondering if the PR is currently runnable or not.

ApostaC · 2025-09-09T04:34:21Z

Hey @orozery , I'm going to review this PR. IIUC, the core difference between this PR and #19848 are just cpu.py and lru_manager.py, right?

orozery · 2025-09-09T06:49:16Z

Hey @orozery , I'm going to review this PR. IIUC, the core difference between this PR and #19848 are just cpu.py and lru_manager.py, right?

Correct.
More generally, each PR introduces only a single commit, so you can limit the review to the last commit.
When I push changes I always rebase so there's only a single commit for a PR.

ApostaC

Otherwise LGTM!

ApostaC · 2025-09-10T00:53:07Z

vllm/v1/offloading/lru_manager.py

+        raise NotImplementedError
+
+
+class Backend(ABC):


Do you think it would be better to put the Backend class into a separate file? I feel like, rather than belonging to the lru_manager, it looks like an independent module that has a well-defined set of functionality.

A proposal:

Have a file called backend.py that has the definition of the Backend class.

The lru_manager will do from vllm.v1.offloading.backend import Backend

Have another folder called offload_backends/ under vllm/v1/offloading/, and put the CPU and potential future backend implementation there.

vllm/v1/offloading/lru_manager.py

ApostaC · 2025-09-10T01:13:45Z

vllm/v1/offloading/lru_manager.py

+            if not self.blocks.get(block_hash):
+                continue
+            self.blocks.move_to_end(block_hash)


Will this also work? Just to simplify the code a bit.

Suggested change

if not self.blocks.get(block_hash):

continue

self.blocks.move_to_end(block_hash)

if block_hash in self.blocks:

self.blocks.move_to_end(block_hash)

ApostaC · 2025-09-10T01:18:03Z

vllm/v1/offloading/lru_manager.py

+    def lookup(self, block_hashes: list[int]) -> int:
+        hit_count = 0
+        for block_hash in block_hashes:
+            block = self.blocks.get(block_hash)
+            if block is None or not block.is_ready:
+                break
+            hit_count += 1
+        return hit_count


Will there be any race condition that the block is evicted after lookup counts the block as hit? This could potentially cause some correctness issues in the scheduler.

For shared storage you're correct - there's a race.
But shared storage should actually not use LRU manager as the state is too big to keep in memory, and you would actually evict by some less exact mechanism than LRU (e.g. keeping each block for a fixed amount of time).
For CPU or local disk there is no race since they are fully owned by this specific scheduler instance.

vllm/v1/offloading/lru_manager.py

ApostaC

LGTM!

njhill

Thanks @orozery. I think my comments are all style/efficiency related.

njhill · 2025-09-12T21:20:54Z

vllm/v1/kv_offload/backend.py

+from vllm.v1.offloading.abstract import LoadStoreSpec
+
+
+class BlockStatus(ctypes.Structure):


Curious about the use of c struct for this in particular?

I want to reduce the memory footprint per block.
From what I've read this allows to use more compact representation.

vllm/v1/offloading/lru_manager.py

njhill · 2025-09-12T21:40:37Z

vllm/v1/offloading/backends/cpu.py

+        self.block_id = block_id
+
+    def get_load_store_spec(self, block_hash: int) -> CPULoadStoreSpec:
+        return CPULoadStoreSpec(self.block_id)


I may not be seeing the whole picture but do we need these wrapper types? We'll be creating many python objects when there's a lot of blocks...

Changed to a single wrapper per list of blocks

vllm/v1/offloading/lru_manager.py

njhill · 2025-09-12T21:53:45Z

vllm/v1/kv_offload/lru_manager.py

+        to_evict = []
+        if num_blocks_to_evict > 0:
+            for block_hash, block in self.blocks.items():
+                if block.ref_cnt == 0:


Would it be worth maintaining a count of unused blocks? then this check would be more efficient.

We need to evict blocks by their LRU.
Right now I'm using a single ordered dict that holds all blocks, both "used" and "unused" (ref_cnt==0).
Most blocks should be unused, especially those whose LRU is the oldest, so this should be pretty good.
I don't see a good way to optimize it.

vllm/v1/offloading/lru_manager.py

pytorch-bot · 2025-09-15T15:32:41Z

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

This commit introduces an OffloadingManager with a pluggable backend, which evicts blocks by LRU. A CPU backend is included. Signed-off-by: Or Ozeri <[email protected]>

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: charlifu <[email protected]>

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

orozery requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners June 25, 2025 12:16

gemini-code-assist bot reviewed Jun 25, 2025

View reviewed changes

mergify bot added ci/build v1 labels Jun 25, 2025

gemini-code-assist bot reviewed Jun 25, 2025

View reviewed changes

guymguym reviewed Jul 1, 2025

View reviewed changes

vllm/v1/offloading/abstract.py Outdated Show resolved Hide resolved

orozery force-pushed the cpu-offloading-manager branch 3 times, most recently from a0cb125 to 11d5135 Compare July 8, 2025 10:19

josephrocca mentioned this pull request Jul 9, 2025

[RFC]: KV cache offloading #19854

Open

1 task

orozery force-pushed the cpu-offloading-manager branch from 11d5135 to d141661 Compare July 23, 2025 10:12

vMaroon mentioned this pull request Jul 28, 2025

vLLM Native CPU Offloading Connector llm-d/llm-d-kv-cache-manager#67

Open

orozery force-pushed the cpu-offloading-manager branch 2 times, most recently from a346c1b to 5c7cf02 Compare August 5, 2025 06:24

orozery mentioned this pull request Aug 10, 2025

[KV offload][4/N] Offloading KV connector #22595

Merged

orozery force-pushed the cpu-offloading-manager branch 5 times, most recently from 74d048a to dd72909 Compare August 14, 2025 12:38

orozery force-pushed the cpu-offloading-manager branch from dd72909 to 852be73 Compare September 4, 2025 08:17

orozery mentioned this pull request Sep 4, 2025

[KV offload][5/N] Add CPUOffloadingSpec #24251

Merged

orozery force-pushed the cpu-offloading-manager branch from 852be73 to e74cb03 Compare September 8, 2025 06:13

orozery mentioned this pull request Sep 9, 2025

[KV offload][1/N] Introduce an offloading component #19848

Merged

ApostaC reviewed Sep 10, 2025

View reviewed changes

orozery force-pushed the cpu-offloading-manager branch from e74cb03 to 3877a0b Compare September 10, 2025 08:03

ApostaC approved these changes Sep 10, 2025

View reviewed changes

ApostaC added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 11, 2025

njhill approved these changes Sep 12, 2025

View reviewed changes

orozery force-pushed the cpu-offloading-manager branch from 3877a0b to 220a35f Compare September 15, 2025 15:32

pytorch-bot bot removed the ci/build label Sep 15, 2025

mergify bot added the ci/build label Sep 15, 2025

orozery force-pushed the cpu-offloading-manager branch from 220a35f to 43b44ec Compare September 17, 2025 09:33

njhill approved these changes Sep 17, 2025

View reviewed changes

njhill changed the title ~~v1: Introduce LRU-based CPU offloading management~~ [KV offload][2/N] Introduce LRU-based CPU offloading management Sep 18, 2025

v1: Introduce LRU-based CPU offloading management

2f04f0b

This commit introduces an OffloadingManager with a pluggable backend, which evicts blocks by LRU. A CPU backend is included. Signed-off-by: Or Ozeri <[email protected]>

orozery force-pushed the cpu-offloading-manager branch from 43b44ec to 2f04f0b Compare September 18, 2025 21:05

njhill enabled auto-merge (squash) September 18, 2025 21:38

njhill merged commit 9d1c50a into vllm-project:main Sep 19, 2025
41 checks passed

ywang96 pushed a commit to ywang96/vllm that referenced this pull request Sep 19, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

df25287

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

6778802

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

9e81d2a

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

0e4cffa

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: charlifu <[email protected]>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

9b5084f

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

6af0601

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

be08e8c

…-project#20075) Signed-off-by: Or Ozeri <[email protected]>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

[KV offload][2/N] Introduce LRU-based CPU offloading management (vllm…

f17e0d4

…-project#20075) Signed-off-by: Or Ozeri <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

		from vllm.v1.offloading.abstract import LoadStoreSpec


		class BlockStatus(ctypes.Structure):

Uh oh!

[KV offload][2/N] Introduce LRU-based CPU offloading management #20075

[KV offload][2/N] Introduce LRU-based CPU offloading management #20075

Uh oh!

Conversation

orozery commented Jun 25, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jun 25, 2025

Uh oh!

Uh oh!

KuntaiDu commented Jul 31, 2025

Uh oh!

ApostaC commented Sep 9, 2025

Uh oh!

orozery commented Sep 9, 2025

Uh oh!

ApostaC left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ApostaC left a comment

Choose a reason for hiding this comment

Uh oh!

njhill left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pytorch-bot bot commented Sep 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

orozery commented Jun 25, 2025 •

edited by github-actions bot

Loading