[Model] Allow users to control skip reading cache per request. #28194
Conversation
Code Review
This pull request introduces a mechanism to disable prefix caching on a per-request basis for scenarios where it can lead to incorrect output, specifically for requests involving prompt_logprobs or certain pooling tasks. The changes correctly add a not_using_prefix_caching flag to PoolingParams and SamplingParams and use it in the KVCacheManager. However, there is a critical bug in the implementation for SamplingParams where the logic to set this flag is inverted, which would cause significant performance degradation and incorrect behavior. I have provided a comment with a suggested fix for this critical issue.
💡 Codex Review
Here are some automated review suggestions for this pull request.
/gemini review

cc @DarkLight1337
Code Review
This pull request introduces a mechanism to control prefix caching on a per-request basis, which is necessary for features like prompt logprobs and 'all' pooling that are incompatible with prefix caching. The changes involve adding a disable_prefix_caching flag to SamplingParams and PoolingParams and updating the KV cache manager to respect this flag. The implementation for pooling parameters is correct, but there is a critical logic error in how the flag is set for sampling parameters, which I've commented on. The rest of the changes and the overall approach look good.
Signed-off-by: wang.yuqi <[email protected]>
Force-pushed b3fd081 to 363d087
vllm/v1/request.py (Outdated)

    return len(self._output_token_ids)

@property
def skip_reading_cache(self) -> bool:
To accelerate it a bit, you can:
- initialize the value of skip_reading_cache in process_inputs of vllm/v1/engine/processor.py
- put it in EngineCoreRequest
- copy it from EngineCoreRequest to Request via from_engine_core_request
This way, skip_reading_cache will be computed in the frontend process rather than in the engine core busy loop. (Though there won't be much speedup, I want to avoid unnecessary operations in KVCacheManager as much as possible.)
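The suggestion above can be sketched as follows. The field and function names are hypothetical simplifications of `process_inputs`, `EngineCoreRequest`, and `Request.from_engine_core_request`, not the real vLLM signatures; the point is that the flag is derived once in the frontend and merely copied in the engine core.

```python
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:
    prompt_token_ids: list[int]
    # Computed once in the frontend (process_inputs), so the engine
    # core busy loop never has to re-derive it.
    skip_reading_cache: bool = False


class Request:
    def __init__(self, prompt_token_ids: list[int], skip_reading_cache: bool):
        self.prompt_token_ids = prompt_token_ids
        self.skip_reading_cache = skip_reading_cache

    @classmethod
    def from_engine_core_request(cls, req: EngineCoreRequest) -> "Request":
        # Copy the precomputed flag instead of recomputing it here.
        return cls(req.prompt_token_ids, req.skip_reading_cache)


def process_inputs(
    prompt_token_ids: list[int], wants_prompt_logprobs: bool
) -> EngineCoreRequest:
    # Frontend process: derive the flag once per request.
    return EngineCoreRequest(
        prompt_token_ids, skip_reading_cache=wants_prompt_logprobs
    )
```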
I think the get_skip_reading_prefix_cache logic best belongs in the Request.
I have cached the result of skip_reading_prefix_cache.
Yeah, caching also makes sense to me.
emmm... what about
heheda12345
left a comment
LGTM!
…project#28194)
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: Bram Wasti <[email protected]>
Purpose
Even if prefix caching is enabled, the cache must not be read in the following two scenarios:
- requests with prompt_logprobs
- pooling requests using ALL pooling
Otherwise, the output might contain fewer than n_prompt_tokens entries.
A request that skips reading the cache can still write to the cache, to accelerate subsequent requests.
Address #27145 (comment)
Test Plan
tests/models/language/pooling/test_extract_hidden_states.py
Test Result
main:
Even if chunked_prefill is not enabled, prefix_caching + ALL pooling still causes the following error:
(EngineCore_DP0 pid=2893225) AssertionError: partial prefill not supported with ALL pooling
this PR:
Prefix cache reading is turned off for ALL pooling requests.
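The before/after behavior can be sketched with a toy scheduler function (hypothetical, not the actual test in tests/models/language/pooling/test_extract_hidden_states.py): with the flag, an ALL-pooling request computes its full prompt instead of tripping the partial-prefill assertion on a cache hit.

```python
def schedule_prompt(
    tokens: list[int], cached_prefix_len: int, pooling_task: str
) -> int:
    """Return how many prompt tokens will actually be computed."""
    # This PR's behavior: ALL pooling skips reading the prefix cache.
    skip_reading_cache = pooling_task == "all"
    hit = 0 if skip_reading_cache else cached_prefix_len
    n_computed = len(tokens) - hit
    # ALL pooling needs hidden states for every prompt token; a cache
    # hit would make this a partial prefill and trigger the assertion
    # seen on main.
    assert not (pooling_task == "all" and n_computed < len(tokens)), (
        "partial prefill not supported with ALL pooling"
    )
    return n_computed
```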
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.