
Conversation

@noooop (Collaborator) commented Nov 6, 2025

Purpose

Even when prefix caching is enabled, the cache must not be read for the following two kinds of requests:

  • prompt logprobs
  • ALL pooling

Otherwise, the number of output tokens can be less than n_prompt_tokens.

A request that skips reading the cache can still write to the cache, accelerating subsequent requests.
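
The mechanism, as a minimal hedged sketch (the class shapes, method names, and attributes below are simplified stand-ins, not the actual vLLM implementation):

from dataclasses import dataclass, field

# Illustrative sketch only: these are not the real vLLM classes or signatures.
@dataclass
class Request:
    prompt_token_ids: list[int]
    skip_reading_cache: bool  # True for prompt-logprobs and ALL-pooling requests

@dataclass
class KVCacheManager:
    cached_prefixes: dict[tuple[int, ...], int] = field(default_factory=dict)

    def get_num_computed_tokens(self, request: Request) -> int:
        # Skip the read: the request is recomputed from token 0, so the model
        # produces an output for every one of its n_prompt_tokens positions.
        if request.skip_reading_cache:
            return 0
        # A normal longest-cached-prefix lookup would go here.
        return self.cached_prefixes.get(tuple(request.prompt_token_ids), 0)

    def cache_prompt(self, request: Request) -> None:
        # Writing is still allowed, so later requests can reuse this prefix.
        self.cached_prefixes[tuple(request.prompt_token_ids)] = len(
            request.prompt_token_ids)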

Address #27145 (comment)

Test Plan

tests/models/language/pooling/test_extract_hidden_states.py
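
For reference, a hedged sketch of the kind of end-to-end check that test performs; the model name, constructor arguments, and output attributes below are assumptions rather than the exact contents of the test file:

from vllm import LLM

# Assumptions: a small model served with the pooling runner and prefix caching on.
llm = LLM(
    model="Qwen/Qwen3-0.6B",      # assumption: any small model usable for pooling
    runner="pooling",              # assumption: pooling runner for hidden-state extraction
    enable_prefix_caching=True,
)

prompts = ["San Francisco is a " * k for k in (3, 4, 5)]
outputs = llm.encode(prompts)      # ALL pooling: one vector per prompt token (assumed)

for out in outputs:
    n_prompt_tokens = len(out.prompt_token_ids)
    n_output_tokens = out.outputs.data.shape[0]   # assumption: token-major output tensor
    # With this PR, cache reads are skipped, so no prompt token is dropped.
    assert n_output_tokens == n_prompt_tokens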

Test Result

main:

Even with chunked_prefill disabled, prefix_caching + ALL pooling still raises the following error:
(EngineCore_DP0 pid=2893225) AssertionError: partial prefill not supported with ALL pooling

n_prompt_tokens:55 n_output_tokens:55 num_cached_tokens:0
n_prompt_tokens:56 n_output_tokens:8 num_cached_tokens:48
n_prompt_tokens:57 n_output_tokens:8 num_cached_tokens:48

n_prompt_tokens:55 n_output_tokens:48 num_cached_tokens:7
n_prompt_tokens:56 n_output_tokens:8 num_cached_tokens:48
n_prompt_tokens:57 n_output_tokens:8 num_cached_tokens:48

this PR:

Reading the prefix cache is skipped for ALL pooling requests, so every prompt token produces an output.

n_prompt_tokens:55 n_output_tokens:55 num_cached_tokens:0
n_prompt_tokens:56 n_output_tokens:56 num_cached_tokens:0
n_prompt_tokens:57 n_output_tokens:57 num_cached_tokens:0

n_prompt_tokens:55 n_output_tokens:55 num_cached_tokens:0
n_prompt_tokens:56 n_output_tokens:56 num_cached_tokens:0
n_prompt_tokens:57 n_output_tokens:57 num_cached_tokens:0


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a mechanism to disable prefix caching on a per-request basis for scenarios where it can lead to incorrect output, specifically for requests involving prompt_logprobs or certain pooling tasks. The changes correctly add a not_using_prefix_caching flag to PoolingParams and SamplingParams and use it in the KVCacheManager. However, there is a critical bug in the implementation for SamplingParams where the logic to set this flag is inverted, which would cause significant performance degradation and incorrect behavior. I have provided a comment with a suggested fix for this critical issue.
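
For concreteness, the kind of inversion being flagged might look like this (a hedged illustration; the actual field name and condition in the PR may differ):

def needs_full_prompt_recompute(prompt_logprobs: int | None) -> bool:
    # Intended behavior: skip reading the prefix cache only when prompt
    # logprobs are requested, so every prompt position is recomputed.
    return prompt_logprobs is not None
    # The inverted version described above would instead be:
    #   return prompt_logprobs is None
    # which would disable cache reads for ordinary requests and keep reading
    # the cache for the requests that actually need a full recompute.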

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@noooop marked this pull request as draft on November 6, 2025 08:27
@noooop changed the title from "[Core] Allow users to control not using prefix caching per request." to "[Model] Allow users to control not using prefix caching per request." on Nov 7, 2025
@noooop marked this pull request as ready for review on November 7, 2025 06:07
@noooop (Collaborator, Author) commented Nov 7, 2025

/gemini review

@noooop (Collaborator, Author) commented Nov 7, 2025

cc @DarkLight1337
Ready for review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a mechanism to control prefix caching on a per-request basis, which is necessary for features like prompt logprobs and 'all' pooling that are incompatible with prefix caching. The changes involve adding a disable_prefix_caching flag to SamplingParams and PoolingParams and updating the KV cache manager to respect this flag. The implementation for pooling parameters is correct, but there is a critical logic error in how the flag is set for sampling parameters, which I've commented on. The rest of the changes and the overall approach look good.

Signed-off-by: wang.yuqi <[email protected]>
@noooop force-pushed the disable_prefix_caching_per_request branch from b3fd081 to 363d087 on November 7, 2025 06:59
@noooop added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Nov 7, 2025
noooop and others added 5 commits on November 7, 2025 16:32
@noooop changed the title from "[Model] Allow users to control not using prefix caching per request." to "[Model] Allow users to control skip prefix caching per request." on Nov 7, 2025
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
@noooop changed the title from "[Model] Allow users to control skip prefix caching per request." to "[Model] Allow users to control skip reading caching per request." on Nov 8, 2025
@noooop changed the title from "[Model] Allow users to control skip reading caching per request." to "[Model] Allow users to control skip reading cache per request." on Nov 8, 2025
@noooop changed the title from "[Model] Allow users to control skip reading cache per request." to "[Core] Allow users to control skip reading cache per request." on Nov 8, 2025
Signed-off-by: wang.yuqi <[email protected]>
@noooop changed the title from "[Core] Allow users to control skip reading cache per request." to "[Model] Allow users to control skip reading cache per request." on Nov 8, 2025
return len(self._output_token_ids)

@property
def skip_reading_cache(self) -> bool:
A collaborator commented:

To accelerate it a bit, you can:

  1. Initialize the value of skip_reading_cache in process_inputs of vllm/v1/engine/processor.py
  2. Put it in EngineCoreRequest
  3. Copy it from EngineCoreRequest to Request via from_engine_core_request

In this way, skip_reading_cache will be computed in the frontend process rather than in the engine core busy loop. (There won't be much speedup, but I want to avoid unnecessary operations in the KVCacheManager as much as possible.)
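
A hedged sketch of the data flow this suggests; only the names process_inputs, EngineCoreRequest, Request, and from_engine_core_request come from the comment, and the field shapes below are simplified:

from dataclasses import dataclass

@dataclass
class EngineCoreRequest:
    request_id: str
    skip_reading_cache: bool

def process_inputs(request_id: str,
                   prompt_logprobs: int | None,
                   pooling_type: str | None) -> EngineCoreRequest:
    # 1. Compute the flag once in the frontend process.
    skip = prompt_logprobs is not None or pooling_type == "ALL"
    # 2. Carry it on the EngineCoreRequest sent to the engine core.
    return EngineCoreRequest(request_id=request_id, skip_reading_cache=skip)

@dataclass
class Request:
    request_id: str
    skip_reading_cache: bool

    @classmethod
    def from_engine_core_request(cls, req: EngineCoreRequest) -> "Request":
        # 3. Copy the precomputed flag so the KVCacheManager only reads it.
        return cls(request_id=req.request_id,
                   skip_reading_cache=req.skip_reading_cache)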

@noooop (Collaborator, Author) replied on Nov 11, 2025:

I think the get_skip_reading_prefix_cache logic best belongs in the Request.

I have cached the result of skip_reading_prefix_cache.
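
One common way to cache such a derived flag on the request (a sketch assuming the standard-library functools helper, not the exact code in the PR):

from functools import cached_property

class Request:
    def __init__(self, prompt_logprobs: int | None, pooling_type: str | None):
        self.prompt_logprobs = prompt_logprobs
        self.pooling_type = pooling_type

    @cached_property
    def skip_reading_prefix_cache(self) -> bool:
        # Computed on first access, then reused on every scheduling step.
        return self.prompt_logprobs is not None or self.pooling_type == "ALL"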

A collaborator replied:

Yeah, caching also makes sense to me.

@heheda12345 (Collaborator) commented:

Emmm... what about skip_reading_prefix_cache? There are too many types of cache in vLLM.

@noooop requested a review from heheda12345 on November 16, 2025 03:46
@heheda12345 (Collaborator) left a comment

LGTM!

@heheda12345 merged commit a55b646 into vllm-project:main on Nov 16, 2025
49 checks passed
@noooop deleted the disable_prefix_caching_per_request branch on November 16, 2025 08:19
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1
