[Model] Allow users to control skip reading cache per request. #28194
Conversation
Code Review
This pull request introduces a mechanism to disable prefix caching on a per-request basis for scenarios where it can lead to incorrect output, specifically for requests involving prompt_logprobs or certain pooling tasks. The changes correctly add a not_using_prefix_caching flag to PoolingParams and SamplingParams and use it in the KVCacheManager. However, there is a critical bug in the implementation for SamplingParams where the logic to set this flag is inverted, which would cause significant performance degradation and incorrect behavior. I have provided a comment with a suggested fix for this critical issue.
💡 Codex Review
Here are some automated review suggestions for this pull request.
/gemini review

cc @DarkLight1337
Code Review
This pull request introduces a mechanism to control prefix caching on a per-request basis, which is necessary for features like prompt logprobs and 'all' pooling that are incompatible with prefix caching. The changes involve adding a disable_prefix_caching flag to SamplingParams and PoolingParams and updating the KV cache manager to respect this flag. The implementation for pooling parameters is correct, but there is a critical logic error in how the flag is set for sampling parameters, which I've commented on. The rest of the changes and the overall approach look good.
Signed-off-by: wang.yuqi <[email protected]>
Force-pushed b3fd081 to 363d087
vllm/v1/request.py (Outdated)

    return len(self._output_token_ids)

@property
def skip_reading_cache(self) -> bool:
To accelerate it a bit, you can:
- initialize the value of skip_reading_cache in process_inputs of vllm/v1/engine/processor.py
- put it in EngineCoreRequest
- copy it from EngineCoreRequest to Request via from_engine_core_request
This way, skip_reading_cache will be computed in the frontend process rather than in the engine core busy loop. (Though there won't be much speedup, I want to avoid unnecessary operations in KVCacheManager as much as possible.)
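The suggestion above can be sketched as follows. The field and function names are hypothetical simplifications of `process_inputs`, `EngineCoreRequest`, and `Request.from_engine_core_request`, not the real vLLM signatures; the point is that the flag is derived once in the frontend and merely copied in the engine core.

```python
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:
    prompt_token_ids: list[int]
    # Computed once in the frontend (process_inputs), so the engine
    # core busy loop never has to re-derive it.
    skip_reading_cache: bool = False


class Request:
    def __init__(self, prompt_token_ids: list[int], skip_reading_cache: bool):
        self.prompt_token_ids = prompt_token_ids
        self.skip_reading_cache = skip_reading_cache

    @classmethod
    def from_engine_core_request(cls, req: EngineCoreRequest) -> "Request":
        # Copy the precomputed flag instead of recomputing it here.
        return cls(req.prompt_token_ids, req.skip_reading_cache)


def process_inputs(
    prompt_token_ids: list[int], wants_prompt_logprobs: bool
) -> EngineCoreRequest:
    # Frontend process: derive the flag once per request.
    return EngineCoreRequest(
        prompt_token_ids, skip_reading_cache=wants_prompt_logprobs
    )
```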
I think the get_skip_reading_prefix_cache logic best belongs in the Request.
I have cached the result of skip_reading_prefix_cache.
Yeah, caching also makes sense to me.
emmm... what about
heheda12345
left a comment
LGTM!
…project#28194)
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: Bram Wasti <[email protected]>
Purpose
Even if prefix caching is enabled, the cache must not be read in the following two scenarios:
- requests with prompt_logprobs
- pooling requests using ALL pooling
Otherwise, the output might contain fewer than n_prompt_tokens entries.
A request that skips reading the cache can still write to the cache, to accelerate subsequent requests.
Address #27145 (comment)
Test Plan
tests/models/language/pooling/test_extract_hidden_states.py
Test Result
main:
Even if chunked_prefill is not enabled, prefix_caching + ALL pooling still causes the following error:
(EngineCore_DP0 pid=2893225) AssertionError: partial prefill not supported with ALL pooling
this PR:
Prefix cache reading is turned off for ALL pooling requests.
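The before/after behavior can be sketched with a toy scheduler function (hypothetical, not the actual test in tests/models/language/pooling/test_extract_hidden_states.py): with the flag, an ALL-pooling request computes its full prompt instead of tripping the partial-prefill assertion on a cache hit.

```python
def schedule_prompt(
    tokens: list[int], cached_prefix_len: int, pooling_task: str
) -> int:
    """Return how many prompt tokens will actually be computed."""
    # This PR's behavior: ALL pooling skips reading the prefix cache.
    skip_reading_cache = pooling_task == "all"
    hit = 0 if skip_reading_cache else cached_prefix_len
    n_computed = len(tokens) - hit
    # ALL pooling needs hidden states for every prompt token; a cache
    # hit would make this a partial prefill and trigger the assertion
    # seen on main.
    assert not (pooling_task == "all" and n_computed < len(tokens)), (
        "partial prefill not supported with ALL pooling"
    )
    return n_computed
```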
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.