[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker #18957

Conversation
… stale num_new_tokens_cached in ComputedBlocksTracker Signed-off-by: 刘全 <[email protected]>
Force-pushed from a4f3fa3 to 1451157
Signed-off-by: 刘全 <[email protected]>
Can you check if this behavior persists with v1? This is the v0 codepath.
I'll check this behavior against v1 to see if it exists there as well. If confirmed, I'll open a new issue or propose a fix with a PR.
I think we can stamp this for v0 bugfix, cc @DarkLight1337
Hi @DarkLight1337, regarding Lines 1143 to 1150 in af7fc84:
For this part, if the `if using_prompt_embeds != seq_group.uses_prompt_embeds():` branch is taken, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?
Thanks for catching this bug. Can you write some unit tests for it?
And some questions:
1. Do we need to free in this branch?
Lines 884 to 890 in af7fc84
```python
if (lora_int_id > 0 and (lora_int_id not in curr_loras)
        and len(curr_loras) >= self.lora_config.max_loras):
    # We don't have a space for another LoRA, so
    # we ignore this request for now.
    leftover_swapped.appendleft(seq_group)
    swapped_queue.popleft()
    continue
```
And I'm also not sure about here:
Lines 1143 to 1150 in af7fc84
```python
# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue
```
2. For requests marked as `FINISHED_IGNORED`, will `ComputedBlocksTracker` free them? If they are not freed, there may be a memory leak due to the bug you report.
3. Can you help analyze whether we need to free the tracker for the other calls of `_get_num_new_uncached_and_cached_tokens` in the scheduler?
vllm/core/block_manager.py
Outdated
```python
        self.block_tables[seq_id].free()
        del self.block_tables[seq_id]

    def free_seq_cached_tokens(self, seq: Sequence) -> None:
```
Suggested change:
```diff
-    def free_seq_cached_tokens(self, seq: Sequence) -> None:
+    def remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:
```
`free_seq_cached_tokens` is a little ambiguous, as people may think this function frees the blocks of some cached tokens in the block manager.
Does this function name look better to you?
OK, that looks good. I have renamed it.
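For reference, a minimal sketch of the renamed helper on the block manager, assuming `ComputedBlocksTracker` exposes a `remove_seq(seq_id)` method as the discussion above implies; the actual code in this PR may differ:

```python
def remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:
    # Drop only this sequence's bookkeeping (e.g. the memoized
    # num_tokens_computed value) so it is recomputed from the real
    # prefix-cache state on the next scheduling attempt.
    # No physical blocks are freed here.
    self._computed_blocks_tracker.remove_seq(seq.seq_id)
```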
vllm/core/scheduler.py
Outdated
```python
seq.reset_state_for_recompute()
self._free_seq_group_cross_attn_blocks(seq_group)
```
why do we need these two lines?
The code is not needed; I have deleted it.
From my understanding, this forces the scheduler to schedule the remaining requests in the next step. This is so that within each step, all the requests either use prompt embeds or none of them do.
Thanks @DarkLight1337. @quanliu1991 Then I think we may need to add the free logic to this code block too.
This understanding is correct. Prompt embeds and token ids requests must remain segregated because they use different paths in the model forward pass, and thus different compiled graphs. I'd be curious to see if this PR fixes the intermittent issue I've noticed where enabling prompt embeds and prefix caching simultaneously causes an occasional crash on very long prompts. I haven't figured out why it sometimes happens, because (a) for my particular use case, my workaround is just to disable prefix caching or limit the max length, and (b) with the imminent deprecation of v0, I've been focusing on how to port the prompt embeds functionality to v1. That's a different conversation from this PR, though.
@heheda12345 Thanks for the review! I’ll work on adding unit tests.
@quanliu1991 Sorry that I marked the wrong line for LoRA in my question 1. It should be this piece of LoRA-related code: Lines 1153 to 1164 in af7fc84
I will wait until Chen's comments are resolved before merging this in.
… stale num_new_tokens_cached in ComputedBlocksTracker Signed-off-by: 刘全 <[email protected]>
Signed-off-by: 刘全 <[email protected]>
@aarnphm @heheda12345 We have done a comprehensive check of all calls of `_get_num_new_uncached_and_cached_tokens` in the scheduler.
Force-pushed from e1301f5 to fa5c763
Signed-off-by: 刘全 <[email protected]>
Force-pushed from fa5c763 to e6e04cf
LGTM! Thanks for the detailed analysis.
@zh-jp @qthequartermasterman Does this PR fix your problem?
@heheda12345 Sure! Thanks for your review!
This PR will break pooling (Embedding, CrossEncoding, ...) models:

```python
from vllm import LLM

model = LLM(model="intfloat/e5-small", task="embed")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] * 1000

outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    pass
```

Error: PTAL https://buildkite.com/vllm/ci/builds/22110#01977721-962e-413e-982f-5211bbecb3d6
@noooop Sorry, `PlaceholderBlockSpaceManager` lacks this method. I need to consider the scope in which this method is used and make the appropriate modifications.
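One possible way to unbreak the pooling path (a sketch only, assuming a no-op is acceptable for block managers that never track computed blocks; this is not necessarily what the PR ends up doing) is to add a stub to `PlaceholderBlockSpaceManager`:

```python
def remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:
    # PlaceholderBlockSpaceManager manages no KV cache blocks
    # (e.g. for pooling/embedding models), so there is no
    # ComputedBlocksTracker state to clean up.
    return
```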
Prefix caching enabled causes incorrect outputs due to stale `num_new_tokens_cached` in `ComputedBlocksTracker`
Description
We encountered an incorrect output issue when prefix caching is enabled. After investigation, we found that there's an inconsistency between the `num_new_tokens_cached` retrieved during `schedule_prefills` and the actual tokens computed and allocated in `_allocate_and_set_running(seq_group)`. This inconsistency leads to wrong block assignments in the prefix cache and subsequently incorrect inference results.
Root Cause Analysis
1. During the `schedule_prefills` stage, the scheduler retrieves `num_new_tokens_cached` for a sequence and records it in `ComputedBlocksTracker._seq_id_to_num_tokens_computed`.
2. If the sequence then fails to be scheduled, its `num_tokens_computed` entry is not cleared from `ComputedBlocksTracker`.
3. On the next scheduling attempt, the scheduler reads `num_new_tokens_cached` from the stale `_seq_id_to_num_tokens_computed`, rather than recalculating it from the actual prefix cached blocks.
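A toy illustration of the failure mode described above (simplified stand-in code, not the vLLM implementation; only the dictionary name mirrors the real tracker):

```python
class ToyComputedBlocksTracker:
    """Simplified stand-in for ComputedBlocksTracker to show the staleness."""

    def __init__(self):
        self._seq_id_to_num_tokens_computed = {}

    def get_num_cached_tokens(self, seq_id, compute_fn):
        # Memoizes the cached-token count; if the prefix cache changes
        # after a failed scheduling attempt, the memoized value goes stale.
        if seq_id not in self._seq_id_to_num_tokens_computed:
            self._seq_id_to_num_tokens_computed[seq_id] = compute_fn()
        return self._seq_id_to_num_tokens_computed[seq_id]

    def remove_seq(self, seq_id):
        self._seq_id_to_num_tokens_computed.pop(seq_id, None)


tracker = ToyComputedBlocksTracker()

# Attempt 1: 16 tokens hit the prefix cache, but the request is not
# actually scheduled (e.g. the budget is exhausted and the loop breaks).
assert tracker.get_num_cached_tokens(0, lambda: 16) == 16

# The cached blocks are then evicted; a fresh computation would return 0,
# but without cleanup the next attempt still sees the stale value.
assert tracker.get_num_cached_tokens(0, lambda: 0) == 16

# The fix: drop the entry when scheduling fails, forcing recomputation.
tracker.remove_seq(0)
assert tracker.get_num_cached_tokens(0, lambda: 0) == 0
```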
Proposed Fix
We submitted a PR that adds a call to `free_seq_cached_tokens()` in the following places to ensure consistency:
- at the `break` in `_schedule_prefills` when a sequence fails to be scheduled
- in `_schedule_priority_preemption`, when a waiting sequence has been queried by the `_get_num_new_uncached_and_cached_tokens()` method
This ensures that the stale `num_new_tokens_cached` is removed, and future scheduling will correctly re-compute the value based on the current state of the prefix cache blocks.
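For illustration, a sketch of where such a cleanup call could sit on the `_schedule_prefills` failure path (hypothetical helper name `_on_prefill_schedule_failure`; the real diff in this PR may place the call differently):

```python
def _on_prefill_schedule_failure(self, seq_group) -> None:
    # Called when a waiting seq_group cannot be scheduled in this step,
    # e.g. right before the `break` in _schedule_prefills.
    for seq in seq_group.get_seqs():
        # Drop the stale num_new_tokens_cached entry so the next attempt
        # recomputes it from the current prefix-cache blocks.
        self.block_manager.remove_seq_from_computed_blocks_tracker(seq)
```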
Reproduction
This issue can be reproduced under the following conditions:
Expected Behavior
Prefix cache blocks should match the actual token count computed for the sequence in `ComputedBlocksTracker`, and no stale cached-token metadata should remain if scheduling fails.
Additional Notes
This fix ensures that cached token tracking is consistent with actual prefix cache state, especially under preemption or failure conditions, thus improving the correctness of prefix cache reuse logic.
FIX #18955