
Conversation

@quanliu1991
Contributor

@quanliu1991 quanliu1991 commented May 30, 2025

Prefix caching enabled causes incorrect outputs due to stale num_new_tokens_cached in ComputedBlocksTracker

Description

We encountered an incorrect output issue when prefix caching is enabled. After investigation, we found that there's an inconsistency between the num_new_tokens_cached retrieved during schedule_prefills and the actual tokens computed and allocated in _allocate_and_set_running(seq_group).

This inconsistency leads to wrong block assignments in the prefix cache and subsequently incorrect inference results.


Root Cause Analysis

  1. In the schedule_prefills stage, the scheduler retrieves num_new_tokens_cached for a sequence and records it in ComputedBlocksTracker._seq_id_to_num_tokens_computed.
  2. However, if the sequence fails to be scheduled (e.g., due to lack of resources), the num_tokens_computed entry is not cleared from ComputedBlocksTracker.
  3. On the next scheduling attempt, the scheduler directly reads num_new_tokens_cached from the stale _seq_id_to_num_tokens_computed, rather than recalculating it from the actual prefix cached blocks.
  4. If the prefix cache blocks have changed during this period (e.g., from other sequences or eviction), the result becomes inconsistent and incorrect.
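
To make the failure mode concrete, here is a toy, self-contained sketch of the behavior described above. ToyComputedBlocksTracker is illustrative only; it is not the real ComputedBlocksTracker, but it mimics the "record once, never recompute" pattern that causes the bug:

# Illustrative toy only: shows how a stale num_tokens_computed entry can
# survive a failed scheduling attempt and poison the next one.
class ToyComputedBlocksTracker:
    def __init__(self):
        # seq_id -> number of prompt tokens believed to be prefix-cached
        self._seq_id_to_num_tokens_computed = {}

    def get_num_cached_tokens(self, seq_id, current_cached_tokens):
        # The value is recorded on the first query and reused afterwards.
        if seq_id not in self._seq_id_to_num_tokens_computed:
            self._seq_id_to_num_tokens_computed[seq_id] = current_cached_tokens
        return self._seq_id_to_num_tokens_computed[seq_id]

    def remove_seq(self, seq_id):
        # The fix: drop the entry so the next query recomputes it.
        self._seq_id_to_num_tokens_computed.pop(seq_id, None)


tracker = ToyComputedBlocksTracker()

# Attempt 1: 16 prompt tokens are prefix-cached, but the sequence fails to
# schedule (no budget) and stays in the waiting queue.
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=16) == 16

# Meanwhile other sequences evict those cached blocks; 0 tokens are cached now.

# Attempt 2: without the fix, the stale value 16 is still returned.
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=0) == 16

# With the fix, the entry is removed when scheduling fails, so the next
# attempt recomputes the value from the current prefix-cache state.
tracker.remove_seq("seq-0")
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=0) == 0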

Proposed Fix

This PR adds a call to free_seq_cached_tokens() in the following places to ensure consistency:

  • Before the break in _schedule_prefills when a sequence fails to be scheduled
  • In _schedule_priority_preemption when a waiting sequence has been queried by the _get_num_new_uncached_and_cached_tokens() method

This ensures that stale num_new_tokens_cached is removed, and future scheduling will correctly re-compute the value based on the current state of prefix cache blocks.
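
As a rough illustration (not the exact diff), the budget-check branch of _schedule_prefills would look roughly like this with the call added; variable names follow the v0 scheduler, and free_seq_cached_tokens is assumed here to be a block_manager method taking a single Sequence:

# Simplified sketch of the budget-check branch in _schedule_prefills with the
# proposed cleanup added; not the exact diff.
num_new_seqs = seq_group.get_max_num_running_seqs()
if num_new_tokens_uncached == 0 or not budget.can_schedule(
        num_new_tokens=num_new_tokens_uncached,
        num_new_seqs=num_new_seqs,
):
    # The sequence group stays in the waiting queue, so drop its stale
    # cached-token bookkeeping; the next attempt recomputes it from the
    # current prefix-cache state.
    for seq in seq_group.get_seqs(status=SequenceStatus.WAITING):
        self.block_manager.free_seq_cached_tokens(seq)
    break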


Reproduction

This issue can be reproduced under the following conditions:

  • Prefix caching is enabled
  • Sequences are frequently scheduled and preempted
  • Multiple sequences share similar prefix tokens but scheduling failures occur
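
A rough reproduction sketch along these lines (the model name, memory budget, and prompt shapes below are placeholders, not the exact setup from #18955):

# Hypothetical reproduction: many requests sharing a long prefix, with prefix
# caching enabled and a small KV-cache budget so scheduling failures and
# preemption happen often.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    enable_prefix_caching=True,
    gpu_memory_utilization=0.3,       # deliberately small to force preemption
)

shared_prefix = "You are a helpful assistant. " * 100
prompts = [f"{shared_prefix}Question {i}: what is {i} + {i}?"
           for i in range(256)]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))

# Without the fix, some outputs may be corrupted because stale cached-token
# counts lead to wrong prefix-cache block assignments.
for out in outputs:
    print(out.outputs[0].text[:60])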

Expected Behavior

Prefix cache blocks should match the actual token count computed for the sequence in ComputedBlocksTracker, and no stale cached token metadata should remain if scheduling fails.


Additional Notes

This fix ensures that cached token tracking is consistent with actual prefix cache state, especially under preemption or failure conditions, thus improving the correctness of prefix cache reuse logic.

FIX #18955

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

… stale num_new_tokens_cached in ComputedBlocksTracker

Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch from a4f3fa3 to 1451157 Compare May 30, 2025 13:30
Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 changed the title [Bugfix][core] Prefix caching enabled causes incorrect outputs [Bugfix][Core] Prefix caching enabled causes incorrect outputs Jun 1, 2025
@aarnphm
Collaborator

aarnphm commented Jun 1, 2025

Can you check if this behavior persists with v1? This is the v0 code path.

@quanliu1991
Contributor Author

quanliu1991 commented Jun 1, 2025

Can you check if this behavior persists with v1? This is the v0 code path.

I'll check this behavior against v1 to see if it exists there as well. If confirmed, I'll open a new issue or propose a fix with a PR.

@7d1-z

7d1-z commented Jun 3, 2025

@aarnphm I followed the steps mentioned in #18955 with VLLM_USE_V1=1; the problem does not exist there.

Collaborator

@aarnphm aarnphm left a comment

I think we can stamp this for v0 bugfix, cc @DarkLight1337

@DarkLight1337 DarkLight1337 requested a review from heheda12345 June 4, 2025 02:56
@heheda12345
Collaborator

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

Collaborator

@heheda12345 heheda12345 left a comment

Thanks for catching this bug. Can you write some unit tests for it?
And a few questions:

  1. Do we need to free in this branch?

    vllm/vllm/core/scheduler.py

    Lines 884 to 890 in af7fc84

    if (lora_int_id > 0 and (lora_int_id not in curr_loras)
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_swapped.appendleft(seq_group)
        swapped_queue.popleft()
        continue

    And I'm also not sure about this one:

    vllm/vllm/core/scheduler.py

    Lines 1143 to 1150 in af7fc84

    # We cannot mix sequence groups that use prompt embeds and
    # those that do not.
    if len(seq_groups) == 0:
        using_prompt_embeds = seq_group.uses_prompt_embeds()
    if using_prompt_embeds != seq_group.uses_prompt_embeds():
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue
  2. For requests marked as FINISHED_IGNORED, will ComputedBlocksTracker free them? If they are not freed, there may be a memory leak caused by the bug you report.
  3. Can you help analyze whether we need to free the tracker for the other calls of _get_num_new_uncached_and_cached_tokens in the scheduler?

self.block_tables[seq_id].free()
del self.block_tables[seq_id]

def free_seq_cached_tokens(self, seq: Sequence) -> None:
Collaborator

Suggested change
def free_seq_cached_tokens(self, seq: Sequence) -> None:
def remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:

free_seq_cached_tokens is a little ambiguous, as people may think this function frees the blocks of some cached tokens in the block_manager.
Does this name look better to you?

Contributor Author

OK, that looks better. I have renamed it.

Comment on lines 1699 to 1700
seq.reset_state_for_recompute()
self._free_seq_group_cross_attn_blocks(seq_group)
Collaborator

Why do we need these two lines?

Contributor Author

This code is not needed; I have deleted it.

@DarkLight1337
Member

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

From my understanding, this forces the scheduler to schedule the remaining requests in the next step. This is so that for each step, all the requests have either using_prompt_embeds=False or using_prompt_embeds=True, but not a mix of both. cc @qthequartermasterman

@heheda12345
Collaborator

Thanks @DarkLight1337. @quanliu1991 Then I think we may need to add the free logic to this code block too.
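
Presumably something along these lines (a sketch only, using the renamed helper; the exact change is whatever lands in the PR):

# Sketch: the prompt-embeds mixing branch in _schedule_prefills, with the
# tracker entry removed before the sequence group is deferred to a later step.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    # The group stays waiting, so its cached-token entry must not survive.
    self.remove_seq_from_computed_blocks_tracker(
        seq_group, SequenceStatus.WAITING)
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue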

@qthequartermasterman
Contributor

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

From my understanding, this forces the scheduler to schedule the remaining requests in the next step. This is so that for each step, all the requests have either using_prompt_embeds=False or using_prompt_embeds=True, but not a mix of both. cc @qthequartermasterman

This understanding is correct. Prompt-embeds requests and token-id requests must remain segregated because they use different paths in the model forward pass, and thus different compiled graphs.

I'd be curious to see whether this PR fixes the intermittent issue I've noticed where enabling prompt embeds and prefix caching simultaneously causes an occasional crash on very long prompts. I haven't figured out why it sometimes happens, because (a) for my particular use case, my workaround is just to disable prefix caching or limit the max length, and (b) with the imminent deprecation of v0, I've been focusing on how to port the prompt embeds functionality to v1. That's a different conversation from this PR, though.

@quanliu1991
Contributor Author

@heheda12345 Thanks for the review! I’ll work on adding unit tests.

  1. In _schedule_swapped, we don't call _get_num_new_uncached_and_cached_tokens, so there's no need to free.

    vllm/vllm/core/scheduler.py

    Lines 884 to 890 in af7fc84

    if (lora_int_id > 0 and (lora_int_id not in curr_loras)
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_swapped.appendleft(seq_group)
        swapped_queue.popleft()
        continue

    In _schedule_prefills, there are two branches where freeing is needed: they call _get_num_new_uncached_and_cached_tokens and the seq_group stays in the waiting_queue, so we need to free it there.

    vllm/vllm/core/scheduler.py

    Lines 1143 to 1164 in af7fc84

    # We cannot mix sequence groups that use prompt embeds and
    # those that do not.
    if len(seq_groups) == 0:
        using_prompt_embeds = seq_group.uses_prompt_embeds()
    if using_prompt_embeds != seq_group.uses_prompt_embeds():
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue
    lora_int_id = 0
    if self.lora_enabled:
        lora_int_id = seq_group.lora_int_id
        assert curr_loras is not None
        assert self.lora_config is not None
        if (self.lora_enabled and lora_int_id > 0
                and lora_int_id not in curr_loras
                and len(curr_loras) >= self.lora_config.max_loras):
            # We don't have a space for another LoRA, so
            # we ignore this request for now.
            leftover_waiting_sequences.appendleft(seq_group)
            waiting_queue.popleft()
            continue
  2. That’s a great point. I hadn’t considered the memory leak issue here. You're right, we should free it.
  3. Got it. I'll do a full check to see if there are any other places where we're missing a free.

@heheda12345
Collaborator

@quanliu1991 Sorry, I marked the wrong lines for LoRA in my question 1. It should be this piece of LoRA-related code:

vllm/vllm/core/scheduler.py

Lines 1153 to 1164 in af7fc84

if self.lora_enabled:
    lora_int_id = seq_group.lora_int_id
    assert curr_loras is not None
    assert self.lora_config is not None
    if (self.lora_enabled and lora_int_id > 0
            and lora_int_id not in curr_loras
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue

@aarnphm aarnphm self-requested a review June 5, 2025 22:35
Collaborator

@aarnphm aarnphm left a comment

I will wait for Chen's comments to be resolved before merging this in.

刘全 added 2 commits June 8, 2025 19:32
… stale num_new_tokens_cached in ComputedBlocksTracker

Signed-off-by: 刘全 <[email protected]>
@quanliu1991
Contributor Author

@aarnphm @heheda12345 We have done a comprehensive check of all _get_num_new_uncached_and_cached_tokens call sites within the schedule method, and added remove_seq calls in 5 additional branches.

  1. Summary of the 9 branches where remove_seq_from_computed_blocks_tracker() was added:
  • _schedule_swapped:

    vllm/vllm/core/scheduler.py

    Lines 900 to 904 in eaa2e51

    if num_new_tokens_uncached == 0 or not budget.can_schedule(
            num_new_tokens=num_new_tokens_uncached,
            num_new_seqs=num_new_seqs,
    ):
        break

  • _schedule_priority_preemption:
    Added remove_seq for the waiting queue.

  • _schedule_prefills:
    Added remove_seq for 3 break branches.
    Added remove_seq for 2 continue branches where leftover_waiting_sequences.appendleft(seq_group) is called.
    Added remove_seq for 2 branches marked with FINISHED_IGNORED status.

  2. Regarding FINISHED_IGNORED:
    We did not find any place where seq_groups marked as FINISHED_IGNORED are freed.
    In contrast, we did confirm that sequences in the running queue that finish are freed, as are those in _async_stopped.
    Therefore, we added a remove_seq call for FINISHED_IGNORED sequences (a sketch of that branch follows this list).

  3. On _schedule_running and the "No budget => Stop" case:
    I think the blocks used by seq_groups in the running queue won't be evicted, so I didn't add a remove_seq call.
    But I'm still not sure when the "No budget => Stop" case would occur.

    vllm/vllm/core/scheduler.py

    Lines 712 to 724 in eaa2e51

    num_uncached_new_tokens, _ = \
        self._get_num_new_uncached_and_cached_tokens(
            seq_group,
            SequenceStatus.RUNNING,
            enable_chunking,
            budget,
            partial_prefill_metadata,
        )
    num_running_tokens = num_uncached_new_tokens
    if num_running_tokens == 0:
        # No budget => Stop
        break
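
For the FINISHED_IGNORED case mentioned in point 2, the added cleanup sits roughly as follows. This is a sketch with names approximating the v0 scheduler (prompt_limit, waiting_seqs, ignored_seq_groups), not the exact diff:

# Rough shape of one FINISHED_IGNORED branch in _schedule_prefills: the
# prompt is longer than the limit, so the group is ignored and never reaches
# _allocate_and_set_running. Nothing downstream frees it, hence the cleanup.
if num_new_tokens_uncached + num_new_tokens_cached > prompt_limit:
    logger.warning(
        "Input prompt (%d tokens) is too long and exceeds limit of %d",
        num_new_tokens_uncached + num_new_tokens_cached, prompt_limit)
    # Drop the cached-token bookkeeping while the seqs are still WAITING.
    self.remove_seq_from_computed_blocks_tracker(
        seq_group, SequenceStatus.WAITING)
    for seq in waiting_seqs:
        seq.status = SequenceStatus.FINISHED_IGNORED
    ignored_seq_groups.append(seq_group)
    waiting_queue.popleft()
    continue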

@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch 3 times, most recently from e1301f5 to fa5c763 Compare June 11, 2025 02:00
Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch from fa5c763 to e6e04cf Compare June 11, 2025 02:18
Collaborator

@heheda12345 heheda12345 left a comment

LGTM! Thanks for the detailed analysis.
@zh-jp @qthequartermasterman Does this PR fix your problem?

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 15, 2025
@7d1-z

7d1-z commented Jun 15, 2025

@heheda12345 Sure! Thanks for your review!

@heheda12345 heheda12345 changed the title [Bugfix][Core] Prefix caching enabled causes incorrect outputs [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker Jun 15, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) June 15, 2025 13:42
@simon-mo simon-mo disabled auto-merge June 16, 2025 04:56
@simon-mo simon-mo merged commit 92183b4 into vllm-project:main Jun 16, 2025
72 of 76 checks passed
@noooop
Collaborator

noooop commented Jun 16, 2025

@quanliu1991

This PR will break pooling (Embedding, CrossEncoding, ...) models:

from vllm import LLM

model = LLM(model="intfloat/e5-small", task="embed")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] * 1000

outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    pass

Error:

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1277, in Scheduler._schedule_default(self)
   1275 # If any requests are swapped, prioritized swapped requests.
   1276 if not self.swapped:
-> 1277     prefills = self._schedule_prefills(budget,
   1278                                        curr_loras,
   1279                                        enable_chunking=False)
   1281 if len(prefills.seq_groups
   1282        ) == 0 and self.scheduler_config.policy == "priority":
   1283     self._schedule_priority_preemption(budget)

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1195, in Scheduler._schedule_prefills(self, budget, curr_loras, enable_chunking, partial_prefill_metadata)
   1190 num_new_seqs = seq_group.get_max_num_running_seqs()
   1191 if num_new_tokens_uncached == 0 or not budget.can_schedule(
   1192         num_new_tokens=num_new_tokens_uncached,
   1193         num_new_seqs=num_new_seqs,
   1194 ):
-> 1195     self.remove_seq_from_computed_blocks_tracker(
   1196         seq_group, SequenceStatus.WAITING)
   1197     break
   1199 # Can schedule this request.

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1715, in Scheduler.remove_seq_from_computed_blocks_tracker(self, seq_group, status)
   1713 seqs = seq_group.get_seqs(status=status)
   1714 for seq in seqs:
-> 1715     self._remove_seq_from_computed_blocks_tracker(seq)

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1722, in Scheduler._remove_seq_from_computed_blocks_tracker(self, seq)
   1717 def _remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:
   1718     """
   1719     Free a sequence computed blocks tracker _seq_id_to_blocks_hashes
   1720     and _seq_id_to_num_tokens_computed.
   1721     """
-> 1722     self.block_manager.remove_seq_from_computed_blocks_tracker(seq)

AttributeError: 'PlaceholderBlockSpaceManager' object has no attribute 'remove_seq_from_computed_blocks_tracker'

PTAL https://buildkite.com/vllm/ci/builds/22110#01977721-962e-413e-982f-5211bbecb3d6

@quanliu1991
Contributor Author

@noooop Sorry, PlaceholderBlockSpaceManager lacks this method. I need to reconsider the scope in which this method is used and adjust the fix accordingly.
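
One possible follow-up, sketched here under the assumption that a no-op is acceptable for pooling models (the actual fix may differ), is to give PlaceholderBlockSpaceManager the missing method:

# Sketch: a defensive no-op so the scheduler can call the new hook even when
# the placeholder block manager (used by pooling/embedding models) is active.
class PlaceholderBlockSpaceManager:
    """Stripped-down stand-in for the real placeholder manager."""

    def remove_seq_from_computed_blocks_tracker(self, seq) -> None:
        # No prefix-cache bookkeeping exists here, so there is nothing to remove.
        return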

Successfully merging this pull request may close these issues.

[Bug]: Under high concurrency, kvcache will be tampered with, causing duplicate characters or gibberish in subsequent request results