
Conversation

@quanliu1991
Contributor

@quanliu1991 quanliu1991 commented May 30, 2025

Prefix caching enabled causes incorrect outputs due to stale num_new_tokens_cached in ComputedBlocksTracker

Description

We encountered an incorrect output issue when prefix caching is enabled. After investigation, we found that there's an inconsistency between the num_new_tokens_cached retrieved during schedule_prefills and the actual tokens computed and allocated in _allocate_and_set_running(seq_group).

This inconsistency leads to wrong block assignments in the prefix cache and subsequently incorrect inference results.


Root Cause Analysis

  1. In the schedule_prefills stage, the scheduler retrieves num_new_tokens_cached for a sequence and records it in ComputedBlocksTracker._seq_id_to_num_tokens_computed.
  2. However, if the sequence fails to be scheduled (e.g., due to lack of resources), the num_tokens_computed entry is not cleared from ComputedBlocksTracker.
  3. On the next scheduling attempt, the scheduler directly reads num_new_tokens_cached from the stale _seq_id_to_num_tokens_computed, rather than recalculating it from the actual prefix cached blocks.
  4. If the prefix cache blocks have changed during this period (e.g., from other sequences or eviction), the result becomes inconsistent and incorrect.
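
To make the failure mode concrete, here is a toy, self-contained sketch of the behavior described above. ToyComputedBlocksTracker is illustrative only; it is not the real ComputedBlocksTracker, but it mimics the "record once, never recompute" pattern that causes the bug:

# Illustrative toy only: shows how a stale num_tokens_computed entry can
# survive a failed scheduling attempt and poison the next one.
class ToyComputedBlocksTracker:
    def __init__(self):
        # seq_id -> number of prompt tokens believed to be prefix-cached
        self._seq_id_to_num_tokens_computed = {}

    def get_num_cached_tokens(self, seq_id, current_cached_tokens):
        # The value is recorded on the first query and reused afterwards.
        if seq_id not in self._seq_id_to_num_tokens_computed:
            self._seq_id_to_num_tokens_computed[seq_id] = current_cached_tokens
        return self._seq_id_to_num_tokens_computed[seq_id]

    def remove_seq(self, seq_id):
        # The fix: drop the entry so the next query recomputes it.
        self._seq_id_to_num_tokens_computed.pop(seq_id, None)


tracker = ToyComputedBlocksTracker()

# Attempt 1: 16 prompt tokens are prefix-cached, but the sequence fails to
# schedule (no budget) and stays in the waiting queue.
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=16) == 16

# Meanwhile other sequences evict those cached blocks; 0 tokens are cached now.

# Attempt 2: without the fix, the stale value 16 is still returned.
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=0) == 16

# With the fix, the entry is removed when scheduling fails, so the next
# attempt recomputes the value from the current prefix-cache state.
tracker.remove_seq("seq-0")
assert tracker.get_num_cached_tokens("seq-0", current_cached_tokens=0) == 0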

Proposed Fix

This PR adds a call to free_seq_cached_tokens() in the following places to ensure consistency:

  • Before the break in _schedule_prefills when a sequence fails to be scheduled
  • In _schedule_priority_preemption when a waiting sequence has been queried by the _get_num_new_uncached_and_cached_tokens() method

This ensures that stale num_new_tokens_cached is removed, and future scheduling will correctly re-compute the value based on the current state of prefix cache blocks.
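
As a rough illustration (not the exact diff), the budget-check branch of _schedule_prefills would look roughly like this with the call added; variable names follow the v0 scheduler, and free_seq_cached_tokens is assumed here to be a block_manager method taking a single Sequence:

# Simplified sketch of the budget-check branch in _schedule_prefills with the
# proposed cleanup added; not the exact diff.
num_new_seqs = seq_group.get_max_num_running_seqs()
if num_new_tokens_uncached == 0 or not budget.can_schedule(
        num_new_tokens=num_new_tokens_uncached,
        num_new_seqs=num_new_seqs,
):
    # The sequence group stays in the waiting queue, so drop its stale
    # cached-token bookkeeping; the next attempt recomputes it from the
    # current prefix-cache state.
    for seq in seq_group.get_seqs(status=SequenceStatus.WAITING):
        self.block_manager.free_seq_cached_tokens(seq)
    break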


Reproduction

This issue can be reproduced under the following conditions:

  • Prefix caching is enabled
  • Sequences are frequently scheduled and preempted
  • Multiple sequences share similar prefix tokens but scheduling failures occur
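
A rough reproduction sketch along these lines (the model name, memory budget, and prompt shapes below are placeholders, not the exact setup from #18955):

# Hypothetical reproduction: many requests sharing a long prefix, with prefix
# caching enabled and a small KV-cache budget so scheduling failures and
# preemption happen often.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    enable_prefix_caching=True,
    gpu_memory_utilization=0.3,       # deliberately small to force preemption
)

shared_prefix = "You are a helpful assistant. " * 100
prompts = [f"{shared_prefix}Question {i}: what is {i} + {i}?"
           for i in range(256)]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))

# Without the fix, some outputs may be corrupted because stale cached-token
# counts lead to wrong prefix-cache block assignments.
for out in outputs:
    print(out.outputs[0].text[:60])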

Expected Behavior

Prefix cache blocks should match the actual token count computed for the sequence in ComputedBlocksTracker, and no stale cached token metadata should remain if scheduling fails.


Additional Notes

This fix ensures that cached token tracking is consistent with actual prefix cache state, especially under preemption or failure conditions, thus improving the correctness of prefix cache reuse logic.

FIX #18955

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

… stale num_new_tokens_cached in ComputedBlocksTracker

Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch from a4f3fa3 to 1451157 Compare May 30, 2025 13:30
Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 changed the title [Bugfix][core] Prefix caching enabled causes incorrect outputs [Bugfix][Core] Prefix caching enabled causes incorrect outputs Jun 1, 2025
@aarnphm
Collaborator

aarnphm commented Jun 1, 2025

Can you check if this behavior persists with v1? This is the v0 code path.

@quanliu1991
Contributor Author

quanliu1991 commented Jun 1, 2025

Can you check if this behavior persists with v1? This is the v0 code path.

I'll check this behavior against v1 to see if it exists there as well. If confirmed, I'll open a new issue or propose a fix with a PR.

@7d1-z

7d1-z commented Jun 3, 2025

@aarnphm I followed the steps mentioned in #18955 with VLLM_USE_V1=1; the problem does not exist there.

Collaborator

@aarnphm aarnphm left a comment

I think we can stamp this for v0 bugfix, cc @DarkLight1337

@DarkLight1337 DarkLight1337 requested a review from heheda12345 June 4, 2025 02:56
@heheda12345
Collaborator

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

Collaborator

@heheda12345 heheda12345 left a comment

Thanks for catching this bug. Can you write some unit tests for it?
And a few questions:

  1. Do we need to free in this branch?

    vllm/vllm/core/scheduler.py

    Lines 884 to 890 in af7fc84

    if (lora_int_id > 0 and (lora_int_id not in curr_loras)
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_swapped.appendleft(seq_group)
        swapped_queue.popleft()
        continue

    And I'm also not sure about this one:

    vllm/vllm/core/scheduler.py

    Lines 1143 to 1150 in af7fc84

    # We cannot mix sequence groups that use prompt embeds and
    # those that do not.
    if len(seq_groups) == 0:
        using_prompt_embeds = seq_group.uses_prompt_embeds()
    if using_prompt_embeds != seq_group.uses_prompt_embeds():
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue
  2. For requests marked as FINISHED_IGNORED, will ComputedBlocksTracker free them? If they are not freed, there may be a memory leak caused by the bug you report.
  3. Can you help analyze whether we need to free the tracker for the other calls of _get_num_new_uncached_and_cached_tokens in the scheduler?

self.block_tables[seq_id].free()
del self.block_tables[seq_id]

def free_seq_cached_tokens(self, seq: Sequence) -> None:
Collaborator

Suggested change
def free_seq_cached_tokens(self, seq: Sequence) -> None:
def remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:

free_seq_cached_tokens is a little ambiguous, as people may think this function frees the blocks of some cached tokens in the block_manager.
Does this name look better to you?

Contributor Author

OK, that looks better. I have renamed it.

Comment on lines 1699 to 1700
seq.reset_state_for_recompute()
self._free_seq_group_cross_attn_blocks(seq_group)
Collaborator

Why do we need these two lines?

Contributor Author

This code is not needed; I have deleted it.

@DarkLight1337
Member

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

From my understanding, this forces the scheduler to schedule the remaining requests in the next step. This is so that for each step, all the requests have either using_prompt_embeds=False or using_prompt_embeds=True, but not a mix of both. cc @qthequartermasterman

@heheda12345
Collaborator

Thanks @DarkLight1337. @quanliu1991 Then I think we may need to add the free logic to this code block too.
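
Presumably something along these lines (a sketch only, using the renamed helper; the exact change is whatever lands in the PR):

# Sketch: the prompt-embeds mixing branch in _schedule_prefills, with the
# tracker entry removed before the sequence group is deferred to a later step.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    # The group stays waiting, so its cached-token entry must not survive.
    self.remove_seq_from_computed_blocks_tracker(
        seq_group, SequenceStatus.WAITING)
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue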

@qthequartermasterman
Contributor

Hi @DarkLight1337 ,

vllm/vllm/core/scheduler.py

Lines 1143 to 1150 in af7fc84

# We cannot mix sequence groups that use prompt embeds and
# those that do not.
if len(seq_groups) == 0:
    using_prompt_embeds = seq_group.uses_prompt_embeds()
if using_prompt_embeds != seq_group.uses_prompt_embeds():
    leftover_waiting_sequences.appendleft(seq_group)
    waiting_queue.popleft()
    continue

For this part, if the branch if using_prompt_embeds != seq_group.uses_prompt_embeds(): is hit, how will the request be processed? Will the scheduler try to schedule it again or mark it as finished?

From my understanding, this forces the scheduler to schedule the remaining requests in the next step. This is so that for each step, all the requests have either using_prompt_embeds=False or using_prompt_embeds=True, but not a mix of both. cc @qthequartermasterman

This understanding is correct. Prompt-embeds requests and token-id requests must remain segregated because they use different paths in the model forward pass, and thus different compiled graphs.

I'd be curious to see whether this PR fixes the intermittent issue I've noticed where enabling prompt embeds and prefix caching simultaneously causes an occasional crash on very long prompts. I haven't figured out why it sometimes happens, because (a) for my particular use case, my workaround is just to disable prefix caching or limit the max length, and (b) with the imminent deprecation of v0, I've been focusing on how to port the prompt embeds functionality to v1. That's a different conversation from this PR, though.

@quanliu1991
Contributor Author

@heheda12345 Thanks for the review! I’ll work on adding unit tests.

  1. In _schedule_swapped, we don't call _get_num_new_uncached_and_cached_tokens, so there's no need to free.

    vllm/vllm/core/scheduler.py

    Lines 884 to 890 in af7fc84

    if (lora_int_id > 0 and (lora_int_id not in curr_loras)
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_swapped.appendleft(seq_group)
        swapped_queue.popleft()
        continue

    In _schedule_prefills, there are two branches where freeing is needed: they call _get_num_new_uncached_and_cached_tokens and the seq_group stays in the waiting_queue, so we need to free it there.

    vllm/vllm/core/scheduler.py

    Lines 1143 to 1164 in af7fc84

    # We cannot mix sequence groups that use prompt embeds and
    # those that do not.
    if len(seq_groups) == 0:
        using_prompt_embeds = seq_group.uses_prompt_embeds()
    if using_prompt_embeds != seq_group.uses_prompt_embeds():
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue
    lora_int_id = 0
    if self.lora_enabled:
        lora_int_id = seq_group.lora_int_id
        assert curr_loras is not None
        assert self.lora_config is not None
        if (self.lora_enabled and lora_int_id > 0
                and lora_int_id not in curr_loras
                and len(curr_loras) >= self.lora_config.max_loras):
            # We don't have a space for another LoRA, so
            # we ignore this request for now.
            leftover_waiting_sequences.appendleft(seq_group)
            waiting_queue.popleft()
            continue
  2. That’s a great point. I hadn’t considered the memory leak issue here. You're right, we should free it.
  3. Got it. I'll do a full check to see if there are any other places where we're missing a free.

@heheda12345
Collaborator

@quanliu1991 Sorry, I marked the wrong lines for LoRA in my question 1. It should be this piece of LoRA-related code:

vllm/vllm/core/scheduler.py

Lines 1153 to 1164 in af7fc84

if self.lora_enabled:
    lora_int_id = seq_group.lora_int_id
    assert curr_loras is not None
    assert self.lora_config is not None
    if (self.lora_enabled and lora_int_id > 0
            and lora_int_id not in curr_loras
            and len(curr_loras) >= self.lora_config.max_loras):
        # We don't have a space for another LoRA, so
        # we ignore this request for now.
        leftover_waiting_sequences.appendleft(seq_group)
        waiting_queue.popleft()
        continue

@aarnphm aarnphm self-requested a review June 5, 2025 22:35
Collaborator

@aarnphm aarnphm left a comment

I will wait for Chen's comments to be resolved before merging this in.

刘全 added 2 commits June 8, 2025 19:32
… stale num_new_tokens_cached in ComputedBlocksTracker

Signed-off-by: 刘全 <[email protected]>
@quanliu1991
Contributor Author

@aarnphm @heheda12345 We have done a comprehensive check of all _get_num_new_uncached_and_cached_tokens call sites within the schedule method, and added remove_seq calls in 5 additional branches.

  1. Summary of the 9 branches where remove_seq_from_computed_blocks_tracker() was added:
  • _schedule_swapped:

    vllm/vllm/core/scheduler.py

    Lines 900 to 904 in eaa2e51

    if num_new_tokens_uncached == 0 or not budget.can_schedule(
            num_new_tokens=num_new_tokens_uncached,
            num_new_seqs=num_new_seqs,
    ):
        break

  • _schedule_priority_preemption:
    Added remove_seq for the waiting queue.

  • _schedule_prefills:
    Added remove_seq for 3 break branches.
    Added remove_seq for 2 continue branches where leftover_waiting_sequences.appendleft(seq_group) is called.
    Added remove_seq for 2 branches marked with FINISHED_IGNORED status.

  2. Regarding FINISHED_IGNORED:
    We did not find any place where seq_groups marked as FINISHED_IGNORED are freed.
    In contrast, we did confirm that sequences in the running queue that finish are freed, as are those in _async_stopped.
    Therefore, we added a remove_seq call for FINISHED_IGNORED sequences (a sketch of that branch follows this list).

  3. On _schedule_running and the "No budget => Stop" case:
    I think the blocks used by seq_groups in the running queue won't be evicted, so I didn't add a remove_seq call.
    But I'm still not sure when the "No budget => Stop" case would occur.

    vllm/vllm/core/scheduler.py

    Lines 712 to 724 in eaa2e51

    num_uncached_new_tokens, _ = \
        self._get_num_new_uncached_and_cached_tokens(
            seq_group,
            SequenceStatus.RUNNING,
            enable_chunking,
            budget,
            partial_prefill_metadata,
        )
    num_running_tokens = num_uncached_new_tokens
    if num_running_tokens == 0:
        # No budget => Stop
        break
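
For the FINISHED_IGNORED case mentioned in point 2, the added cleanup sits roughly as follows. This is a sketch with names approximating the v0 scheduler (prompt_limit, waiting_seqs, ignored_seq_groups), not the exact diff:

# Rough shape of one FINISHED_IGNORED branch in _schedule_prefills: the
# prompt is longer than the limit, so the group is ignored and never reaches
# _allocate_and_set_running. Nothing downstream frees it, hence the cleanup.
if num_new_tokens_uncached + num_new_tokens_cached > prompt_limit:
    logger.warning(
        "Input prompt (%d tokens) is too long and exceeds limit of %d",
        num_new_tokens_uncached + num_new_tokens_cached, prompt_limit)
    # Drop the cached-token bookkeeping while the seqs are still WAITING.
    self.remove_seq_from_computed_blocks_tracker(
        seq_group, SequenceStatus.WAITING)
    for seq in waiting_seqs:
        seq.status = SequenceStatus.FINISHED_IGNORED
    ignored_seq_groups.append(seq_group)
    waiting_queue.popleft()
    continue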

@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch 3 times, most recently from e1301f5 to fa5c763 Compare June 11, 2025 02:00
Signed-off-by: 刘全 <[email protected]>
@quanliu1991 quanliu1991 force-pushed the prefix-caching-enabled-causes-incorrect branch from fa5c763 to e6e04cf Compare June 11, 2025 02:18
Collaborator

@heheda12345 heheda12345 left a comment

LGTM! Thanks for the detailed analysis.
@zh-jp @qthequartermasterman Does this PR fix your problem?

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 15, 2025
@7d1-z

7d1-z commented Jun 15, 2025

@heheda12345 Sure! Thanks for your review!

@heheda12345 heheda12345 changed the title [Bugfix][Core] Prefix caching enabled causes incorrect outputs [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker Jun 15, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) June 15, 2025 13:42
@simon-mo simon-mo disabled auto-merge June 16, 2025 04:56
@simon-mo simon-mo merged commit 92183b4 into vllm-project:main Jun 16, 2025
72 of 76 checks passed
@noooop
Collaborator

noooop commented Jun 16, 2025

@quanliu1991

This PR will break pooling (Embedding, CrossEncoding, ...) models:

from vllm import LLM

model = LLM(model="intfloat/e5-small", task="embed")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] * 1000

outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    pass

Error:

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1277, in Scheduler._schedule_default(self)
   1275 # If any requests are swapped, prioritized swapped requests.
   1276 if not self.swapped:
-> 1277     prefills = self._schedule_prefills(budget,
   1278                                        curr_loras,
   1279                                        enable_chunking=False)
   1281 if len(prefills.seq_groups
   1282        ) == 0 and self.scheduler_config.policy == "priority":
   1283     self._schedule_priority_preemption(budget)

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1195, in Scheduler._schedule_prefills(self, budget, curr_loras, enable_chunking, partial_prefill_metadata)
   1190 num_new_seqs = seq_group.get_max_num_running_seqs()
   1191 if num_new_tokens_uncached == 0 or not budget.can_schedule(
   1192         num_new_tokens=num_new_tokens_uncached,
   1193         num_new_seqs=num_new_seqs,
   1194 ):
-> 1195     self.remove_seq_from_computed_blocks_tracker(
   1196         seq_group, SequenceStatus.WAITING)
   1197     break
   1199 # Can schedule this request.

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1715, in Scheduler.remove_seq_from_computed_blocks_tracker(self, seq_group, status)
   1713 seqs = seq_group.get_seqs(status=status)
   1714 for seq in seqs:
-> 1715     self._remove_seq_from_computed_blocks_tracker(seq)

File /share/PycharmProjects/noooop_vllm/vllm/core/scheduler.py:1722, in Scheduler._remove_seq_from_computed_blocks_tracker(self, seq)
   1717 def _remove_seq_from_computed_blocks_tracker(self, seq: Sequence) -> None:
   1718     """
   1719     Free a sequence computed blocks tracker _seq_id_to_blocks_hashes
   1720     and _seq_id_to_num_tokens_computed.
   1721     """
-> 1722     self.block_manager.remove_seq_from_computed_blocks_tracker(seq)

AttributeError: 'PlaceholderBlockSpaceManager' object has no attribute 'remove_seq_from_computed_blocks_tracker'

PTAL https://buildkite.com/vllm/ci/builds/22110#01977721-962e-413e-982f-5211bbecb3d6

@quanliu1991
Contributor Author

@noooop Sorry, PlaceholderBlockSpaceManager lacks this method. I need to reconsider the scope in which this method is used and adjust the fix accordingly.
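
One possible follow-up, sketched here under the assumption that a no-op is acceptable for pooling models (the actual fix may differ), is to give PlaceholderBlockSpaceManager the missing method:

# Sketch: a defensive no-op so the scheduler can call the new hook even when
# the placeholder block manager (used by pooling/embedding models) is active.
class PlaceholderBlockSpaceManager:
    """Stripped-down stand-in for the real placeholder manager."""

    def remove_seq_from_computed_blocks_tracker(self, seq) -> None:
        # No prefix-cache bookkeeping exists here, so there is nothing to remove.
        return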

Successfully merging this pull request may close these issues.

[Bug]: Under high concurrency, kvcache will be tampered with, causing duplicate characters or gibberish in subsequent request results