skyloevil

Fix: Correct off-by-one error in KV-Cache block allocation

This pull request addresses a critical off-by-one error in the BlockManager's logic for allocating new KV-Cache blocks during the decoding phase. The fix ensures that block
allocation is timely and atomic, preventing potential errors and ensuring the stability of the generation process.

The Problem

The previous implementation had a logical flaw in the can_append and may_append methods, which manifested in two ways:

  1. Delayed Allocation Check: The condition len(seq) % self.block_size == 1 in can_append was incorrect. It checked for the need to allocate a new block after the first token of
    that new block had already been generated. This is too late and represents a classic off-by-one error in state management. The check should occur precisely when the last block
    has just been filled.

  2. Fragmented Logic: The hashing of a completed block and the allocation of a new block were handled at different times. The hash was computed when len(seq) % self.block_size ==
    0, but the new block was only allocated when len(seq) % self.block_size == 1. These two actions should be part of a single, atomic operation that occurs when a block is
    finalized.

This flawed logic could lead to incorrect state or errors when a sequence's length crossed a block_size boundary during decoding.
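
For reference, here is a minimal sketch of the old flow as described above. The class shape and helper names (free_block_ids, finalize_last_block_hash, block_table) are illustrative assumptions, not the exact source:

```python
# Minimal sketch of the OLD logic described above (illustrative, not the exact source).
class OldBlockManagerSketch:
    def __init__(self, block_size: int, num_blocks: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))

    def can_append(self, seq) -> bool:
        # A free block is required only AFTER the first token of the new
        # block has already been generated (len % block_size == 1).
        if len(seq) % self.block_size == 1:
            return len(self.free_block_ids) >= 1
        return True

    def may_append(self, seq) -> None:
        if len(seq) % self.block_size == 0:
            # The just-filled block is hashed here for prefix caching ...
            seq.finalize_last_block_hash()  # hypothetical helper on the sequence
        elif len(seq) % self.block_size == 1:
            # ... but the new block is only allocated one step later.
            seq.block_table.append(self.free_block_ids.pop())
```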

The Solution

This PR corrects the logic to be proactive and atomic:

  1. Corrected Condition: The condition in can_append and may_append is now len(seq) > 0 and len(seq) % self.block_size == 0. This accurately identifies the exact moment a block
    has been completely filled.

  2. Atomic Operation: The logic within may_append has been consolidated. Now, when the condition is met, the following actions are performed in a single step:

    • The just-filled block has its hash computed and stored for prefix caching.
    • A new, empty block is immediately allocated from the free list.
    • This new block is appended to the sequence's block_table.

This ensures that a sequence always has a ready and available block before the next token's KV-Cache needs to be written.
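
A matching sketch of the corrected flow, under the same illustrative assumptions:

```python
# Minimal sketch of the CORRECTED logic (illustrative, not the exact source).
class NewBlockManagerSketch:
    def __init__(self, block_size: int, num_blocks: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))

    def can_append(self, seq) -> bool:
        # A free block must be available the moment the last block fills up.
        if len(seq) > 0 and len(seq) % self.block_size == 0:
            return len(self.free_block_ids) >= 1
        return True

    def may_append(self, seq) -> None:
        if len(seq) > 0 and len(seq) % self.block_size == 0:
            # Atomic step: finalize the full block and provision the next one.
            seq.finalize_last_block_hash()  # hypothetical helper on the sequence
            seq.block_table.append(self.free_block_ids.pop())
```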

Verification Process

The correctness of this change was verified through the following steps:

  1. Logical Analysis: A thorough review of the BlockManager's state machine was conducted, focusing on the lifecycle of a Sequence as it grows during decoding. The analysis
    confirmed that the original logic failed to handle the boundary condition of a block becoming full.

  2. Critical Scenario Simulation: The primary test case considered was a sequence growing to a length that is an exact multiple of block_size (e.g., len(seq) == 256).

    • Before Fix: In this scenario, the original code would have proceeded to the next generation step without a new block allocated, leading to a state where there is no valid
      memory location to write the new token's KV-Cache.
    • After Fix: The corrected logic identifies this boundary condition, finalizes the full block by computing its hash, and immediately allocates a new block, allowing decoding
      to continue seamlessly (a small check after this list illustrates the boundary).
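
As a quick, self-contained check of this boundary (block_size set to 256 here purely to match the example):

```python
# Illustrative check of the boundary condition at len(seq) == 256.
block_size = 256
seq_len = 256

old_fires = seq_len % block_size == 1                   # False: old check misses the boundary
new_fires = seq_len > 0 and seq_len % block_size == 0   # True: new check fires exactly here

assert not old_fires and new_fires
```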

This change is a crucial bug fix that enhances the robustness and correctness of the KV-Cache management system. I kindly request a review and approval for this pull request.

@xiaohajiayou
Contributor

xiaohajiayou commented Jul 7, 2025

Hello, I have some questions about this PR and would appreciate your clarification:

  1. The sequence is initialized to the length of the prompt, and tokens are appended to it later. Is the check len(seq) > 0 necessary?

  2. In the current scheduler.step() function, model_runner.call is invoked after the prefill phase to generate new tokens, which are then appended to seq.token_ids via postprocess. If the token appended in the current step fills up the last block, the KV cache block count (updated when len(seq) % block_size == 1) only needs to be adjusted in the next step. So the existing on-demand block allocation logic does not seem to need changing?

@skyloevil
Author

Hello, I have some questions about this PR and would appreciate your clarification:

  1. The sequence is initialized to the length of the prompt, and tokens are appended to it later. Is the check len(seq) > 0 necessary?
  2. In the current scheduler.step() function, model_runner.call is invoked after the prefill phase to generate new tokens, which are then appended to seq.token_ids via postprocess. If the token appended in the current step fills up the last block, the KV cache block count (updated when len(seq) % block_size == 1) only needs to be adjusted in the next step. So the existing on-demand block allocation logic does not seem to need changing?

Hi, your observation is correct: the old logic did follow an "allocate-on-demand" approach, and the system worked. However, the core purpose of this change is to make the scheduling logic more robust, atomic, and easier to understand, and to address a potential state-inconsistency issue in the old logic.

Below, I’ll address the two questions in detail.


Question 1: Is the len(seq) > 0 check necessary?

Your Viewpoint:
The sequence (Sequence) is always initialized with a prompt, so its length will never be 0. This check seems redundant.

Analysis and Answer:

Your observation is mostly correct. In the current code flow (the add_request method in llm_engine.py), the sequence is indeed created with a non-empty prompt, so len(seq) is almost never 0 when first entering the scheduler.

However, adding the len(seq) > 0 check is a best practice in defensive programming, offering the following benefits:

  1. Clarity of Intent:
    It clearly indicates that the block allocation logic should only trigger when the sequence is non-empty. For an empty sequence, 0 % block_size == 0 would hold true. Without the len(seq) > 0 guard, an empty sequence could erroneously satisfy the condition for allocating a new block, which makes no logical sense.

  2. Robustness:
    While the current code path ensures the sequence is non-empty, future refactoring or feature additions (e.g., supporting sequences that start entirely from scratch) might introduce empty sequences. This check prevents potential edge-case bugs that could be hard to debug later.

  3. Readability:
    It makes it immediately clear to readers that this branch handles sequences with content, improving code readability.

Conclusion:
Although len(seq) is currently always greater than 0, this check is not redundant. It’s a low-cost, high-reward measure that enhances robustness and guards against future changes.
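
A tiny illustration of the point (block_size chosen arbitrarily for the example):

```python
# For a hypothetical empty sequence, the modulo test alone would fire and
# request a block that nothing needs yet; the guard prevents that.
block_size = 256
empty_len = 0

print(empty_len % block_size == 0)                       # True  -> would allocate
print(empty_len > 0 and empty_len % block_size == 0)     # False -> correctly skipped
```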


Question 2: Was the old "allocate-on-demand" logic necessary to change?

Your Viewpoint:
In the current scheduler.step() flow, tokens are appended via postprocess after model_runner.call. If the token generated in the current step fills up a block, the need to allocate a new block for the next token could simply be handled in the next scheduling step. Thus, the old logic (allocating only when len(seq) % block_size == 1) seems reasonable, and the change appears unnecessary.

Analysis and Answer:

This is the core of the issue, and you accurately describe how the old logic worked. However, the old logic had a subtle but critical design flaw: scheduling decisions and resource allocation were not atomic.

Let’s compare the old and new logic using a complete scheduling cycle. Assume block_size = 4.

Old Logic Workflow (When len(seq) goes from 4 to 5):

  1. Scheduling Cycle N:

    • A sequence seq has len(seq) = 4, exactly filling its last block.
    • scheduler.schedule() is called, entering the decode phase.
    • can_append(seq) is called: 4 % 4 == 1 is False, so no free-block check is performed and the call returns True.
    • may_append(seq) is called: 4 % 4 == 0 is True. The code enters this branch, computes and updates the hash, but does not allocate a new block.
    • The scheduler marks seq as runnable, adds it to scheduled_seqs, and returns.
    • llm_engine.step() calls model_runner, generating the 5th token.
    • scheduler.postprocess() is called, seq.append_token() executes, and len(seq) becomes 5.
    • At this point, seq is in an "illegal" state: its length is 5, but its block_table still only points to blocks that can hold 4 tokens.
  2. Scheduling Cycle N+1:

    • scheduler.schedule() is called again.
    • can_append(seq) is called: len(seq) = 5, so 5 % 4 == 1 is True. Now, it checks len(self.free_block_ids) >= 1—the first time the system verifies if resources are available for the 5th token.
    • If the check passes, may_append(seq) is called, and since 5 % 4 == 1 is True, a new block is allocated.

Problem with the Old Logic:
The scheduler made the "run seq" decision in cycle N without ensuring physical memory (KV Cache Block) was reserved for the upcoming token. Resource checks and allocations were deferred to cycle N+1. This violates an important principle: scheduling decisions must ensure all required resources are ready before execution.
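
To make the timing gap concrete, here is a small replay of the branch decisions described above (a sketch that mirrors the walkthrough, not the actual scheduler code):

```python
# Replay of the OLD timing with block_size = 4 (illustrative only).
block_size = 4

def old_checks_free_block(seq_len: int) -> bool:   # old can_append condition
    return seq_len % block_size == 1

def old_allocates_block(seq_len: int) -> bool:     # old may_append allocation condition
    return seq_len % block_size == 1

# Cycle N: len(seq) == 4, the last block is exactly full.
print(old_checks_free_block(4), old_allocates_block(4))  # False False -> nothing reserved
# The 5th token is generated and appended; len(seq) becomes 5 while the
# block_table still only covers 4 tokens.

# Cycle N+1: only now does the old logic verify and allocate.
print(old_checks_free_block(5), old_allocates_block(5))  # True True -> one step late
```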

New Logic Workflow (When len(seq) goes from 4 to 5):

  1. Scheduling Cycle N:
    • A sequence seq has len(seq) = 4.
    • scheduler.schedule() is called, entering the decode phase.
    • can_append(seq) is called: len(seq) > 0 and 4 % 4 == 0 is True. It immediately checks len(self.free_block_ids) >= 1, confirming resource availability before deciding to run seq.
    • If the check passes, may_append(seq) is called, and since 4 % 4 == 0 is True, it atomically does two things:
      1. Computes the hash of the old block.
      2. Allocates a new block and appends it to seq.block_table.
    • Now, seq.block_table has a new, available block.
    • The scheduler adds seq to scheduled_seqs and returns.
    • llm_engine.step() calls model_runner, generating the 5th token.
    • scheduler.postprocess() appends the token, and len(seq) becomes 5. The new token is safely written to the pre-allocated block.

Advantages of the New Logic:
The new logic follows a "check-lock-execute" pattern:

  • Check: can_append verifies if a new block is needed and if one is available.
  • Lock/Allocate: may_append immediately allocates the block, "reserving" resources.
  • Execute: model_runner generates the token.

This ensures that any sequence sent for execution by the scheduler has all its physical resources ready. It eliminates the timing gap and state inconsistency between "decision" and "resource guarantee" in the old logic, making the scheduler's behavior more predictable and robust.
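
The same replay under the new condition shows the check and allocation landing in cycle N, before the 5th token exists (again, a sketch of the walkthrough rather than the actual code):

```python
# Replay of the NEW timing with block_size = 4 (illustrative only).
block_size = 4

def new_boundary_hit(seq_len: int) -> bool:   # new can_append / may_append condition
    return seq_len > 0 and seq_len % block_size == 0

# Cycle N: len(seq) == 4 -> check (is a free block available?), lock (allocate it),
# then execute (the model writes the 5th token into the reserved block).
print(new_boundary_hit(4))   # True
# Cycle N+1: len(seq) == 5 -> nothing to do until the next boundary at 8.
print(new_boundary_hit(5))   # False
```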

@skyloevil
Author

skyloevil commented Jul 7, 2025

Thank you for your question~ @xiaohajiayou

@xiaohajiayou
Contributor

Thank you so much for your detailed explanation. Both approaches seem feasible; they mainly differ in readability. For more details, please refer to #68.

@skyloevil
Author

@GeeeekExplorer PTAL~
