fix: update can_append and may_append logic for block allocation #71
Conversation
Hello, I have some questions about this PR and would appreciate your clarification:
|
Hi, your observation is correct: the old logic did follow an "allocate-on-demand" approach, and the system worked fine with it. The core purpose of this change, however, is to make the scheduling logic more robust, atomic, and easier to understand, and to address a potential state-inconsistency issue in the old logic. I'll address the two questions in detail below.

Question 1: Is the […]?

Your viewpoint: […]

Analysis and answer: Your observation is mostly correct. In the current code flow (the […]), […]. However, adding the […].

Conclusion: […]

Question 2: Was it necessary to change the old "allocate-on-demand" logic?

Your viewpoint: […]

Analysis and answer: This is the core of the issue. You describe the old logic's behavior accurately. However, the old logic had a subtle but critical design flaw: scheduling decisions and resource allocation were not atomic. Let's compare the old and new logic over a complete scheduling cycle. Assume […].

Old logic workflow (when […]): […]

Problem with the old logic: […]

New logic workflow (when […]): […]

Advantages of the new logic: any sequence sent for execution by the scheduler has all of its physical resources ready. This eliminates the timing gap and state inconsistency between "decision" and "resource guarantee" in the old logic, making the scheduler's behavior more predictable and robust. |
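To make that timing gap concrete, here is a minimal numeric sketch (the block size is shrunk to 4 purely for readability, and the helper names are illustrative rather than taken from the project's code) of when each rule would claim the next KV-Cache block during decoding:

```python
BLOCK_SIZE = 4  # shrunk from 256 purely for readability

def old_allocation_trigger(seq_len: int) -> bool:
    # Old allocate-on-demand rule: a new block is claimed only after the
    # sequence has already grown one token into the next block.
    return seq_len % BLOCK_SIZE == 1

def new_allocation_trigger(seq_len: int) -> bool:
    # New rule from this PR: the filled block is hashed and the next block is
    # claimed in the same step in which the last block becomes full.
    return seq_len > 0 and seq_len % BLOCK_SIZE == 0

for seq_len in range(3, 10):
    print(f"len={seq_len}: old={old_allocation_trigger(seq_len)} "
          f"new={new_allocation_trigger(seq_len)}")
# The new rule fires at len 4 and 8, one step earlier than the old rule
# (len 5 and 9); that one-step window is the gap between "decision" and
# "resource guarantee" discussed above.
```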
Thank you for your question~ @xiaohajiayou |
Thank you so much for your detailed explanation. Both methods seem to be feasible, though there is a slight difference in how they enhance readability. For more details, please refer to #68. |
@GeeeekExplorer PTAL~ |
Fix: Correct off-by-one error in KV-Cache block allocation
This pull request addresses a critical off-by-one error in the BlockManager's logic for allocating new KV-Cache blocks during the decoding phase. The fix ensures that block
allocation is timely and atomic, preventing potential errors and ensuring the stability of the generation process.
The Problem
The previous implementation had a logical flaw in the can_append and may_append methods, which manifested in two ways:
Delayed Allocation Check: The condition len(seq) % self.block_size == 1 in can_append was incorrect. It checked for the need to allocate a new block after the first token of
that new block had already been generated. This is too late and represents a classic off-by-one error in state management. The check should occur precisely when the last block
has just been filled.
Fragmented Logic: The hashing of a completed block and the allocation of a new block were handled at different times. The hash was computed when len(seq) % self.block_size ==
0, but the new block was only allocated when len(seq) % self.block_size == 1. These two actions should be part of a single, atomic operation that occurs when a block is
finalized.
This flawed logic could lead to incorrect state or errors when a sequence's length crossed a block_size boundary during decoding.
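For illustration, the fragmented behaviour described above can be paraphrased in a simplified sketch; attribute and helper names such as free_block_ids and block_hashes are assumptions made for this sketch, not necessarily the project's exact code:

```python
class OldBlockManagerSketch:
    """Paraphrase of the old, fragmented behaviour (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))
        self.block_hashes = {}  # block_id -> hash of its finalized tokens

    def can_append(self, seq: list[int]) -> bool:
        # A free block is demanded only once the sequence has already grown
        # one token into the next, still-unallocated block.
        needs_block = (len(seq) % self.block_size == 1)
        return len(self.free_block_ids) >= int(needs_block)

    def may_append(self, seq: list[int], block_table: list[int]) -> None:
        if len(seq) % self.block_size == 0:
            # Step A: the last block just filled, so record its hash ...
            self.block_hashes[block_table[-1]] = hash(tuple(seq[-self.block_size:]))
        elif len(seq) % self.block_size == 1:
            # Step B: ... but the block holding the newest token is only
            # claimed now, one scheduling step later than Step A.
            block_table.append(self.free_block_ids.pop(0))
```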
The Solution
This PR corrects the logic to be proactive and atomic:
Corrected Condition: The condition in can_append and may_append is now len(seq) > 0 and len(seq) % self.block_size == 0. This accurately identifies the exact moment a block
has been completely filled.
Atomic Operation: The logic within may_append has been consolidated. Now, when the condition is met, the following actions are performed in a single step: the hash of the just-completed block is computed and recorded, and a new block is immediately allocated for the tokens that follow.
This ensures that a sequence always has a ready and available block before the next token's KV-Cache needs to be written.
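A matching sketch of the consolidated logic, under the same illustrative assumptions as the sketch in the Problem section, might look like this:

```python
class NewBlockManagerSketch:
    """Paraphrase of the consolidated behaviour introduced by this PR."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))
        self.block_hashes = {}  # block_id -> hash of its finalized tokens

    def can_append(self, seq: list[int]) -> bool:
        # A free block is required exactly when the last block has just filled.
        needs_block = (len(seq) > 0 and len(seq) % self.block_size == 0)
        return len(self.free_block_ids) >= int(needs_block)

    def may_append(self, seq: list[int], block_table: list[int]) -> None:
        if len(seq) > 0 and len(seq) % self.block_size == 0:
            # Single atomic step: hash the block that just filled ...
            self.block_hashes[block_table[-1]] = hash(tuple(seq[-self.block_size:]))
            # ... and immediately reserve the block the next token will need.
            block_table.append(self.free_block_ids.pop(0))
```

Here the decision in can_append and the hash-plus-allocation in may_append key off the same condition, which is the atomicity the description refers to.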
Verification Process
The correctness of this change was verified through the following steps:
Logical Analysis: A thorough review of the BlockManager's state machine was conducted, focusing on the lifecycle of a Sequence as it grows during decoding. The analysis
confirmed that the original logic failed to handle the boundary condition of a block becoming full.
Critical Scenario Simulation: The primary test case considered was a sequence growing to a length that is an exact multiple of block_size (e.g., len(seq) == 256). Under the old logic, no new block would have been allocated at this point, leaving no allocated
memory location to write the new token's KV-Cache. With the corrected logic, a new block is allocated as soon as the current block is filled, which
ensures the decoding process can continue seamlessly and correctly.
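As a small, self-contained sanity check of that scenario (a toy loop over the corrected trigger condition, not the project's test suite), the following snippet grows a sequence across the 256-token boundary and confirms that a block is reserved before the 257th token needs a KV-Cache slot:

```python
BLOCK_SIZE = 256

def needs_new_block(seq_len: int) -> bool:
    # The corrected trigger condition from this PR.
    return seq_len > 0 and seq_len % BLOCK_SIZE == 0

block_table = [0]            # one block already allocated for the first tokens
next_free_block_id = 1

for seq_len in range(1, 258):        # grow the sequence one token at a time
    if needs_new_block(seq_len):
        block_table.append(next_free_block_id)   # reserved at seq_len == 256
        next_free_block_id += 1

# The second block was reserved exactly at seq_len == 256, i.e. before the
# 257th token ever needed a place to write its KV-Cache.
assert block_table == [0, 1]
```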
This change is a crucial bug fix that enhances the robustness and correctness of the KV-Cache management system. I kindly request a review and approval for this pull request.