-
Notifications
You must be signed in to change notification settings - Fork 694
feat: proper local hashes for mockers + router watches endpoints #2132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes transition the worker selection and scheduling infrastructure from using custom endpoint and hash types to more general and precise representations, such as Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Suite
participant Router as KVRouterProcess
participant Mocker as Mocker Engine
Test->>Router: Start with --kv-cache-block-size=BLOCK_SIZE
Test->>Mocker: Launch with config {"block_size": BLOCK_SIZE}
Router->>Router: Use dynamic Instance sources for worker selection
Router->>Mocker: Interact using updated Instance and SequenceHash types
Mocker->>Router: Respond with events containing SequenceHash and block hashes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~18 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/llm/src/kv_router/scheduler.rs (1)
147-191: Consider extracting duplicate instance update logic.The instance update logic is duplicated between the watch channel handler and the no endpoints case. Consider extracting this into a helper closure to improve maintainability.
Add a helper closure at the beginning of the async move block:
tokio::spawn(async move { let mut request: SchedulingRequest; let mut request_rx = request_rx; let mut pending_endpoint_update: Option<Vec<i64>> = None; + + let update_instances = |instances: &mut Vec<Instance>, + instances_rx: &mut tokio::sync::watch::Receiver<Vec<Instance>>, + pending_endpoint_update: &mut Option<Vec<i64>>| { + *instances = instances_rx.borrow_and_update().clone(); + let worker_ids: Vec<i64> = instances.iter().map(|i| i.instance_id).collect(); + *pending_endpoint_update = Some(worker_ids); + }; + tracing::trace!("scheduler background task started");Then use it in both locations:
- _ = instances_rx.changed() => { - instances = instances_rx.borrow_and_update().clone(); - let worker_ids: Vec<i64> = instances.iter().map(|i| i.instance_id).collect(); - pending_endpoint_update = Some(worker_ids); + _ = instances_rx.changed() => { + update_instances(&mut instances, &mut instances_rx, &mut pending_endpoint_update); continue 'outer; }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/router/src/main.rs(2 hunks)lib/llm/src/kv_router.rs(5 hunks)lib/llm/src/kv_router/scheduler.rs(9 hunks)lib/llm/src/mocker/protocols.rs(3 hunks)lib/llm/src/mocker/scheduler.rs(6 hunks)lib/llm/src/mocker/sequence.rs(2 hunks)tests/router/test_router_e2e_with_mockers.py(3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
lib/llm/src/mocker/sequence.rs (1)
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
components/router/src/main.rs (4)
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
lib/llm/src/mocker/scheduler.rs (3)
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: jthomson04
PR: #1363
File: lib/llm/src/block_manager/block/transfer.rs:206-216
Timestamp: 2025-06-04T18:43:04.566Z
Learning: For NIXL transfers in the KVBM system, the future returned by nixl::write_blocks_to is independent of the underlying transfer execution. The transfer begins immediately when nixl::write_blocks_to is called, and the returned future is only used for notification/completion tracking. Therefore, it's safe to drop the future when notification is not needed (notify == false).
lib/llm/src/kv_router/scheduler.rs (7)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/llm/src/mocker/protocols.rs (2)
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/llm/src/kv_router.rs (8)
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (22)
lib/llm/src/mocker/sequence.rs (2)
93-93: LGTM: Implements proper local hash computation with required salt value.This change correctly adds the salt hash value of 1337 as specified in the PR objectives, ensuring the mocker computes the actual
local_block_hashesthat the KV router expects.
127-133: LGTM: Proper interface for accessing computed block hashes.The new
block_hashes()method correctly provides access to the block hashes computed with the salt value, enabling integration with the scheduler and KV cache event processing as described in the AI summary.tests/router/test_router_e2e_with_mockers.py (3)
21-21: LGTM: Centralized block size configuration for tests.The new
BLOCK_SIZEconstant provides consistent block size configuration across the test infrastructure.
63-64: LGTM: Proper block size configuration for KV router.Correctly passes the block size configuration to the KV router process, ensuring consistent block management across the system.
106-106: LGTM: Consistent block size configuration for mockers.Ensures mocker engines use the same block size as the KV router, which is essential for proper block hash computation and cache coherency.
components/router/src/main.rs (2)
32-32: LGTM: Updated import aligns with architectural improvements.The transition from
ProcessedEndpointstoInstancesupports the broader refactoring towards more general runtime component instance handling, as described in the AI summary.
89-89: LGTM: Interface updated for Instance-based worker selection.The parameter type change from
&ProcessedEndpointsto&[Instance]correctly updates the interface while preserving the delegation logic toDefaultWorkerSelector.lib/llm/src/kv_router.rs (4)
37-37: LGTM: Removes dependency on EndpointCollector polling mechanism.Commenting out the
EndpointCollectorimport aligns with the PR objective to eliminate the 100ms polling delay in favor of direct endpoint watching.
46-46: LGTM: Adds necessary import for Instance-based architecture.The
Instanceimport supports the transition fromProcessedEndpointsto more general instance handling throughout the system.
59-59: LGTM: Updates trait interface for Instance-based worker selection.The
WorkerSelectortrait signature change enforces the architectural transition fromProcessedEndpointstoInstanceacross all implementations.
156-164: LGTM: Implements direct endpoint watching for improved responsiveness.This change successfully replaces the
EndpointCollector100ms polling mechanism with direct client instance source watching, achieving the PR objective of instant endpoint updates. The panic for static instance sources is appropriate since KV routing requires dynamic instances.lib/llm/src/mocker/scheduler.rs (6)
50-50: LGTM: Adds necessary import for block hash tracking.The
BlockHashimport supports the new block hash propagation functionality throughout the scheduler.
127-129: LGTM: Extends try_prefill to return block hash information.The method signature update correctly adds
Vec<BlockHash>to the return tuple, enabling block hash tracking during prefill operations. The comment is appropriately updated to document the new return value.
184-184: LGTM: Extracts and returns computed block hashes.Correctly adds
sequence.block_hashes()to the return tuple, providing access to the block hashes computed with the salt value for KV cache event correlation.
407-412: LGTM: Updates call site to handle block hash return value.The destructuring correctly handles the new
block_hashesreturn value from the updatedtry_prefillmethod signature.
431-432: LGTM: Correlates prefill KV events with proper local block hashes.Correctly passes
block_hashestoblock_response_to_kv_event, ensuring KV cache events during prefill are correlated with the local block hashes computed with the salt value.
471-472: LGTM: Correlates decode KV events with proper local block hashes.Correctly passes
sequence.block_hashes()toblock_response_to_kv_eventduring decode operations, ensuring consistent block hash correlation throughout the request lifecycle.lib/llm/src/kv_router/scheduler.rs (3)
33-33: Import addition looks good.The Instance import is properly placed and necessary for the type transition from ProcessedEndpoints.
110-131: Instance type migration handled correctly.The transition from ProcessedEndpoints to Vec is properly implemented, with correct extraction of worker IDs.
362-414: Worker selection logic correctly adapted to Instance type.The DefaultWorkerSelector implementation properly handles the transition from ProcessedEndpoints to Instance, maintaining the same selection logic while using instance_id as the worker identifier.
lib/llm/src/mocker/protocols.rs (2)
27-44: Type refinement from GlobalHash to SequenceHash improves type safety.The migration to SequenceHash provides better semantic clarity and type safety for block hash handling.
224-265: Enhanced hash correlation logic properly maps sequence to local hashes.The updated function correctly correlates external sequence hashes with local token hashes by assuming stored blocks correspond to the tail of the local_hashes array. The fail-fast behavior with explicit panic aligns with the codebase's error handling preferences.
The bounds checking at lines 241-244 ensures data integrity by panicking early if assumptions are violated.
Overview:
Somewhat of a 2-in-1 PR. Changes:
mockernow computes the actuallocal_block_hashesthe Kv router expects (with salt hash of1337)KvRouternow directly watches the "generate" endpoints usingClientfor instant updates of endpoint changes, instead of the previousEndpointCollector, which incurred a 100ms polling interval delayCloses #2095
Summary by CodeRabbit
New Features
Bug Fixes
Refactor