
Conversation


@PeaBrane PeaBrane commented Jul 26, 2025

Overview:

Somewhat of a 2-in-1 PR. Changes:

  1. mocker now computes the actual local_block_hashes the KV router expects (with a salt hash of 1337)
  2. KvRouter now watches the "generate" endpoints directly via Client, so endpoint changes propagate instantly instead of going through the previous EndpointCollector and its 100ms polling-interval delay (see the sketch below)
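
A minimal sketch of the push-based update path in change 2, assuming a tokio watch channel carries the current instance list; the Instance struct and channel wiring are illustrative stand-ins, not the crate's real Client API:

use tokio::sync::watch;

#[derive(Clone, Debug)]
struct Instance {
    instance_id: i64,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(Vec::<Instance>::new());

    // React as soon as the instance set changes, instead of re-reading it on a
    // fixed 100ms polling interval.
    let watcher = tokio::spawn(async move {
        while rx.changed().await.is_ok() {
            let worker_ids: Vec<i64> =
                rx.borrow_and_update().iter().map(|i| i.instance_id).collect();
            println!("instances changed: {worker_ids:?}");
        }
    });

    // Simulate a "generate" endpoint appearing.
    tx.send(vec![Instance { instance_id: 42 }]).unwrap();
    drop(tx); // dropping the sender lets the watcher task finish
    watcher.await.unwrap();
}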

Closes #2095

Summary by CodeRabbit

  • New Features

    • Added public method to retrieve block hashes from active sequences.
    • Block size configuration is now propagated to both the KV router and mocker engine in end-to-end tests.
  • Bug Fixes

    • Improved type specificity and correctness when handling block hashes in cache events.
  • Refactor

    • Worker selection and scheduling now operate on a more general instance representation, improving flexibility and maintainability.
    • Enhanced propagation of block hash information during scheduler operations for better event correlation.


coderabbitai bot commented Jul 26, 2025

Walkthrough

The changes transition the worker selection and scheduling infrastructure from using custom endpoint and hash types to more general and precise representations, such as Instance and SequenceHash. Method signatures, internal logic, and data propagation are updated accordingly across router, scheduler, mocker, and test components. Block size configuration is now explicitly managed in tests.

Changes

File(s) Change Summary
components/router/src/main.rs Replaced ProcessedEndpoints with [Instance] in worker selection; updated imports and method signatures.
lib/llm/src/kv_router.rs Removed EndpointCollector usage; now uses dynamic instance sources; updated WorkerSelector signatures.
lib/llm/src/kv_router/scheduler.rs Replaced ProcessedEndpoints with Vec<Instance> throughout; updated scheduler and selector logic and signatures.
lib/llm/src/mocker/protocols.rs Replaced GlobalHash with SequenceHash; updated enum variants and function signatures; improved hash handling.
lib/llm/src/mocker/scheduler.rs Propagated block hashes in prefill/decode phases; updated method signatures and event forwarding.
lib/llm/src/mocker/sequence.rs Added block_hashes method to ActiveSequence; initialized token sequence with fixed seed.
tests/router/test_router_e2e_with_mockers.py Added BLOCK_SIZE constant; passed block size to router and mockers via CLI and config.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant Router as KVRouterProcess
    participant Mocker as Mocker Engine

    Test->>Router: Start with --kv-cache-block-size=BLOCK_SIZE
    Test->>Mocker: Launch with config {"block_size": BLOCK_SIZE}
    Router->>Router: Use dynamic Instance sources for worker selection
    Router->>Mocker: Interact using updated Instance and SequenceHash types
    Mocker->>Router: Respond with events containing SequenceHash and block hashes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Poem

In fields of code where hashes hop,
Old endpoints fade, new instances pop.
Blocks and sequences, hashes align,
Tests now pass with block size fine.
The router and mocker, in sync they play,
A rabbit’s joy—code hops today! 🐇✨



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/llm/src/kv_router/scheduler.rs (1)

147-191: Consider extracting duplicate instance update logic.

The instance update logic is duplicated between the watch channel handler and the no endpoints case. Consider extracting this into a helper closure to improve maintainability.

Add a helper closure at the beginning of the async move block:

 tokio::spawn(async move {
     let mut request: SchedulingRequest;
     let mut request_rx = request_rx;
     let mut pending_endpoint_update: Option<Vec<i64>> = None;
+    
+    let update_instances = |instances: &mut Vec<Instance>, 
+                           instances_rx: &mut tokio::sync::watch::Receiver<Vec<Instance>>, 
+                           pending_endpoint_update: &mut Option<Vec<i64>>| {
+        *instances = instances_rx.borrow_and_update().clone();
+        let worker_ids: Vec<i64> = instances.iter().map(|i| i.instance_id).collect();
+        *pending_endpoint_update = Some(worker_ids);
+    };
+    
     tracing::trace!("scheduler background task started");

Then use it in both locations:

-                    _ = instances_rx.changed() => {
-                        instances = instances_rx.borrow_and_update().clone();
-                        let worker_ids: Vec<i64> = instances.iter().map(|i| i.instance_id).collect();
-                        pending_endpoint_update = Some(worker_ids);
+                    _ = instances_rx.changed() => {
+                        update_instances(&mut instances, &mut instances_rx, &mut pending_endpoint_update);
                         continue 'outer;
                     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4498a77 and 13f4ed4.

📒 Files selected for processing (7)
  • components/router/src/main.rs (2 hunks)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/scheduler.rs (9 hunks)
  • lib/llm/src/mocker/protocols.rs (3 hunks)
  • lib/llm/src/mocker/scheduler.rs (6 hunks)
  • lib/llm/src/mocker/sequence.rs (2 hunks)
  • tests/router/test_router_e2e_with_mockers.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
lib/llm/src/mocker/sequence.rs (1)

Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

components/router/src/main.rs (4)

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

lib/llm/src/mocker/scheduler.rs (3)

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Learnt from: jthomson04
PR: #1363
File: lib/llm/src/block_manager/block/transfer.rs:206-216
Timestamp: 2025-06-04T18:43:04.566Z
Learning: For NIXL transfers in the KVBM system, the future returned by nixl::write_blocks_to is independent of the underlying transfer execution. The transfer begins immediately when nixl::write_blocks_to is called, and the returned future is only used for notification/completion tracking. Therefore, it's safe to drop the future when notification is not needed (notify == false).

lib/llm/src/kv_router/scheduler.rs (7)

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

lib/llm/src/mocker/protocols.rs (2)

Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/llm/src/kv_router.rs (8)

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (22)
lib/llm/src/mocker/sequence.rs (2)

93-93: LGTM: Implements proper local hash computation with required salt value.

This change correctly adds the salt hash value of 1337 as specified in the PR objectives, ensuring the mocker computes the actual local_block_hashes that the KV router expects.


127-133: LGTM: Proper interface for accessing computed block hashes.

The new block_hashes() method correctly provides access to the block hashes computed with the salt value, enabling integration with the scheduler and KV cache event processing as described in the AI summary.
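
For illustration only, a minimal sketch of per-block hash chaining seeded with the salt 1337 plus a block_hashes() accessor; the hasher, field names, and exact chaining scheme are assumptions, not the actual ActiveSequence implementation:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type BlockHash = u64;
const SALT_HASH: u64 = 1337;

#[allow(dead_code)]
struct ActiveSequence {
    tokens: Vec<u32>,
    block_size: usize,
    block_hashes: Vec<BlockHash>,
}

impl ActiveSequence {
    fn new(tokens: Vec<u32>, block_size: usize) -> Self {
        let mut block_hashes = Vec::new();
        let mut parent = SALT_HASH; // the chain starts from the salt
        for block in tokens.chunks_exact(block_size) {
            let mut hasher = DefaultHasher::new();
            parent.hash(&mut hasher);
            block.hash(&mut hasher);
            parent = hasher.finish();
            block_hashes.push(parent); // one hash per full block of tokens
        }
        Self { tokens, block_size, block_hashes }
    }

    /// Expose the hashes of the full blocks seen so far.
    fn block_hashes(&self) -> &[BlockHash] {
        &self.block_hashes
    }
}

fn main() {
    let seq = ActiveSequence::new((0..64).collect(), 16);
    println!("{:?}", seq.block_hashes()); // 4 chained block hashes
}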

tests/router/test_router_e2e_with_mockers.py (3)

21-21: LGTM: Centralized block size configuration for tests.

The new BLOCK_SIZE constant provides consistent block size configuration across the test infrastructure.


63-64: LGTM: Proper block size configuration for KV router.

Correctly passes the block size configuration to the KV router process, ensuring consistent block management across the system.


106-106: LGTM: Consistent block size configuration for mockers.

Ensures mocker engines use the same block size as the KV router, which is essential for proper block hash computation and cache coherency.

components/router/src/main.rs (2)

32-32: LGTM: Updated import aligns with architectural improvements.

The transition from ProcessedEndpoints to Instance supports the broader refactoring towards more general runtime component instance handling, as described in the AI summary.


89-89: LGTM: Interface updated for Instance-based worker selection.

The parameter type change from &ProcessedEndpoints to &[Instance] correctly updates the interface while preserving the delegation logic to DefaultWorkerSelector.

lib/llm/src/kv_router.rs (4)

37-37: LGTM: Removes dependency on EndpointCollector polling mechanism.

Commenting out the EndpointCollector import aligns with the PR objective to eliminate the 100ms polling delay in favor of direct endpoint watching.


46-46: LGTM: Adds necessary import for Instance-based architecture.

The Instance import supports the transition from ProcessedEndpoints to more general instance handling throughout the system.


59-59: LGTM: Updates trait interface for Instance-based worker selection.

The WorkerSelector trait signature change enforces the architectural transition from ProcessedEndpoints to Instance across all implementations.


156-164: LGTM: Implements direct endpoint watching for improved responsiveness.

This change successfully replaces the EndpointCollector 100ms polling mechanism with direct client instance source watching, achieving the PR objective of instant endpoint updates. The panic for static instance sources is appropriate since KV routing requires dynamic instances.
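
As a rough sketch of the dynamic-versus-static branching, using a hypothetical InstanceSource enum (the runtime's real type differs); the fail-fast panic on a static source mirrors the behavior noted above:

use tokio::sync::watch;

struct Instance {
    instance_id: i64,
}

#[allow(dead_code)]
enum InstanceSource {
    Static(Vec<Instance>),
    Dynamic(watch::Receiver<Vec<Instance>>),
}

impl InstanceSource {
    fn watch(&self) -> watch::Receiver<Vec<Instance>> {
        match self {
            // KV routing needs live membership updates, so a static list is
            // treated as a configuration error and fails fast.
            InstanceSource::Static(_) => panic!("KV routing requires dynamic instances"),
            InstanceSource::Dynamic(rx) => rx.clone(),
        }
    }
}

fn main() {
    let (_tx, rx) = watch::channel(vec![Instance { instance_id: 1 }]);
    let live = InstanceSource::Dynamic(rx).watch();
    assert_eq!(live.borrow()[0].instance_id, 1);
}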

lib/llm/src/mocker/scheduler.rs (6)

50-50: LGTM: Adds necessary import for block hash tracking.

The BlockHash import supports the new block hash propagation functionality throughout the scheduler.


127-129: LGTM: Extends try_prefill to return block hash information.

The method signature update correctly adds Vec<BlockHash> to the return tuple, enabling block hash tracking during prefill operations. The comment is appropriately updated to document the new return value.


184-184: LGTM: Extracts and returns computed block hashes.

Correctly adds sequence.block_hashes() to the return tuple, providing access to the block hashes computed with the salt value for KV cache event correlation.


407-412: LGTM: Updates call site to handle block hash return value.

The destructuring correctly handles the new block_hashes return value from the updated try_prefill method signature.


431-432: LGTM: Correlates prefill KV events with proper local block hashes.

Correctly passes block_hashes to block_response_to_kv_event, ensuring KV cache events during prefill are correlated with the local block hashes computed with the salt value.


471-472: LGTM: Correlates decode KV events with proper local block hashes.

Correctly passes sequence.block_hashes() to block_response_to_kv_event during decode operations, ensuring consistent block hash correlation throughout the request lifecycle.

lib/llm/src/kv_router/scheduler.rs (3)

33-33: Import addition looks good.

The Instance import is properly placed and necessary for the type transition from ProcessedEndpoints.


110-131: Instance type migration handled correctly.

The transition from ProcessedEndpoints to Vec<Instance> is properly implemented, with correct extraction of worker IDs.


362-414: Worker selection logic correctly adapted to Instance type.

The DefaultWorkerSelector implementation properly handles the transition from ProcessedEndpoints to Instance, maintaining the same selection logic while using instance_id as the worker identifier.
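
A schematic sketch of worker selection over a plain &[Instance] slice; the trait shape, method name, and the trivial first-instance policy are illustrative assumptions, not the real DefaultWorkerSelector, whose selection logic is unchanged by this PR:

struct Instance {
    instance_id: i64,
}

trait WorkerSelector {
    /// Pick a worker id from the currently known instances.
    fn select_worker(&self, instances: &[Instance]) -> Option<i64>;
}

struct DefaultWorkerSelector;

impl WorkerSelector for DefaultWorkerSelector {
    fn select_worker(&self, instances: &[Instance]) -> Option<i64> {
        // Placeholder policy: take the first instance's id.
        instances.first().map(|i| i.instance_id)
    }
}

fn main() {
    let instances = vec![Instance { instance_id: 7 }, Instance { instance_id: 9 }];
    assert_eq!(DefaultWorkerSelector.select_worker(&instances), Some(7));
}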

lib/llm/src/mocker/protocols.rs (2)

27-44: Type refinement from GlobalHash to SequenceHash improves type safety.

The migration to SequenceHash provides better semantic clarity and type safety for block hash handling.


224-265: Enhanced hash correlation logic properly maps sequence to local hashes.

The updated function correctly correlates external sequence hashes with local token hashes by assuming stored blocks correspond to the tail of the local_hashes array. The fail-fast behavior with explicit panic aligns with the codebase's error handling preferences.

The bounds checking at lines 241-244 ensures data integrity by panicking early if assumptions are violated.
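
A small sketch of the tail-alignment under the stated assumption that the N stored-block sequence hashes correspond to the last N local hashes, panicking (fail fast) when that invariant breaks; function and type names here are illustrative, not the crate's API:

type SequenceHash = u64;
type LocalBlockHash = u64;

fn correlate_stored_blocks(
    sequence_hashes: &[SequenceHash],
    local_hashes: &[LocalBlockHash],
) -> Vec<(SequenceHash, LocalBlockHash)> {
    let n = sequence_hashes.len();
    // Bounds check: more stored blocks than local hashes means the assumption
    // is violated, so panic early rather than emit a partially wrong event.
    assert!(
        n <= local_hashes.len(),
        "stored blocks ({n}) exceed available local hashes ({})",
        local_hashes.len()
    );
    let tail = &local_hashes[local_hashes.len() - n..];
    sequence_hashes.iter().copied().zip(tail.iter().copied()).collect()
}

fn main() {
    let pairs = correlate_stored_blocks(&[101, 102], &[1, 2, 3, 4]);
    assert_eq!(pairs, vec![(101, 3), (102, 4)]); // last two local hashes line up
}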

@PeaBrane PeaBrane enabled auto-merge (squash) July 28, 2025 17:56
@PeaBrane PeaBrane merged commit 803bfa8 into main Jul 28, 2025
11 of 12 checks passed
@PeaBrane PeaBrane deleted the rupei/mocker-local-hashes branch July 28, 2025 18:52


Development

Successfully merging this pull request may close these issues.

[FEATURE]: Router should watch endpoints not poll them

3 participants