
Conversation

jorgeantonio21 (Contributor) commented Jul 29, 2025

Overview:

This PR implements a comprehensive request throttler system for the HTTP service that automatically activates when all KV router workers are busy. The system includes a configurable queue depth, NATS event-driven request throttling, and graceful degradation when request throttling is unavailable.

Details:

Core Request Throttler Implementation:

  • Added HttpServiceRequestThrottler with configurable rejection time window
  • Implemented RequestThrottlerState with automatic expiration logic (see the sketch after this list)
  • Created NATS event subscription for "all workers busy" events from KV router scheduler
  • Added request throttling checks to all HTTP endpoints (chat completions, completions, embeddings, responses)
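
A minimal sketch of how this state could look, assuming a monotonic timestamp plus a window duration. The RequestThrottlerState name comes from this PR, but every field and method below is illustrative rather than the PR's actual code:

use std::time::{Duration, Instant};

struct RequestThrottlerState {
    // Set when an "all workers busy" event arrives; cleared once expired.
    throttled_since: Option<Instant>,
    // How long to keep rejecting requests after the last busy event.
    rejection_window: Duration,
}

impl RequestThrottlerState {
    fn new(rejection_window: Duration) -> Self {
        Self { throttled_since: None, rejection_window }
    }

    // Called from the NATS subscription on each "all workers busy" event.
    fn on_all_workers_busy(&mut self) {
        self.throttled_since = Some(Instant::now());
    }

    // Checked by every HTTP handler before processing; expired state is
    // cleared here so the flag stays consistent with the return value.
    fn is_throttled(&mut self) -> bool {
        match self.throttled_since {
            Some(start) if start.elapsed() <= self.rejection_window => true,
            Some(_) => {
                self.throttled_since = None;
                false
            }
            None => false,
        }
    }
}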

HTTP Service Integration:

  • Extended HttpServiceConfig with all_workers_busy_rejection_time_window parameter
  • Updated State struct to include request throttler in shared service state
  • Modified all request handlers to check request throttling before processing requests
  • Added proper error responses (429 Too Many Requests) when requests are throttled

Scheduler Queue Implementation:

  • Enhanced KV router scheduler with configurable queue depth (max_workers_busy_queue_depth)
  • Added NATS event publishing when the queue reaches its maximum depth (see the sketch after this list)
  • Implemented KVAllWorkersBusyEvent for communicating service status
  • Added automatic queue clearing after configurable duration
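
The sketch below illustrates the bounded-queue behavior from the bullets above, assuming a plain VecDeque. In the real scheduler the "queue full" signal is a serialized KVAllWorkersBusyEvent published over NATS; here it is reduced to a boolean so the sketch stays self-contained:

use std::collections::VecDeque;

struct WaitingQueue<T> {
    items: VecDeque<T>,
    max_depth: usize, // from --max-workers-busy-queue-depth, default 5
}

impl<T> WaitingQueue<T> {
    fn new(max_depth: usize) -> Self {
        Self { items: VecDeque::new(), max_depth }
    }

    // Returns true when the queue has just filled up, i.e. the moment an
    // "all workers busy" event should be published.
    fn enqueue(&mut self, request: T) -> bool {
        self.items.push_back(request);
        self.items.len() >= self.max_depth
    }

    // Queued requests are prioritized once a worker frees up.
    fn dequeue(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}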

Service Lifecycle Management:

  • Integrated request throttler monitoring into HTTP service startup
  • Added graceful error handling - service starts even if request throttling fails
  • Proper separation of concerns between model discovery and request throttling
  • Updated model watcher to remove request throttling responsibilities

Configuration & Documentation:

  • Updated Python bindings to support new request throttling configuration
  • Enhanced router README with queue depth documentation
  • Added proper error types and logging throughout the system

Where should the reviewer start?

Primary files to review:

  • lib/llm/src/http/service/request_throttler.rs - Core request throttling implementation
  • lib/llm/src/http/service/service_v2.rs - HTTP service integration
  • lib/llm/src/http/service/openai.rs - Request handler request throttling checks
  • lib/llm/src/kv_router/scheduler.rs - Queue implementation and NATS events
  • lib/llm/src/entrypoint/input/http.rs - Service startup and lifecycle management

Secondary files:

  • lib/bindings/python/rust/llm/entrypoint.rs - Python binding updates
  • docs/components/router/README.md - Documentation updates
  • lib/llm/src/discovery/watcher.rs - Cleaned up model watcher

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added configurable queue depth for handling requests when all workers are busy, allowing users to set a maximum waiting queue size.
    • Introduced a configurable time window for rejecting new requests with a 429 error when all workers are busy.
    • Implemented event-driven request throttling, automatically rejecting requests during periods of high load based on system events.
    • Added Prometheus metrics to track request-throttled requests for improved monitoring and observability.
  • Documentation

    • Updated documentation to explain new queue management and request throttling options, including configuration flags and event-driven behaviors.

copy-pr-bot bot commented Jul 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi jorgeantonio21! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

github-actions bot added the external-contribution (pull request is from an external contributor) and feat labels Jul 29, 2025
coderabbitai bot (Contributor) commented Jul 29, 2025

Walkthrough

This change introduces a configurable queue depth and rate limiting for the KV router and HTTP service. New command-line flags and configuration parameters limit the number of queued requests when all workers are busy, trigger event-based rate limiting, and propagate these settings through the frontend, router, engine, and HTTP layers. The HTTP service now enforces rate limiting and exposes new metrics, and the behavior is documented.

Changes

Cohort / File(s) — Change Summary
Frontend CLI Flags & Propagation
components/frontend/src/dynamo/frontend/main.py
Added --max-workers-busy-queue-depth and --all-workers-busy-rejection-time-window CLI flags; propagated to router and engine config.
Router CLI Flags & Propagation
components/router/src/main.rs, launch/dynamo-run/src/flags.rs
Added max_workers_busy_queue_depth CLI flag (default 5) and all_workers_busy_rejection_time_window flag; passed to router config (see the clap-style sketch after this table).
Router Config & Scheduler
lib/llm/src/kv_router.rs, lib/llm/src/kv_router/scheduler.rs, lib/bindings/python/rust/llm/entrypoint.rs, lib/llm/src/discovery/model_manager.rs
Added max_workers_busy_queue_depth config; KvScheduler now manages a bounded waiting queue, emits "all workers busy" events, and prioritizes queued requests.
Engine & Local Model
lib/llm/src/local_model.rs, launch/dynamo-run/src/lib.rs
Added all_workers_busy_rejection_time_window to model builder and model; propagated from CLI/config.
HTTP Service Rate Limiting
lib/llm/src/http/service.rs, lib/llm/src/http/service/service_v2.rs, lib/llm/src/http/service/rate_limiter.rs, lib/llm/src/entrypoint/input/http.rs
Introduced HttpServiceRateLimiter module; HTTP service now monitors "all workers busy" events and enforces rate limiting with configurable time window.
OpenAI API Handlers
lib/llm/src/http/service/openai.rs
Centralized rate limiting checks; handlers return 429 errors when rate limited; new helper and error response.
HTTP Service Metrics
lib/llm/src/http/service/metrics.rs
Added Prometheus metric for rate-limited requests, with increment and accessor methods.
Documentation
docs/components/router/README.md
Documented queue management, new flags, event publishing, and HTTP service rate limiting.
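
As referenced in the Router CLI Flags row above, the two flags could be declared roughly as follows. This is a clap-derive sketch; whether flags.rs uses exactly this form (attribute names, doc strings) is an assumption based only on the flag names and defaults reported here:

use clap::Parser;

#[derive(Parser)]
struct Flags {
    /// Queue depth at which the "all workers busy" event is published.
    #[arg(long, default_value_t = 5)]
    max_workers_busy_queue_depth: usize,

    /// Seconds to reject new requests with 429 after the event fires;
    /// request throttling is disabled when unset.
    #[arg(long)]
    all_workers_busy_rejection_time_window: Option<u64>,
}

An invocation would then look like dynamo-run ... --max-workers-busy-queue-depth 10 --all-workers-busy-rejection-time-window 30 (values illustrative).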

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend
    participant Router
    participant Scheduler
    participant Worker
    participant HTTPService
    participant RateLimiter

    Client->>Frontend: Submit request
    Frontend->>Router: Forward request
    Router->>Scheduler: Schedule request
    alt All workers busy
        Scheduler->>Scheduler: Queue request (up to max depth)
        alt Queue full
            Scheduler->>Router: Publish "all workers busy" event
            Router->>HTTPService: Emit event
            HTTPService->>RateLimiter: Activate rate limiting (for time window)
            Client->>HTTPService: New request
            HTTPService->>RateLimiter: Check rate limit
            alt Rate limited
                HTTPService-->>Client: 429 Too Many Requests
            else Not rate limited
                HTTPService->>Scheduler: Forward request
            end
        else Queue not full
            Scheduler->>Scheduler: Wait, prioritize queued requests
        end
    else Worker available
        Scheduler->>Worker: Assign request
        Worker-->>Scheduler: Processed
        Scheduler-->>Router: Complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit in queues, with ears to the ground,
Hops through the routers where limits are found.
When workers are busy and queues overflow,
Events send a warning: "Too many, oh no!"
The service now listens, rate limits in place—
429s hopping, but never a race.
🐇✨


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
lib/llm/src/http/service/rate_limiter.rs (1)

37-49: Consider automatically clearing expired rate limit state.

The current implementation checks expiration but doesn't clear the expired state, which could lead to inconsistent internal state where is_rate_limited remains true but the method returns false.

Consider this approach to maintain consistent state:

 pub fn is_rate_limited(&self) -> bool {
     if !self.is_rate_limited {
         return false;
     }

     // Check if rate limit duration has expired
     if let Some(start_time) = self.rate_limit_start {
         if start_time.elapsed() > self.rate_limit_duration {
-            return false;
+            // State has expired, but we can't mutate here due to immutable reference
+            // Consider adding a separate method or changing this to take &mut self
+            return false;
         }
     }
     true
 }

Alternatively, add a separate check_and_clear_expired() method that callers can use to maintain consistent state.
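
One possible shape for that alternative, reusing the field names visible in the diff above; a sketch, not the PR's code:

use std::time::{Duration, Instant};

struct RateLimiterState {
    is_rate_limited: bool,
    rate_limit_start: Option<Instant>,
    rate_limit_duration: Duration,
}

impl RateLimiterState {
    // Takes &mut self so expired state can be cleared eagerly, keeping the
    // internal flag consistent with the value returned to callers.
    pub fn check_and_clear_expired(&mut self) -> bool {
        if !self.is_rate_limited {
            return false;
        }
        if let Some(start_time) = self.rate_limit_start {
            if start_time.elapsed() > self.rate_limit_duration {
                self.is_rate_limited = false;
                self.rate_limit_start = None;
                return false;
            }
        }
        true
    }
}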

lib/llm/src/entrypoint/input/http.rs (1)

121-132: Consider simplifying the function signature.

The function could be simplified by passing only the necessary data instead of the entire EngineConfig:

-fn start_rate_limiter_monitoring(
-    runtime: &DistributedRuntime,
-    engine_config: &EngineConfig,
-    rate_limiter: &HttpServiceRateLimiter,
-) -> anyhow::Result<()> {
-    let namespace = engine_config.local_model().endpoint_id().namespace.clone();
-    let namespace = runtime.namespace(namespace)?;
-    rate_limiter.start_monitoring(namespace)?;
-    Ok(())
-}
+fn start_rate_limiter_monitoring(
+    runtime: &DistributedRuntime,
+    namespace_name: &str,
+    rate_limiter: &HttpServiceRateLimiter,
+) -> anyhow::Result<()> {
+    let namespace = runtime.namespace(namespace_name.to_string())?;
+    rate_limiter.start_monitoring(namespace)?;
+    Ok(())
+}

Then update the call site:

-if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, &engine_config, &rate_limiter) {
+let namespace_name = engine_config.local_model().endpoint_id().namespace.as_str();
+if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, namespace_name, &rate_limiter) {
lib/llm/src/kv_router/scheduler.rs (1)

251-251: Consider making the retry delay configurable.

The fixed 5ms sleep might not be optimal for all deployment scenarios. Different workloads might benefit from different retry intervals.

Consider making this configurable:

+                        // TODO: Make retry delay configurable
                         tokio::time::sleep(Duration::from_millis(5)).await;
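
A minimal sketch of that suggestion, lifting the hard-coded delay into a config struct; the names are hypothetical:

use std::time::Duration;

struct SchedulerRetryConfig {
    // Delay before re-attempting to schedule queued requests.
    retry_delay: Duration,
}

impl Default for SchedulerRetryConfig {
    fn default() -> Self {
        // Matches the current hard-coded 5ms.
        Self { retry_delay: Duration::from_millis(5) }
    }
}

// In the scheduling loop, the fixed sleep would become:
// tokio::time::sleep(config.retry_delay).await;
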
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fbd43a and 5af3eff.

📒 Files selected for processing (16)
  • components/frontend/src/dynamo/frontend/main.py (3 hunks)
  • components/router/src/main.rs (2 hunks)
  • docs/components/router/README.md (2 hunks)
  • launch/dynamo-run/src/flags.rs (2 hunks)
  • launch/dynamo-run/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/entrypoint.rs (5 hunks)
  • lib/llm/src/discovery/model_manager.rs (1 hunks)
  • lib/llm/src/entrypoint/input/http.rs (3 hunks)
  • lib/llm/src/http/service.rs (1 hunks)
  • lib/llm/src/http/service/metrics.rs (5 hunks)
  • lib/llm/src/http/service/openai.rs (7 hunks)
  • lib/llm/src/http/service/rate_limiter.rs (1 hunks)
  • lib/llm/src/http/service/service_v2.rs (7 hunks)
  • lib/llm/src/kv_router.rs (7 hunks)
  • lib/llm/src/kv_router/scheduler.rs (6 hunks)
  • lib/llm/src/local_model.rs (7 hunks)
🪛 GitHub Actions: Copyright Checks
lib/llm/src/http/service/rate_limiter.rs

[error] 1-1: Invalid or missing copyright header detected by copyright checker. File is out of compliance with required SPDX copyright and license header.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (38)
launch/dynamo-run/src/lib.rs (1)

44-44: LGTM!

The addition of all_workers_busy_rejection_time_window configuration to the LocalModelBuilder is consistent with the rate limiting enhancements described in the PR objectives. The parameter is properly sourced from the command-line flags and integrated into the builder pattern.

lib/llm/src/http/service.rs (1)

27-27: LGTM!

Adding the rate_limiter module to the public API is necessary for exposing the new rate limiting functionality. This aligns with the system-wide rate limiting enhancements and allows other components to access the HTTP service rate limiter.

components/router/src/main.rs (2)

52-54: LGTM!

The new max_workers_busy_queue_depth parameter is well-documented and has a reasonable default value. This configuration allows fine-tuning of when the "all workers busy" event should be triggered based on queue depth.


73-73: LGTM!

The max_workers_busy_queue_depth parameter is correctly passed to the KvRouter::new constructor, ensuring the queue depth configuration is properly propagated to the router implementation.

lib/llm/src/discovery/model_manager.rs (2)

216-216: Good refactoring!

Moving the unwrap_or_default() call to the start of the function eliminates redundant unwrapping and improves code readability. This is a clean improvement to the existing logic.


222-222: LGTM!

The max_workers_busy_queue_depth parameter is correctly passed to the KvRouter::new constructor, ensuring the new queue depth configuration is properly integrated into the KV router creation process.

launch/dynamo-run/src/flags.rs (3)

135-137: LGTM!

The max_workers_busy_queue_depth flag is well-documented and has a reasonable default value of 5. This allows users to configure when the "all workers busy" event should be triggered based on queue depth.


139-141: LGTM!

The all_workers_busy_rejection_time_window flag is appropriately defined as an optional parameter, allowing users to configure the time window for rejecting requests when all workers are busy. The optional nature provides flexibility in configuration.


191-191: LGTM!

The max_workers_busy_queue_depth parameter is correctly integrated into the KvRouterConfig construction, ensuring the queue depth configuration is properly propagated through the router configuration system.

components/frontend/src/dynamo/frontend/main.py (4)

62-67: LGTM!

The command-line argument is properly configured with a sensible default value and clear documentation.


74-79: LGTM!

The rejection time window argument is well-defined with appropriate type and default value. The help text clearly specifies the unit as seconds.


101-101: LGTM!

The parameter is correctly passed to KvRouterConfig when in KV router mode.


114-114: LGTM!

The rejection time window parameter is correctly added to the kwargs for propagation to the engine configuration.

lib/llm/src/http/service/metrics.rs (5)

35-35: LGTM! New rate limiting metric field added correctly.

The rate_limited_requests_count field follows the established pattern and naming convention for metrics in this struct.


193-210: LGTM! Rate limiting metric properly initialized.

The metric initialization follows the established pattern with:

  • Consistent naming convention using the prefix
  • Clear descriptive text
  • Appropriate labels (model, endpoint, request_type), excluding status, which makes sense for rate limiting

258-271: LGTM! Getter method implemented correctly.

The method follows the established pattern with proper documentation, parameter types, and label handling consistent with other metric accessors.


273-286: LGTM! Increment method implemented correctly.

The method follows the established pattern with proper documentation and implementation. The public visibility is appropriate since this needs to be called from HTTP request handlers.


309-309: LGTM! Metric properly registered.

The rate limiting metric is correctly registered with the Prometheus registry following the same pattern as other metrics.
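
For reference, a counter following the pattern described above could be registered with the prometheus crate roughly like this. The metric name and helper function are assumptions; only the label set and the registration pattern are stated in the review:

use prometheus::{IntCounterVec, Opts, Registry};

fn register_rate_limited_counter(
    registry: &Registry,
    prefix: &str,
) -> prometheus::Result<IntCounterVec> {
    let counter = IntCounterVec::new(
        Opts::new(
            format!("{prefix}_rate_limited_requests_total"),
            "Requests rejected with 429 while rate limiting was active",
        ),
        // No `status` label: every counted request is a rejection.
        &["model", "endpoint", "request_type"],
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// Handlers would then increment it on each rejection, e.g.:
// counter.with_label_values(&["my-model", "chat_completions", "stream"]).inc();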

lib/llm/src/local_model.rs (4)

50-50: LGTM! New configuration field added correctly.

The all_workers_busy_rejection_time_window field is properly typed as Option<u64> and correctly initialized with the default value in the Default implementation.

Also applies to: 66-66


124-130: LGTM! Builder method implemented correctly.

The method follows the established builder pattern with proper parameter handling and method chaining support.


239-239: LGTM! Field and getter properly implemented.

The field is correctly added to the LocalModel struct and the getter method follows the established naming convention and access pattern.

Also applies to: 271-273


168-168: LGTM! Field properly propagated in all build paths.

The all_workers_busy_rejection_time_window field is correctly propagated to the LocalModel instance in both the echo engine path and the main model loading path, ensuring consistency.

Also applies to: 226-226

lib/bindings/python/rust/llm/entrypoint.rs (3)

37-49: LGTM! KV router queue depth parameter added correctly.

The max_workers_busy_queue_depth parameter is properly integrated with:

  • Appropriate default value of 5 in the pyo3 signature
  • Correct type mapping to usize
  • Proper propagation to the inner Rust configuration struct

98-98: LGTM! Rejection time window parameter added correctly.

The all_workers_busy_rejection_time_window parameter is properly integrated with:

  • Correct optional type Option<u64>
  • Appropriate None default in the pyo3 signature
  • Proper propagation through the constructor and struct initialization

Also applies to: 117-117, 131-131, 153-153


181-182: LGTM! Parameter properly propagated to LocalModelBuilder.

The all_workers_busy_rejection_time_window parameter is correctly passed to the LocalModelBuilder via the appropriate builder method, completing the configuration flow from Python to Rust.

lib/llm/src/http/service/service_v2.rs (4)

12-12: LGTM! Rate limiter properly integrated into HTTP service state.

The HttpServiceRateLimiter is correctly integrated with:

  • Proper import and field addition to the State struct
  • Updated constructor to accept the rate limiter
  • Both reference and clone accessor methods for flexible usage patterns

Also applies to: 23-23, 27-31, 44-46, 52-54


99-100: LGTM! Configuration parameter properly added.

The all_workers_busy_rejection_time_window parameter is correctly integrated into the HttpServiceConfig with proper optional typing and a builder method that follows the established pattern.

Also applies to: 252-258


176-181: LGTM! Rate limiter properly initialized.

The HttpServiceRateLimiter is correctly initialized with:

  • Proper conversion from Option<u64> seconds to Option<Duration>
  • Clean integration into the State constructor

116-118: LGTM! Service-level rate limiter accessor added.

The rate_limiter_clone method provides convenient access to the rate limiter at the service level, following the established pattern for other shared resources.

lib/llm/src/http/service/openai.rs (5)

41-41: LGTM! Rate limiting helper functions well implemented.

The helper functions provide clean abstractions:

  • too_many_requests_error follows the established error response pattern
  • check_rate_limited provides consistent rate limit checking with proper metrics tracking
  • extract_request_type cleanly converts boolean streaming flag to enum

Also applies to: 91-101, 170-193


211-218: LGTM! Rate limiting properly integrated in completions handler.

The rate limiting check is well-positioned early in the request lifecycle, using the helper functions for consistent behavior and proper metrics tracking.


352-358: LGTM! Rate limiting properly integrated in embeddings handler.

The rate limiting check correctly uses RequestType::Unary for embeddings and follows the same early-check pattern as other handlers.


417-424: LGTM! Rate limiting properly integrated in chat completions handler.

The rate limiting check properly extracts the request type from the streaming flag and maintains consistency with other handlers that support streaming.


649-656: LGTM! Rate limiting properly integrated in responses handler.

The rate limiting check follows the established pattern with proper request type extraction and early placement in the request lifecycle.
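
Taken together, these comments suggest an early-check pattern along the following lines. The helper names (too_many_requests_error, check_rate_limited, extract_request_type) come from the review, but the signatures and types are stand-ins for the real axum handlers:

#[derive(Clone, Copy)]
enum RequestType {
    Unary,
    Stream,
}

fn extract_request_type(streaming: bool) -> RequestType {
    if streaming { RequestType::Stream } else { RequestType::Unary }
}

struct ErrorResponse {
    status: u16, // 429 Too Many Requests
    message: String,
}

fn too_many_requests_error() -> ErrorResponse {
    ErrorResponse {
        status: 429,
        message: "service is busy, please retry later".to_string(),
    }
}

// Placed at the top of each handler, before any request processing; the
// real version also increments the rate-limited Prometheus counter.
fn check_rate_limited(is_limited: bool, streaming: bool) -> Result<RequestType, ErrorResponse> {
    let request_type = extract_request_type(streaming);
    if is_limited {
        return Err(too_many_requests_error());
    }
    Ok(request_type)
}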

lib/llm/src/entrypoint/input/http.rs (1)

38-41: LGTM! Good approach to graceful degradation.

The error handling ensures the HTTP service continues to function even if rate limiter monitoring fails to start.

docs/components/router/README.md (1)

45-49: LGTM! Clear and comprehensive documentation.

The documentation clearly explains the new queue management feature, including the command-line argument, queue behavior, and integration with the rate limiting system via NATS events.

Also applies to: 76-84

lib/llm/src/kv_router.rs (1)

53-53: LGTM! Well-structured implementation.

The new queue depth configuration is properly integrated throughout the KV router, with sensible defaults and proper parameter propagation to the scheduler.

Also applies to: 78-78, 88-88, 101-101, 110-111, 156-156, 190-190

lib/llm/src/kv_router/scheduler.rs (1)

243-249: Consider the queue reordering behavior.

The current logic moves already-waiting requests to the front of the queue when they fail to schedule again. This could lead to starvation of requests further back in the queue if a particular request repeatedly fails to schedule.

Consider whether this is the intended behavior or if failed requests should maintain their position in the queue.
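
To make the trade-off concrete, here are the two re-queue policies side by side; identifiers are illustrative:

use std::collections::VecDeque;

type RequestId = u64;

// Current behavior as described: the request that failed to schedule is
// retried first, which can starve everything behind it if it keeps failing.
fn requeue_front(queue: &mut VecDeque<RequestId>, request: RequestId) {
    queue.push_front(request);
}

// FIFO-preserving alternative: the failed request rejoins at the back,
// trading extra retry latency for fairness.
fn requeue_back(queue: &mut VecDeque<RequestId>, request: RequestId) {
    queue.push_back(request);
}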

jorgeantonio21 force-pushed the feat/ja/all-workers-busy-rate-limiter branch from 051b1bd to 5b3037f on August 2, 2025 at 11:00
jorgeantonio21 changed the title from "feat: enhance http service with rate limiting logic based on all workers busy events" to "feat: enhance http service with request throttling logic based on all workers busy events" on Aug 4, 2025
github-actions bot commented Sep 6, 2025

This PR is stale because it has been open 30 days with no activity. Remove the Stale label or comment, or this will be closed in 5 days.

github-actions bot added the Stale label Sep 6, 2025
@github-actions

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

github-actions bot closed this Sep 11, 2025

Labels

external-contribution (pull request is from an external contributor), feat, size/XXL, Stale
