
Conversation

jorgeantonio21 (Contributor) commented Jul 29, 2025

Overview:

This PR implements a comprehensive request throttler system for the HTTP service that automatically activates when all KV router workers are busy. The system includes a configurable queue depth, NATS event-driven request throttling, and graceful degradation when request throttling is unavailable.

Details:

Core Request Throttler Implementation:

  • Added HttpServiceRequestThrottler with configurable rejection time window
  • Implemented RequestThrottlerState with automatic expiration logic (see the sketch after this list)
  • Created NATS event subscription for "all workers busy" events from KV router scheduler
  • Added request throttling checks to all HTTP endpoints (chat completions, completions, embeddings, responses)
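
A minimal sketch of how this state could look, assuming a monotonic timestamp plus a window duration. The RequestThrottlerState name comes from this PR, but every field and method below is illustrative rather than the PR's actual code:

use std::time::{Duration, Instant};

struct RequestThrottlerState {
    // Set when an "all workers busy" event arrives; cleared once expired.
    throttled_since: Option<Instant>,
    // How long to keep rejecting requests after the last busy event.
    rejection_window: Duration,
}

impl RequestThrottlerState {
    fn new(rejection_window: Duration) -> Self {
        Self { throttled_since: None, rejection_window }
    }

    // Called from the NATS subscription on each "all workers busy" event.
    fn on_all_workers_busy(&mut self) {
        self.throttled_since = Some(Instant::now());
    }

    // Checked by every HTTP handler before processing; expired state is
    // cleared here so the flag stays consistent with the return value.
    fn is_throttled(&mut self) -> bool {
        match self.throttled_since {
            Some(start) if start.elapsed() <= self.rejection_window => true,
            Some(_) => {
                self.throttled_since = None;
                false
            }
            None => false,
        }
    }
}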

HTTP Service Integration:

  • Extended HttpServiceConfig with all_workers_busy_rejection_time_window parameter
  • Updated State struct to include request throttler in shared service state
  • Modified all request handlers to check request throttling before processing requests
  • Added proper error responses (429 Too Many Requests) when requests are throttled

Scheduler Queue Implementation:

  • Enhanced KV router scheduler with configurable queue depth (max_workers_busy_queue_depth)
  • Added NATS event publishing when the queue reaches its maximum depth (see the sketch after this list)
  • Implemented KVAllWorkersBusyEvent for communicating service status
  • Added automatic queue clearing after configurable duration
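
The sketch below illustrates the bounded-queue behavior from the bullets above, assuming a plain VecDeque. In the real scheduler the "queue full" signal is a serialized KVAllWorkersBusyEvent published over NATS; here it is reduced to a boolean so the sketch stays self-contained:

use std::collections::VecDeque;

struct WaitingQueue<T> {
    items: VecDeque<T>,
    max_depth: usize, // from --max-workers-busy-queue-depth, default 5
}

impl<T> WaitingQueue<T> {
    fn new(max_depth: usize) -> Self {
        Self { items: VecDeque::new(), max_depth }
    }

    // Returns true when the queue has just filled up, i.e. the moment an
    // "all workers busy" event should be published.
    fn enqueue(&mut self, request: T) -> bool {
        self.items.push_back(request);
        self.items.len() >= self.max_depth
    }

    // Queued requests are prioritized once a worker frees up.
    fn dequeue(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}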

Service Lifecycle Management:

  • Integrated request throttler monitoring into HTTP service startup
  • Added graceful error handling - service starts even if request throttling fails
  • Proper separation of concerns between model discovery and request throttling
  • Updated model watcher to remove request throttling responsibilities

Configuration & Documentation:

  • Updated Python bindings to support new request throttling configuration
  • Enhanced router README with queue depth documentation
  • Added proper error types and logging throughout the system

Where should the reviewer start?

Primary files to review:

  • lib/llm/src/http/service/request_throttler.rs - Core request throttling implementation
  • lib/llm/src/http/service/service_v2.rs - HTTP service integration
  • lib/llm/src/http/service/openai.rs - Request handler request throttling checks
  • lib/llm/src/kv_router/scheduler.rs - Queue implementation and NATS events
  • lib/llm/src/entrypoint/input/http.rs - Service startup and lifecycle management

Secondary files:

  • lib/bindings/python/rust/llm/entrypoint.rs - Python binding updates
  • docs/components/router/README.md - Documentation updates
  • lib/llm/src/discovery/watcher.rs - Cleaned up model watcher

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added configurable queue depth for handling requests when all workers are busy, allowing users to set a maximum waiting queue size.
    • Introduced a configurable time window for rejecting new requests with a 429 error when all workers are busy.
    • Implemented event-driven request throttling, automatically rejecting requests during periods of high load based on system events.
    • Added Prometheus metrics to track request-throttled requests for improved monitoring and observability.
  • Documentation

    • Updated documentation to explain new queue management and request throttling options, including configuration flags and event-driven behaviors.

copy-pr-bot bot commented Jul 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi jorgeantonio21! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

github-actions bot added the external-contribution (pull request is from an external contributor) and feat labels Jul 29, 2025
coderabbitai bot (Contributor) commented Jul 29, 2025

Walkthrough

This change introduces a configurable queue depth and rate limiting for the KV router and HTTP service. New command-line flags and configuration parameters limit the number of queued requests when all workers are busy, trigger event-based rate limiting, and propagate these settings through the frontend, router, engine, and HTTP layers. The HTTP service now enforces rate limiting and exposes new metrics, and the behavior is documented.

Changes

Cohort / File(s) — Change Summary
Frontend CLI Flags & Propagation
components/frontend/src/dynamo/frontend/main.py
Added --max-workers-busy-queue-depth and --all-workers-busy-rejection-time-window CLI flags; propagated to router and engine config.
Router CLI Flags & Propagation
components/router/src/main.rs, launch/dynamo-run/src/flags.rs
Added max_workers_busy_queue_depth CLI flag (default 5) and all_workers_busy_rejection_time_window flag; passed to router config (see the clap-style sketch after this table).
Router Config & Scheduler
lib/llm/src/kv_router.rs, lib/llm/src/kv_router/scheduler.rs, lib/bindings/python/rust/llm/entrypoint.rs, lib/llm/src/discovery/model_manager.rs
Added max_workers_busy_queue_depth config; KvScheduler now manages a bounded waiting queue, emits "all workers busy" events, and prioritizes queued requests.
Engine & Local Model
lib/llm/src/local_model.rs, launch/dynamo-run/src/lib.rs
Added all_workers_busy_rejection_time_window to model builder and model; propagated from CLI/config.
HTTP Service Rate Limiting
lib/llm/src/http/service.rs, lib/llm/src/http/service/service_v2.rs, lib/llm/src/http/service/rate_limiter.rs, lib/llm/src/entrypoint/input/http.rs
Introduced HttpServiceRateLimiter module; HTTP service now monitors "all workers busy" events and enforces rate limiting with configurable time window.
OpenAI API Handlers
lib/llm/src/http/service/openai.rs
Centralized rate limiting checks; handlers return 429 errors when rate limited; new helper and error response.
HTTP Service Metrics
lib/llm/src/http/service/metrics.rs
Added Prometheus metric for rate-limited requests, with increment and accessor methods.
Documentation
docs/components/router/README.md
Documented queue management, new flags, event publishing, and HTTP service rate limiting.
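
As referenced in the Router CLI Flags row above, the two flags could be declared roughly as follows. This is a clap-derive sketch; whether flags.rs uses exactly this form (attribute names, doc strings) is an assumption based only on the flag names and defaults reported here:

use clap::Parser;

#[derive(Parser)]
struct Flags {
    /// Queue depth at which the "all workers busy" event is published.
    #[arg(long, default_value_t = 5)]
    max_workers_busy_queue_depth: usize,

    /// Seconds to reject new requests with 429 after the event fires;
    /// request throttling is disabled when unset.
    #[arg(long)]
    all_workers_busy_rejection_time_window: Option<u64>,
}

An invocation would then look like dynamo-run ... --max-workers-busy-queue-depth 10 --all-workers-busy-rejection-time-window 30 (values illustrative).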

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend
    participant Router
    participant Scheduler
    participant Worker
    participant HTTPService
    participant RateLimiter

    Client->>Frontend: Submit request
    Frontend->>Router: Forward request
    Router->>Scheduler: Schedule request
    alt All workers busy
        Scheduler->>Scheduler: Queue request (up to max depth)
        alt Queue full
            Scheduler->>Router: Publish "all workers busy" event
            Router->>HTTPService: Emit event
            HTTPService->>RateLimiter: Activate rate limiting (for time window)
            Client->>HTTPService: New request
            HTTPService->>RateLimiter: Check rate limit
            alt Rate limited
                HTTPService-->>Client: 429 Too Many Requests
            else Not rate limited
                HTTPService->>Scheduler: Forward request
            end
        else Queue not full
            Scheduler->>Scheduler: Wait, prioritize queued requests
        end
    else Worker available
        Scheduler->>Worker: Assign request
        Worker-->>Scheduler: Processed
        Scheduler-->>Router: Complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit in queues, with ears to the ground,
Hops through the routers where limits are found.
When workers are busy and queues overflow,
Events send a warning: "Too many, oh no!"
The service now listens, rate limits in place—
429s hopping, but never a race.
🐇✨


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
lib/llm/src/http/service/rate_limiter.rs (1)

37-49: Consider automatically clearing expired rate limit state.

The current implementation checks expiration but doesn't clear the expired state, which could lead to inconsistent internal state where is_rate_limited remains true but the method returns false.

Consider this approach to maintain consistent state:

 pub fn is_rate_limited(&self) -> bool {
     if !self.is_rate_limited {
         return false;
     }

     // Check if rate limit duration has expired
     if let Some(start_time) = self.rate_limit_start {
         if start_time.elapsed() > self.rate_limit_duration {
-            return false;
+            // State has expired, but we can't mutate here due to immutable reference
+            // Consider adding a separate method or changing this to take &mut self
+            return false;
         }
     }
     true
 }

Alternatively, add a separate check_and_clear_expired() method that callers can use to maintain consistent state.
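
One possible shape for that alternative, reusing the field names visible in the diff above; a sketch, not the PR's code:

use std::time::{Duration, Instant};

struct RateLimiterState {
    is_rate_limited: bool,
    rate_limit_start: Option<Instant>,
    rate_limit_duration: Duration,
}

impl RateLimiterState {
    // Takes &mut self so expired state can be cleared eagerly, keeping the
    // internal flag consistent with the value returned to callers.
    pub fn check_and_clear_expired(&mut self) -> bool {
        if !self.is_rate_limited {
            return false;
        }
        if let Some(start_time) = self.rate_limit_start {
            if start_time.elapsed() > self.rate_limit_duration {
                self.is_rate_limited = false;
                self.rate_limit_start = None;
                return false;
            }
        }
        true
    }
}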

lib/llm/src/entrypoint/input/http.rs (1)

121-132: Consider simplifying the function signature.

The function could be simplified by passing only the necessary data instead of the entire EngineConfig:

-fn start_rate_limiter_monitoring(
-    runtime: &DistributedRuntime,
-    engine_config: &EngineConfig,
-    rate_limiter: &HttpServiceRateLimiter,
-) -> anyhow::Result<()> {
-    let namespace = engine_config.local_model().endpoint_id().namespace.clone();
-    let namespace = runtime.namespace(namespace)?;
-    rate_limiter.start_monitoring(namespace)?;
-    Ok(())
-}
+fn start_rate_limiter_monitoring(
+    runtime: &DistributedRuntime,
+    namespace_name: &str,
+    rate_limiter: &HttpServiceRateLimiter,
+) -> anyhow::Result<()> {
+    let namespace = runtime.namespace(namespace_name.to_string())?;
+    rate_limiter.start_monitoring(namespace)?;
+    Ok(())
+}

Then update the call site:

-if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, &engine_config, &rate_limiter) {
+let namespace_name = engine_config.local_model().endpoint_id().namespace.as_str();
+if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, namespace_name, &rate_limiter) {
lib/llm/src/kv_router/scheduler.rs (1)

251-251: Consider making the retry delay configurable.

The fixed 5ms sleep might not be optimal for all deployment scenarios. Different workloads might benefit from different retry intervals.

Consider making this configurable:

+                        // TODO: Make retry delay configurable
                         tokio::time::sleep(Duration::from_millis(5)).await;
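
A minimal sketch of that suggestion, lifting the hard-coded delay into a config struct; the names are hypothetical:

use std::time::Duration;

struct SchedulerRetryConfig {
    // Delay before re-attempting to schedule queued requests.
    retry_delay: Duration,
}

impl Default for SchedulerRetryConfig {
    fn default() -> Self {
        // Matches the current hard-coded 5ms.
        Self { retry_delay: Duration::from_millis(5) }
    }
}

// In the scheduling loop, the fixed sleep would become:
// tokio::time::sleep(config.retry_delay).await;
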
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fbd43a and 5af3eff.

📒 Files selected for processing (16)
  • components/frontend/src/dynamo/frontend/main.py (3 hunks)
  • components/router/src/main.rs (2 hunks)
  • docs/components/router/README.md (2 hunks)
  • launch/dynamo-run/src/flags.rs (2 hunks)
  • launch/dynamo-run/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/entrypoint.rs (5 hunks)
  • lib/llm/src/discovery/model_manager.rs (1 hunks)
  • lib/llm/src/entrypoint/input/http.rs (3 hunks)
  • lib/llm/src/http/service.rs (1 hunks)
  • lib/llm/src/http/service/metrics.rs (5 hunks)
  • lib/llm/src/http/service/openai.rs (7 hunks)
  • lib/llm/src/http/service/rate_limiter.rs (1 hunks)
  • lib/llm/src/http/service/service_v2.rs (7 hunks)
  • lib/llm/src/kv_router.rs (7 hunks)
  • lib/llm/src/kv_router/scheduler.rs (6 hunks)
  • lib/llm/src/local_model.rs (7 hunks)
🪛 GitHub Actions: Copyright Checks
lib/llm/src/http/service/rate_limiter.rs

[error] 1-1: Invalid or missing copyright header detected by copyright checker. File is out of compliance with required SPDX copyright and license header.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (38)
launch/dynamo-run/src/lib.rs (1)

44-44: LGTM!

The addition of all_workers_busy_rejection_time_window configuration to the LocalModelBuilder is consistent with the rate limiting enhancements described in the PR objectives. The parameter is properly sourced from the command-line flags and integrated into the builder pattern.

lib/llm/src/http/service.rs (1)

27-27: LGTM!

Adding the rate_limiter module to the public API is necessary for exposing the new rate limiting functionality. This aligns with the system-wide rate limiting enhancements and allows other components to access the HTTP service rate limiter.

components/router/src/main.rs (2)

52-54: LGTM!

The new max_workers_busy_queue_depth parameter is well-documented and has a reasonable default value. This configuration allows fine-tuning of when the "all workers busy" event should be triggered based on queue depth.


73-73: LGTM!

The max_workers_busy_queue_depth parameter is correctly passed to the KvRouter::new constructor, ensuring the queue depth configuration is properly propagated to the router implementation.

lib/llm/src/discovery/model_manager.rs (2)

216-216: Good refactoring!

Moving the unwrap_or_default() call to the start of the function eliminates redundant unwrapping and improves code readability. This is a clean improvement to the existing logic.


222-222: LGTM!

The max_workers_busy_queue_depth parameter is correctly passed to the KvRouter::new constructor, ensuring the new queue depth configuration is properly integrated into the KV router creation process.

launch/dynamo-run/src/flags.rs (3)

135-137: LGTM!

The max_workers_busy_queue_depth flag is well-documented and has a reasonable default value of 5. This allows users to configure when the "all workers busy" event should be triggered based on queue depth.


139-141: LGTM!

The all_workers_busy_rejection_time_window flag is appropriately defined as an optional parameter, allowing users to configure the time window for rejecting requests when all workers are busy. The optional nature provides flexibility in configuration.


191-191: LGTM!

The max_workers_busy_queue_depth parameter is correctly integrated into the KvRouterConfig construction, ensuring the queue depth configuration is properly propagated through the router configuration system.

components/frontend/src/dynamo/frontend/main.py (4)

62-67: LGTM!

The command-line argument is properly configured with a sensible default value and clear documentation.


74-79: LGTM!

The rejection time window argument is well-defined with appropriate type and default value. The help text clearly specifies the unit as seconds.


101-101: LGTM!

The parameter is correctly passed to KvRouterConfig when in KV router mode.


114-114: LGTM!

The rejection time window parameter is correctly added to the kwargs for propagation to the engine configuration.

lib/llm/src/http/service/metrics.rs (5)

35-35: LGTM! New rate limiting metric field added correctly.

The rate_limited_requests_count field follows the established pattern and naming convention for metrics in this struct.


193-210: LGTM! Rate limiting metric properly initialized.

The metric initialization follows the established pattern with:

  • Consistent naming convention using the prefix
  • Clear descriptive text
  • Appropriate labels (model, endpoint, request_type), excluding status, which makes sense for rate limiting

258-271: LGTM! Getter method implemented correctly.

The method follows the established pattern with proper documentation, parameter types, and label handling consistent with other metric accessors.


273-286: LGTM! Increment method implemented correctly.

The method follows the established pattern with proper documentation and implementation. The public visibility is appropriate since this needs to be called from HTTP request handlers.


309-309: LGTM! Metric properly registered.

The rate limiting metric is correctly registered with the Prometheus registry following the same pattern as other metrics.
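
For reference, a counter following the pattern described above could be registered with the prometheus crate roughly like this. The metric name and helper function are assumptions; only the label set and the registration pattern are stated in the review:

use prometheus::{IntCounterVec, Opts, Registry};

fn register_rate_limited_counter(
    registry: &Registry,
    prefix: &str,
) -> prometheus::Result<IntCounterVec> {
    let counter = IntCounterVec::new(
        Opts::new(
            format!("{prefix}_rate_limited_requests_total"),
            "Requests rejected with 429 while rate limiting was active",
        ),
        // No `status` label: every counted request is a rejection.
        &["model", "endpoint", "request_type"],
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// Handlers would then increment it on each rejection, e.g.:
// counter.with_label_values(&["my-model", "chat_completions", "stream"]).inc();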

lib/llm/src/local_model.rs (4)

50-50: LGTM! New configuration field added correctly.

The all_workers_busy_rejection_time_window field is properly typed as Option<u64> and correctly initialized with the default value in the Default implementation.

Also applies to: 66-66


124-130: LGTM! Builder method implemented correctly.

The method follows the established builder pattern with proper parameter handling and method chaining support.


239-239: LGTM! Field and getter properly implemented.

The field is correctly added to the LocalModel struct and the getter method follows the established naming convention and access pattern.

Also applies to: 271-273


168-168: LGTM! Field properly propagated in all build paths.

The all_workers_busy_rejection_time_window field is correctly propagated to the LocalModel instance in both the echo engine path and the main model loading path, ensuring consistency.

Also applies to: 226-226

lib/bindings/python/rust/llm/entrypoint.rs (3)

37-49: LGTM! KV router queue depth parameter added correctly.

The max_workers_busy_queue_depth parameter is properly integrated with:

  • Appropriate default value of 5 in the pyo3 signature
  • Correct type mapping to usize
  • Proper propagation to the inner Rust configuration struct

98-98: LGTM! Rejection time window parameter added correctly.

The all_workers_busy_rejection_time_window parameter is properly integrated with:

  • Correct optional type Option<u64>
  • Appropriate None default in the pyo3 signature
  • Proper propagation through the constructor and struct initialization

Also applies to: 117-117, 131-131, 153-153


181-182: LGTM! Parameter properly propagated to LocalModelBuilder.

The all_workers_busy_rejection_time_window parameter is correctly passed to the LocalModelBuilder via the appropriate builder method, completing the configuration flow from Python to Rust.

lib/llm/src/http/service/service_v2.rs (4)

12-12: LGTM! Rate limiter properly integrated into HTTP service state.

The HttpServiceRateLimiter is correctly integrated with:

  • Proper import and field addition to the State struct
  • Updated constructor to accept the rate limiter
  • Both reference and clone accessor methods for flexible usage patterns

Also applies to: 23-23, 27-31, 44-46, 52-54


99-100: LGTM! Configuration parameter properly added.

The all_workers_busy_rejection_time_window parameter is correctly integrated into the HttpServiceConfig with proper optional typing and a builder method that follows the established pattern.

Also applies to: 252-258


176-181: LGTM! Rate limiter properly initialized.

The HttpServiceRateLimiter is correctly initialized with:

  • Proper conversion from Option<u64> seconds to Option<Duration>
  • Clean integration into the State constructor

116-118: LGTM! Service-level rate limiter accessor added.

The rate_limiter_clone method provides convenient access to the rate limiter at the service level, following the established pattern for other shared resources.

lib/llm/src/http/service/openai.rs (5)

41-41: LGTM! Rate limiting helper functions well implemented.

The helper functions provide clean abstractions:

  • too_many_requests_error follows the established error response pattern
  • check_rate_limited provides consistent rate limit checking with proper metrics tracking
  • extract_request_type cleanly converts boolean streaming flag to enum

Also applies to: 91-101, 170-193


211-218: LGTM! Rate limiting properly integrated in completions handler.

The rate limiting check is well-positioned early in the request lifecycle, using the helper functions for consistent behavior and proper metrics tracking.


352-358: LGTM! Rate limiting properly integrated in embeddings handler.

The rate limiting check correctly uses RequestType::Unary for embeddings and follows the same early-check pattern as other handlers.


417-424: LGTM! Rate limiting properly integrated in chat completions handler.

The rate limiting check properly extracts the request type from the streaming flag and maintains consistency with other handlers that support streaming.


649-656: LGTM! Rate limiting properly integrated in responses handler.

The rate limiting check follows the established pattern with proper request type extraction and early placement in the request lifecycle.
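
Taken together, these comments suggest an early-check pattern along the following lines. The helper names (too_many_requests_error, check_rate_limited, extract_request_type) come from the review, but the signatures and types are stand-ins for the real axum handlers:

#[derive(Clone, Copy)]
enum RequestType {
    Unary,
    Stream,
}

fn extract_request_type(streaming: bool) -> RequestType {
    if streaming { RequestType::Stream } else { RequestType::Unary }
}

struct ErrorResponse {
    status: u16, // 429 Too Many Requests
    message: String,
}

fn too_many_requests_error() -> ErrorResponse {
    ErrorResponse {
        status: 429,
        message: "service is busy, please retry later".to_string(),
    }
}

// Placed at the top of each handler, before any request processing; the
// real version also increments the rate-limited Prometheus counter.
fn check_rate_limited(is_limited: bool, streaming: bool) -> Result<RequestType, ErrorResponse> {
    let request_type = extract_request_type(streaming);
    if is_limited {
        return Err(too_many_requests_error());
    }
    Ok(request_type)
}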

lib/llm/src/entrypoint/input/http.rs (1)

38-41: LGTM! Good approach to graceful degradation.

The error handling ensures the HTTP service continues to function even if rate limiter monitoring fails to start.

docs/components/router/README.md (1)

45-49: LGTM! Clear and comprehensive documentation.

The documentation clearly explains the new queue management feature, including the command-line argument, queue behavior, and integration with the rate limiting system via NATS events.

Also applies to: 76-84

lib/llm/src/kv_router.rs (1)

53-53: LGTM! Well-structured implementation.

The new queue depth configuration is properly integrated throughout the KV router, with sensible defaults and proper parameter propagation to the scheduler.

Also applies to: 78-78, 88-88, 101-101, 110-111, 156-156, 190-190

lib/llm/src/kv_router/scheduler.rs (1)

243-249: Consider the queue reordering behavior.

The current logic moves already-waiting requests to the front of the queue when they fail to schedule again. This could lead to starvation of requests further back in the queue if a particular request repeatedly fails to schedule.

Consider whether this is the intended behavior or if failed requests should maintain their position in the queue.
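
To make the trade-off concrete, here are the two re-queue policies side by side; identifiers are illustrative:

use std::collections::VecDeque;

type RequestId = u64;

// Current behavior as described: the request that failed to schedule is
// retried first, which can starve everything behind it if it keeps failing.
fn requeue_front(queue: &mut VecDeque<RequestId>, request: RequestId) {
    queue.push_front(request);
}

// FIFO-preserving alternative: the failed request rejoins at the back,
// trading extra retry latency for fairness.
fn requeue_back(queue: &mut VecDeque<RequestId>, request: RequestId) {
    queue.push_back(request);
}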

jorgeantonio21 force-pushed the feat/ja/all-workers-busy-rate-limiter branch from 051b1bd to 5b3037f on August 2, 2025 at 11:00
jorgeantonio21 changed the title from "feat: enhance http service with rate limiting logic based on all workers busy events" to "feat: enhance http service with request throttling logic based on all workers busy events" on Aug 4, 2025
github-actions bot commented Sep 6, 2025

This PR is stale because it has been open 30 days with no activity. Remove the Stale label or comment, or this will be closed in 5 days.

github-actions bot added the Stale label Sep 6, 2025
@github-actions

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

github-actions bot closed this Sep 11, 2025

Labels

external-contribution (pull request is from an external contributor), feat, size/XXL, Stale
