feat: enhance http service with request throttling logic based on all workers busy events #2166
Conversation
👋 Hi jorgeantonio21! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

This change introduces configurable queue depth and rate limiting for the KV router and HTTP service. New command-line flags and configuration parameters limit the number of queued requests when all workers are busy, trigger event-based rate limiting, and propagate these settings through the frontend, router, engine, and HTTP layers. The HTTP service now enforces rate limiting, exposes new metrics, and documents the behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant Router
    participant Scheduler
    participant Worker
    participant HTTPService
    participant RateLimiter
    Client->>Frontend: Submit request
    Frontend->>Router: Forward request
    Router->>Scheduler: Schedule request
    alt All workers busy
        Scheduler->>Scheduler: Queue request (up to max depth)
        alt Queue full
            Scheduler->>Router: Publish "all workers busy" event
            Router->>HTTPService: Emit event
            HTTPService->>RateLimiter: Activate rate limiting (for time window)
            Client->>HTTPService: New request
            HTTPService->>RateLimiter: Check rate limit
            alt Rate limited
                HTTPService-->>Client: 429 Too Many Requests
            else Not rate limited
                HTTPService->>Scheduler: Forward request
            end
        else Queue not full
            Scheduler->>Scheduler: Wait, prioritize queued requests
        end
    else Worker available
        Scheduler->>Worker: Assign request
        Worker-->>Scheduler: Processed
        Scheduler-->>Router: Complete
    end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 5
🧹 Nitpick comments (3)
lib/llm/src/http/service/rate_limiter.rs (1)
37-49: Consider automatically clearing expired rate limit state.

The current implementation checks expiration but doesn't clear the expired state, which could lead to inconsistent internal state where `is_rate_limited` remains `true` but the method returns `false`. Consider this approach to maintain consistent state:

```diff
 pub fn is_rate_limited(&self) -> bool {
     if !self.is_rate_limited {
         return false;
     }
     // Check if rate limit duration has expired
     if let Some(start_time) = self.rate_limit_start {
         if start_time.elapsed() > self.rate_limit_duration {
-            return false;
+            // State has expired, but we can't mutate here due to immutable reference
+            // Consider adding a separate method or changing this to take &mut self
+            return false;
         }
     }
     true
 }
```
Alternatively, add a separate `check_and_clear_expired()` method that callers can use to maintain consistent state.
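As a rough illustration of that suggestion, here is a minimal sketch of such a method; the struct and field names (`is_rate_limited`, `rate_limit_start`, `rate_limit_duration`) are stand-ins taken from the snippet above, not the actual `HttpServiceRateLimiter` internals:

```rust
use std::time::{Duration, Instant};

// Stand-in for the limiter state discussed above; not the real struct.
struct RateLimitState {
    is_rate_limited: bool,
    rate_limit_start: Option<Instant>,
    rate_limit_duration: Duration,
}

impl RateLimitState {
    /// Checks whether rate limiting is active and clears the flag in place
    /// once the window has expired, keeping internal state consistent with
    /// what callers observe.
    fn check_and_clear_expired(&mut self) -> bool {
        if !self.is_rate_limited {
            return false;
        }
        if let Some(start_time) = self.rate_limit_start {
            if start_time.elapsed() > self.rate_limit_duration {
                // Window elapsed: reset so subsequent calls stay consistent.
                self.is_rate_limited = false;
                self.rate_limit_start = None;
                return false;
            }
        }
        true
    }
}
```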
lib/llm/src/entrypoint/input/http.rs (1)

121-132: Consider simplifying the function signature.

The function could be simplified by passing only the necessary data instead of the entire `EngineConfig`:

```diff
-fn start_rate_limiter_monitoring(
-    runtime: &DistributedRuntime,
-    engine_config: &EngineConfig,
-    rate_limiter: &HttpServiceRateLimiter,
-) -> anyhow::Result<()> {
-    let namespace = engine_config.local_model().endpoint_id().namespace.clone();
-    let namespace = runtime.namespace(namespace)?;
-    rate_limiter.start_monitoring(namespace)?;
-    Ok(())
-}
+fn start_rate_limiter_monitoring(
+    runtime: &DistributedRuntime,
+    namespace_name: &str,
+    rate_limiter: &HttpServiceRateLimiter,
+) -> anyhow::Result<()> {
+    let namespace = runtime.namespace(namespace_name.to_string())?;
+    rate_limiter.start_monitoring(namespace)?;
+    Ok(())
+}
```

Then update the call site:

```diff
-if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, &engine_config, &rate_limiter) {
+let namespace_name = engine_config.local_model().endpoint_id().namespace.as_str();
+if let Err(e) = start_rate_limiter_monitoring(&distributed_runtime, namespace_name, &rate_limiter) {
```
lib/llm/src/kv_router/scheduler.rs (1)

251-251: Consider making the retry delay configurable.

The fixed 5ms sleep might not be optimal for all deployment scenarios. Different workloads might benefit from different retry intervals. Consider making this configurable:

```diff
+    // TODO: Make retry delay configurable
     tokio::time::sleep(Duration::from_millis(5)).await;
```
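A hedged sketch of what a configurable delay could look like; the `SchedulerConfig` type and `queue_retry_delay_ms` field are hypothetical, not part of the current scheduler:

```rust
use std::time::Duration;

// Hypothetical configuration knob for the queued-request retry loop.
struct SchedulerConfig {
    queue_retry_delay_ms: u64,
}

impl Default for SchedulerConfig {
    fn default() -> Self {
        // Matches today's hard-coded 5ms behavior when not overridden.
        Self { queue_retry_delay_ms: 5 }
    }
}

async fn wait_before_retry(config: &SchedulerConfig) {
    tokio::time::sleep(Duration::from_millis(config.queue_retry_delay_ms)).await;
}
```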
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- components/frontend/src/dynamo/frontend/main.py (3 hunks)
- components/router/src/main.rs (2 hunks)
- docs/components/router/README.md (2 hunks)
- launch/dynamo-run/src/flags.rs (2 hunks)
- launch/dynamo-run/src/lib.rs (1 hunk)
- lib/bindings/python/rust/llm/entrypoint.rs (5 hunks)
- lib/llm/src/discovery/model_manager.rs (1 hunk)
- lib/llm/src/entrypoint/input/http.rs (3 hunks)
- lib/llm/src/http/service.rs (1 hunk)
- lib/llm/src/http/service/metrics.rs (5 hunks)
- lib/llm/src/http/service/openai.rs (7 hunks)
- lib/llm/src/http/service/rate_limiter.rs (1 hunk)
- lib/llm/src/http/service/service_v2.rs (7 hunks)
- lib/llm/src/kv_router.rs (7 hunks)
- lib/llm/src/kv_router/scheduler.rs (6 hunks)
- lib/llm/src/local_model.rs (7 hunks)
🧰 Additional context used
🧠 Learnings (9)
lib/llm/src/discovery/model_manager.rs (4)
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/llm/src/http/service.rs (1)
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
docs/components/router/README.md (2)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
components/router/src/main.rs (4)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
lib/llm/src/kv_router/scheduler.rs (9)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
lib/llm/src/kv_router.rs (4)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
lib/llm/src/http/service/rate_limiter.rs (1)
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/llm/src/http/service/metrics.rs (1)
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/bindings/python/rust/llm/entrypoint.rs (4)
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
🧬 Code Graph Analysis (2)
lib/llm/src/http/service.rs (1)
- lib/llm/src/http/service/service_v2.rs (1)
  - rate_limiter (44-46)

lib/llm/src/entrypoint/input/http.rs (4)
- lib/llm/src/http/service/service_v2.rs (1)
  - rate_limiter (44-46)
- lib/bindings/python/src/dynamo/_core.pyi (2)
  - DistributedRuntime (30-53)
  - namespace (37-41)
- lib/llm/src/http/service/openai.rs (2)
  - chat_completions (459-584)
  - completions (247-342)
- lib/bindings/python/rust/lib.rs (1)
  - namespace (272-277)
🪛 GitHub Actions: Copyright Checks
lib/llm/src/http/service/rate_limiter.rs
[error] 1-1: Invalid or missing copyright header detected by copyright checker. File is out of compliance with required SPDX copyright and license header.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (38)
launch/dynamo-run/src/lib.rs (1)
44-44: LGTM! The addition of the `all_workers_busy_rejection_time_window` configuration to the `LocalModelBuilder` is consistent with the rate limiting enhancements described in the PR objectives. The parameter is properly sourced from the command-line flags and integrated into the builder pattern.

lib/llm/src/http/service.rs (1)
27-27: LGTM! Adding the `rate_limiter` module to the public API is necessary for exposing the new rate limiting functionality. This aligns with the system-wide rate limiting enhancements and allows other components to access the HTTP service rate limiter.

components/router/src/main.rs (2)
52-54: LGTM! The new `max_workers_busy_queue_depth` parameter is well-documented and has a reasonable default value. This configuration allows fine-tuning of when the "all workers busy" event should be triggered based on queue depth.

73-73: LGTM! The `max_workers_busy_queue_depth` parameter is correctly passed to the `KvRouter::new` constructor, ensuring the queue depth configuration is properly propagated to the router implementation.

lib/llm/src/discovery/model_manager.rs (2)
216-216: Good refactoring! Moving the `unwrap_or_default()` call to the start of the function eliminates redundant unwrapping and improves code readability. This is a clean improvement to the existing logic.

222-222: LGTM! The `max_workers_busy_queue_depth` parameter is correctly passed to the `KvRouter::new` constructor, ensuring the new queue depth configuration is properly integrated into the KV router creation process.

launch/dynamo-run/src/flags.rs (3)
135-137: LGTM! The `max_workers_busy_queue_depth` flag is well-documented and has a reasonable default value of 5. This allows users to configure when the "all workers busy" event should be triggered based on queue depth.

139-141: LGTM! The `all_workers_busy_rejection_time_window` flag is appropriately defined as an optional parameter, allowing users to configure the time window for rejecting requests when all workers are busy. The optional nature provides flexibility in configuration.

191-191: LGTM! The `max_workers_busy_queue_depth` parameter is correctly integrated into the `KvRouterConfig` construction, ensuring the queue depth configuration is properly propagated through the router configuration system.
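For illustration only, the two flags described above could be declared with clap's derive API roughly like this; the exact attribute syntax and surrounding struct in flags.rs may differ:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Flags {
    /// Maximum number of queued requests before the "all workers busy" event is published.
    #[arg(long, default_value_t = 5)]
    max_workers_busy_queue_depth: usize,

    /// Optional time window (seconds) during which new requests are rejected after
    /// an "all workers busy" event; rate limiting is disabled when unset.
    #[arg(long)]
    all_workers_busy_rejection_time_window: Option<u64>,
}
```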
components/frontend/src/dynamo/frontend/main.py (4)

62-67: LGTM! The command-line argument is properly configured with a sensible default value and clear documentation.

74-79: LGTM! The rejection time window argument is well-defined with appropriate type and default value. The help text clearly specifies the unit as seconds.

101-101: LGTM! The parameter is correctly passed to `KvRouterConfig` when in KV router mode.

114-114: LGTM! The rejection time window parameter is correctly added to the kwargs for propagation to the engine configuration.
lib/llm/src/http/service/metrics.rs (5)
35-35: LGTM! New rate limiting metric field added correctly. The `rate_limited_requests_count` field follows the established pattern and naming convention for metrics in this struct.

193-210: LGTM! Rate limiting metric properly initialized. The metric initialization follows the established pattern with:
- Consistent naming convention using the prefix
- Clear descriptive text
- Appropriate labels (model, endpoint, request_type), excluding status, which makes sense for rate limiting

258-271: LGTM! Getter method implemented correctly. The method follows the established pattern with proper documentation, parameter types, and label handling consistent with other metric accessors.

273-286: LGTM! Increment method implemented correctly. The method follows the established pattern with proper documentation and implementation. The public visibility is appropriate since this needs to be called from HTTP request handlers.

309-309: LGTM! Metric properly registered. The rate limiting metric is correctly registered with the Prometheus registry following the same pattern as other metrics.
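As a sketch of what registering such a counter typically looks like with the prometheus crate (the metric name, help text, and prefix handling here are assumptions, not copied from metrics.rs):

```rust
use prometheus::{IntCounterVec, Opts, Registry};

fn register_rate_limited_counter(registry: &Registry) -> prometheus::Result<IntCounterVec> {
    // Counter labeled by model, endpoint, and request_type, as described above.
    let counter = IntCounterVec::new(
        Opts::new(
            "rate_limited_requests_count",
            "Requests rejected while rate limiting was active",
        ),
        &["model", "endpoint", "request_type"],
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}
```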
lib/llm/src/local_model.rs (4)
50-50: LGTM! New configuration field added correctly. The `all_workers_busy_rejection_time_window` field is properly typed as `Option<u64>` and correctly initialized with the default value in the `Default` implementation. Also applies to: 66-66

124-130: LGTM! Builder method implemented correctly. The method follows the established builder pattern with proper parameter handling and method chaining support.

239-239: LGTM! Field and getter properly implemented. The field is correctly added to the `LocalModel` struct and the getter method follows the established naming convention and access pattern. Also applies to: 271-273

168-168: LGTM! Field properly propagated in all build paths. The `all_workers_busy_rejection_time_window` field is correctly propagated to the `LocalModel` instance in both the echo engine path and the main model loading path, ensuring consistency. Also applies to: 226-226
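A minimal sketch of the builder-method pattern being referenced; the surrounding `LocalModelBuilder` fields are elided and the owned-`self` chaining signature is an assumption:

```rust
#[derive(Default)]
struct LocalModelBuilder {
    all_workers_busy_rejection_time_window: Option<u64>,
    // ...other builder fields elided...
}

impl LocalModelBuilder {
    /// Sets the optional rejection time window (in seconds) and returns the
    /// builder to allow chaining.
    fn all_workers_busy_rejection_time_window(mut self, window_secs: Option<u64>) -> Self {
        self.all_workers_busy_rejection_time_window = window_secs;
        self
    }
}
```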
lib/bindings/python/rust/llm/entrypoint.rs (3)
37-49: LGTM! KV router queue depth parameter added correctly. The `max_workers_busy_queue_depth` parameter is properly integrated with:
- Appropriate default value of 5 in the pyo3 signature
- Correct type mapping to `usize`
- Proper propagation to the inner Rust configuration struct

98-98: LGTM! Rejection time window parameter added correctly. The `all_workers_busy_rejection_time_window` parameter is properly integrated with:
- Correct optional type `Option<u64>`
- Appropriate None default in the pyo3 signature
- Proper propagation through the constructor and struct initialization

Also applies to: 117-117, 131-131, 153-153

181-182: LGTM! Parameter properly propagated to LocalModelBuilder. The `all_workers_busy_rejection_time_window` parameter is correctly passed to the `LocalModelBuilder` via the appropriate builder method, completing the configuration flow from Python to Rust.

lib/llm/src/http/service/service_v2.rs (4)
12-12: LGTM! Rate limiter properly integrated into HTTP service state. The `HttpServiceRateLimiter` is correctly integrated with:
- Proper import and field addition to the State struct
- Updated constructor to accept the rate limiter
- Both reference and clone accessor methods for flexible usage patterns

Also applies to: 23-23, 27-31, 44-46, 52-54

99-100: LGTM! Configuration parameter properly added. The `all_workers_busy_rejection_time_window` parameter is correctly integrated into the `HttpServiceConfig` with proper optional typing and a builder method that follows the established pattern. Also applies to: 252-258

176-181: LGTM! Rate limiter properly initialized. The `HttpServiceRateLimiter` is correctly initialized with:
- Proper conversion from `Option<u64>` seconds to `Option<Duration>`
- Clean integration into the State constructor
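The conversion mentioned above is essentially a one-liner; a sketch, assuming the config stores the window as whole seconds:

```rust
use std::time::Duration;

fn rejection_window(window_secs: Option<u64>) -> Option<Duration> {
    // None keeps rate limiting disabled; Some(n) becomes an n-second window.
    window_secs.map(Duration::from_secs)
}
```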
116-118: LGTM! Service-level rate limiter accessor added. The `rate_limiter_clone` method provides convenient access to the rate limiter at the service level, following the established pattern for other shared resources.

lib/llm/src/http/service/openai.rs (5)
41-41: LGTM! Rate limiting helper functions well implemented. The helper functions provide clean abstractions:
- `too_many_requests_error` follows the established error response pattern
- `check_rate_limited` provides consistent rate limit checking with proper metrics tracking
- `extract_request_type` cleanly converts the boolean streaming flag to an enum

Also applies to: 91-101, 170-193
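A rough sketch of the early-check pattern these helpers enable; the types and return shape below are stand-ins inferred from the review, not the actual openai.rs signatures:

```rust
use http::StatusCode;

// Stand-in for the request-type enum the streaming flag maps onto.
enum RequestType {
    Unary,
    Stream,
}

fn extract_request_type(streaming: bool) -> RequestType {
    if streaming {
        RequestType::Stream
    } else {
        RequestType::Unary
    }
}

// Performed at the top of each handler: bail out with 429 before any engine work.
fn check_rate_limited(rate_limit_active: bool) -> Result<(), StatusCode> {
    if rate_limit_active {
        // The real helper would also increment the rate-limited request metric here.
        return Err(StatusCode::TOO_MANY_REQUESTS);
    }
    Ok(())
}
```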
211-218: LGTM! Rate limiting properly integrated in completions handler. The rate limiting check is well-positioned early in the request lifecycle, using the helper functions for consistent behavior and proper metrics tracking.

352-358: LGTM! Rate limiting properly integrated in embeddings handler. The rate limiting check correctly uses `RequestType::Unary` for embeddings and follows the same early-check pattern as other handlers.

417-424: LGTM! Rate limiting properly integrated in chat completions handler. The rate limiting check properly extracts the request type from the streaming flag and maintains consistency with other handlers that support streaming.

649-656: LGTM! Rate limiting properly integrated in responses handler. The rate limiting check follows the established pattern with proper request type extraction and early placement in the request lifecycle.

lib/llm/src/entrypoint/input/http.rs (1)

38-41: LGTM! Good approach to graceful degradation. The error handling ensures the HTTP service continues to function even if rate limiter monitoring fails to start.
docs/components/router/README.md (1)
45-49: LGTM! Clear and comprehensive documentation. The documentation clearly explains the new queue management feature, including the command-line argument, queue behavior, and integration with the rate limiting system via NATS events.
Also applies to: 76-84
lib/llm/src/kv_router.rs (1)
53-53: LGTM! Well-structured implementation. The new queue depth configuration is properly integrated throughout the KV router, with sensible defaults and proper parameter propagation to the scheduler.
Also applies to: 78-78, 88-88, 101-101, 110-111, 156-156, 190-190
lib/llm/src/kv_router/scheduler.rs (1)
243-249: Consider the queue reordering behavior. The current logic moves already-waiting requests to the front of the queue when they fail to schedule again. This could lead to starvation of requests further back in the queue if a particular request repeatedly fails to schedule.
Consider whether this is the intended behavior or if failed requests should maintain their position in the queue.
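If preserving queue order is the goal, one alternative (a sketch with a placeholder `Request` type, not the scheduler's real data structures) is to leave an unschedulable head request where it is instead of popping and re-inserting it:

```rust
use std::collections::VecDeque;

struct Request; // placeholder for the scheduler's queued request type

/// Pops the head request only when it can actually be scheduled; otherwise it
/// stays at its position, so no request jumps ahead of another on retry.
fn try_schedule_head(queue: &mut VecDeque<Request>, can_schedule: bool) -> Option<Request> {
    if can_schedule {
        queue.pop_front()
    } else {
        None // head keeps its place; retry on the next scheduling pass
    }
}
```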
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.
Overview:
This PR implements a comprehensive request throttler system for the HTTP service that automatically activates when all KV router workers are busy. The system includes a configurable queue depth, NATS event-driven request throttling, and graceful degradation when request throttling is unavailable.
Details:
Core Request Throttler Implementation:
HTTP Service Integration:
- `all_workers_busy_rejection_time_window` parameter
Scheduler Queue Implementation:
Service Lifecycle Management:
Configuration & Documentation:
Where should the reviewer start?
Primary files to review:
Secondary files:
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation