Conversation


@keivenchang commented Sep 6, 2025

Overview:

This PR implements HTTP queue metrics for frontend request tracking, making it possible to monitor request-processing pipeline performance and to detect engine pool exhaustion.

Details:

  • Add a new HTTP queue gauge metric to track requests waiting in the HTTP processing queue
  • Implement an HttpQueueGuard RAII object that automatically tracks queue time from HTTP handler entry until processing begins (see the sketch after this list)
  • Integrate HTTP queue tracking into the OpenAI chat completions handler
  • Add the HTTP_QUEUE Prometheus metric name constant
  • Track time from request arrival until engine processing begins to identify bottlenecks
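
As a rough illustration of the RAII pattern above (a minimal sketch assuming a prometheus IntGaugeVec labeled by model; names are simplified relative to the real implementation in lib/llm/src/http/service/metrics.rs):

    use std::sync::Arc;
    use prometheus::IntGaugeVec;

    /// Sketch: increments the queue gauge on creation and decrements it on
    /// drop, so every exit path (success, error, early return) is counted.
    pub struct HttpQueueGuard {
        gauge: Arc<IntGaugeVec>,
        model: String,
    }

    impl HttpQueueGuard {
        fn new(gauge: Arc<IntGaugeVec>, model: String) -> Self {
            gauge.with_label_values(&[model.as_str()]).inc(); // request enters the HTTP queue
            Self { gauge, model }
        }
    }

    impl Drop for HttpQueueGuard {
        fn drop(&mut self) {
            // Request leaves the queue: processing begins or the handler bails out.
            self.gauge.with_label_values(&[self.model.as_str()]).dec();
        }
    }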

Where should the reviewer start?

  • lib/llm/src/http/service/metrics.rs - New HTTP queue gauge and HttpQueueGuard implementation
  • lib/llm/src/http/service/openai.rs - Integration of HTTP queue tracking in chat completions
  • lib/runtime/src/metrics/prometheus_names.rs - New metric name constant

Related Issues:

Relates to DIS-564, DIS-563

Summary by CodeRabbit

  • New Features

    • Added an HTTP queue metric to monitor the number of requests waiting before processing, exposed via Prometheus with per-model labeling.
    • Improved timing accuracy for request lifecycle metrics.
    • Instrumentation updates in chat completions ensure queue time is measured until processing begins, including streaming and non-streaming paths.
  • Documentation

    • Clarified the definition of the inflight requests metric to indicate it covers requests sent to the engine (e.g., vLLM, SGLang).

@keivenchang requested a review from a team as a code owner September 6, 2025 03:08
github-actions bot added the feat label Sep 6, 2025

coderabbitai bot commented Sep 6, 2025

Walkthrough

Adds a new HTTP queue gauge metric and RAII guard to track queued time before processing, wires it into Metrics, and instruments chat_completions to create and drop the guard at appropriate points. Introduces a new Prometheus metric name and minor doc tweaks. Derives PartialEq for Status and slightly adjusts inflight duration measurement.

Changes

Cohort / File(s) Summary
HTTP metrics: queue gauge & guard
lib/llm/src/http/service/metrics.rs
Adds http_queue_gauge to Metrics; introduces public HttpQueueGuard with RAII increment/decrement; exposes create_http_queue_guard(…) on Metrics; derives PartialEq for Status; adjusts InflightGuard drop to use a computed duration variable.
OpenAI handler instrumentation
lib/llm/src/http/service/openai.rs
Creates HttpQueueGuard at chat_completions start; drops it immediately before processing begins in both streaming and non‑streaming paths; removes a trace log.
Prometheus metric name
lib/runtime/src/metrics/prometheus_names.rs
Adds frontend_service::HTTP_QUEUE constant and description; clarifies INFLIGHT_REQUESTS docs.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant C as Client
    participant H as HTTP Handler (chat_completions)
    participant M as Metrics (HttpQueueGauge)
    participant E as Engine/Processing

    C->>H: POST /chat/completions
    H->>M: create_http_queue_guard(model)
    Note right of M: Gauge++ (queued request)

    H-->>H: Waits / prepares request

    alt Streaming
        H->>M: drop(HttpQueueGuard)
        Note right of M: Gauge-- (processing begins)
        H->>E: Start streaming generation
        E-->>H: Stream tokens
        H-->>C: Stream response
    else Non-streaming
        H->>M: drop(HttpQueueGuard)
        Note right of M: Gauge-- (processing begins)
        H->>E: Generate completion
        E-->>H: Completion
        H-->>C: JSON response
    end
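
In code, the flow above corresponds roughly to the following handler skeleton (a sketch; only create_http_queue_guard comes from the actual diff, the surrounding shape is illustrative):

    async fn chat_completions(metrics: Arc<Metrics>, model: &str) {
        // Gauge++ : the request counts as queued from handler entry.
        let queue_guard = metrics.clone().create_http_queue_guard(model);

        // ... parse and validate the request, wait for engine capacity ...

        // Gauge-- : queue wait ends just before generation begins,
        // on both the streaming and non-streaming paths.
        drop(queue_guard);

        // ... call the engine and stream or return the response ...
    }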

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I queued with a twitch of a whisker thin,
Counting requests with a gentle grin.
Gauge goes up, then hops back down—
When work begins, I’m out of town.
Metrics nibble, carrots tall,
Streaming or not, I watch them all. 🥕🐇



coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
lib/runtime/src/metrics/prometheus_names.rs (1)

48-53: Clarify HTTP_QUEUE semantics (and consider name)

Current doc suggests “waiting in HTTP queue before receiving a response,” but the guard is dropped right before stream processing begins, after engine.generate() returns. Clarify that it tracks “from handler entry until engine stream is ready.” Optionally consider http_queue_depth for Prometheus naming clarity; otherwise keep as-is but tighten the help text elsewhere.

Apply doc tweak:

-    /// Number of requests waiting in HTTP queue before receiving a response.
-    /// This can measure the engine pool exhaustion if the engine is not able to process requests fast enough.
+    /// Number of requests waiting in the frontend HTTP queue (from handler entry until the engine stream is ready).
+    /// Useful to detect engine pool exhaustion if the engine cannot accept work promptly.
     pub const HTTP_QUEUE: &str = "http_queue";
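
For reference, a scrape of the resulting gauge would look roughly like this (hypothetical output; the exported name may carry a service prefix, and the label value is illustrative):

    # HELP http_queue Number of requests in HTTP processing queue
    # TYPE http_queue gauge
    http_queue{model="my-model"} 3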
lib/llm/src/http/service/metrics.rs (3)

29-34: RAII guard: document invariants and consider Debug derive

Add a brief note that the guard must not be cloned/moved across processes and that model label is lowercased/sanitized upstream. Deriving Debug helps during tracing without impacting behavior.

-pub struct HttpQueueGuard {
+#[derive(Debug)]
+pub struct HttpQueueGuard {

145-153: Gauge help string: align with actual lifecycle

Help says “in HTTP processing queue,” but drop occurs when stream is about to be processed. Suggest tightening wording to avoid confusion with inflight metric.

-                "Number of requests in HTTP processing queue",
+                "Requests waiting from HTTP handler entry until engine stream is ready",

335-339: Public factory is fine; add tiny guard-rail for empty model

If upstream ever passes an empty model, we’ll emit series with model="". Add a debug assert to catch this early.

 pub fn create_http_queue_guard(self: Arc<Self>, model: &str) -> HttpQueueGuard {
+    debug_assert!(!model.is_empty(), "model label should not be empty");
     HttpQueueGuard::new(self, model.to_string().to_lowercase())
 }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and 152131a.

📒 Files selected for processing (3)
  • lib/llm/src/http/service/metrics.rs (8 hunks)
  • lib/llm/src/http/service/openai.rs (3 hunks)
  • lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/llm/src/http/service/openai.rs (1)
lib/llm/src/http/service/metrics.rs (3)
  • drop (351-354)
  • drop (387-408)
  • drop (512-518)
lib/llm/src/http/service/metrics.rs (2)
lib/llm/src/http/service/service_v2.rs (1)
  • new (74-86)
lib/runtime/src/pipeline/context.rs (1)
  • registry (257-259)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (9)
lib/llm/src/http/service/metrics.rs (7)

21-21: New gauge field looks good

Field added and used consistently.


75-75: Deriving PartialEq on Status is fine

Helpful for tests and comparisons.


285-292: Inc/dec helpers are correct

Label cardinality (per model) matches existing metrics.


293-301: Registered with Registry

Good—no orphaned metric vectors.


341-348: Constructor order is good

Increment first; state lives in guard. No issues.


350-355: Drop semantics correct

Ensures decrement on all early-returns.


388-408: Micro-optimization to store duration

Good change; avoids double Clock reads.
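
The pattern being praised, sketched for clarity (field and method names here are illustrative, not the actual API):

    impl Drop for InflightGuard {
        fn drop(&mut self) {
            // Read the clock once and reuse the value, rather than calling
            // elapsed() separately for each observation.
            let duration = self.timer.elapsed();
            self.metrics
                .request_duration
                .with_label_values(&[self.model.as_str()])
                .observe(duration.as_secs_f64());
        }
    }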

lib/llm/src/http/service/openai.rs (2)

550-550: Drop point for streaming path is appropriate

Ends “queue wait” just before arming SSE. Works with the revised creation point.


566-566: Drop point for non-streaming path is appropriate

Symmetric with streaming branch.

@nnshah1 left a comment

LGTM - a couple of thoughts:

  1. Is there another spot for the response functions? Maybe response_handler.rs; having them in metrics.rs seems a little unintuitive.

  2. Could we make the names shorter? (Not sure what would work here.) Maybe a single method with flags for the behavior? Not sure that makes sense.

- Move http_queue_guard creation to beginning of request handlers
- Move inflight_guard creation after engine.generate() call
- Rename processing functions for clarity (process_streaming_with_*)

Signed-off-by: Keiven Chang <[email protected]>
- Move InflightGuard creation before engine.generate() calls for proper error counting
- Move EventConverter and related functions from openai.rs to metrics.rs
- Rename functions:
  - process_streaming_with_metrics -> process_response_and_observe_metrics
  - process_response_using_streaming_event_converter -> process_response_using_event_converter_and_observe_metrics
- Consolidate get_parsing_options function in metrics.rs
- Update grpc/service/openai.rs to use shared metrics functions
- Add documentation for HTTP queue vs inflight metrics

Signed-off-by: Keiven Chang <[email protected]>
Ensures consistent model labels across all metrics for the same request.

Signed-off-by: Keiven Chang <[email protected]>
- Move get_parsing_options from grpc service to http metrics module
- Rename HTTP_QUEUE to HTTP_QUEUED_REQUESTS for clarity
- Update comment to specify 'first response' timing

Signed-off-by: Keiven Chang <[email protected]>
- Resolved merge conflict in lib/llm/src/http/service/metrics.rs by combining client_disconnect_gauge and http_queue_gauge fields
- Updated metric names to use consistent naming convention (added _total suffix)
- Moved detailed metrics documentation to deploy/metrics/README.md
- Added comprehensive documentation for HTTP queue vs inflight metrics distinction
- Updated prometheus_names.rs with centralized metric name constants

Signed-off-by: Keiven Chang <[email protected]>
- Move get_parsing_options function from lib/llm/src/http/service/metrics.rs to lib/llm/src/discovery/model_manager.rs as a method
- Update all call sites in openai.rs and kserve.rs to use the new ModelManager method
- Remove the old function from metrics.rs
- Improves code organization by placing parsing options functionality with ModelManager where it belongs
- All tests pass, functionality remains unchanged

Signed-off-by: Keiven Chang <[email protected]>
- Add comprehensive naming convention guidelines with prefix/name/suffix structure
- Include common transformations from vague suffixes to specific ones
- Provide clear examples of correct vs incorrect naming patterns
- Document avoidance of vague suffixes like _counter, _gauge, _time, _size, _rate
- Add visual indicators (❌/✅) for easy reference
- Ensure consistency with Prometheus best practices

Signed-off-by: Keiven Chang <[email protected]>
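
As an illustration of the convention described in this commit (hypothetical constants, not taken from the file):

    // ❌ vague suffixes: what does the number actually measure?
    pub const REQUEST_COUNTER: &str = "request_counter";
    pub const QUEUE_SIZE: &str = "queue_size";

    // ✅ specific units and Prometheus-style suffixes
    pub const REQUESTS_TOTAL: &str = "requests_total";
    pub const QUEUED_REQUESTS: &str = "queued_requests";
    pub const REQUEST_DURATION_SECONDS: &str = "request_duration_seconds";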
@keivenchang merged commit 9fa5450 into main Sep 16, 2025
15 of 17 checks passed
@keivenchang deleted the keivenchang/DIS-564__DIS-563__NIM-add-FE-metrics-part1 branch September 16, 2025 17:13
kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
