feat: add HTTP queue metrics for NIM frontend request tracking #2914
Conversation
Walkthrough

Adds a new HTTP queue gauge metric and an RAII guard to track queued time before processing, wires it into Metrics, and instruments chat_completions to create and drop the guard at the appropriate points. Introduces a new Prometheus metric name and makes minor doc tweaks. Derives PartialEq for Status and slightly adjusts the inflight duration measurement.
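For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a gauge-backed RAII guard using the prometheus crate. QueueGuard and the metric name are illustrative stand-ins, not the PR's actual identifiers.

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// Illustrative guard: increments a per-model gauge on creation and
// decrements it when dropped, on every exit path.
struct QueueGuard {
    gauge: IntGaugeVec,
    model: String,
}

impl QueueGuard {
    fn new(gauge: IntGaugeVec, model: String) -> Self {
        gauge.with_label_values(&[&model]).inc(); // request enters the queue
        QueueGuard { gauge, model }
    }
}

impl Drop for QueueGuard {
    fn drop(&mut self) {
        // Runs even on early returns and errors, so the gauge never leaks.
        self.gauge.with_label_values(&[&self.model]).dec();
    }
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let gauge = IntGaugeVec::new(
        Opts::new("http_queue", "Requests queued before processing"),
        &["model"],
    )?;
    registry.register(Box::new(gauge.clone()))?;

    let guard = QueueGuard::new(gauge, "example-model".to_string());
    // ...request preparation happens while the gauge is elevated...
    drop(guard); // gauge decremented exactly once
    Ok(())
}
```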
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Client
participant H as HTTP Handler (chat_completions)
participant M as Metrics (HttpQueueGauge)
participant E as Engine/Processing
C->>H: POST /chat/completions
H->>M: create_http_queue_guard(model)
Note right of M: Gauge++ (queued request)
H-->>H: Waits / prepares request
alt Streaming
H->>M: drop(HttpQueueGuard)
Note right of M: Gauge-- (processing begins)
H->>E: Start streaming generation
E-->>H: Stream tokens
H-->>C: Stream response
else Non-streaming
H->>M: drop(HttpQueueGuard)
Note right of M: Gauge-- (processing begins)
H->>E: Generate completion
E-->>H: Completion
H-->>C: JSON response
end
```
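In code, the flow in the diagram might look roughly like the following. This is a hedged, dependency-free sketch: Metrics, HttpQueueGuard, and generate() here are simplified stand-ins, not the real axum handler or engine types.

```rust
use std::sync::Arc;

struct Metrics;
struct HttpQueueGuard;

impl Metrics {
    fn create_http_queue_guard(self: Arc<Self>, _model: &str) -> HttpQueueGuard {
        println!("gauge++ (request queued)");
        HttpQueueGuard
    }
}

impl Drop for HttpQueueGuard {
    fn drop(&mut self) {
        println!("gauge-- (processing begins)");
    }
}

// Stand-in for the engine call that yields a response stream.
fn generate() -> Vec<&'static str> {
    vec!["token-1", "token-2"]
}

fn main() {
    let metrics = Arc::new(Metrics);

    // Gauge++ at handler entry: the request counts as queued.
    let queue_guard = metrics.create_http_queue_guard("example-model");

    // Queue time covers everything until the engine's stream is ready.
    let stream = generate();

    // Gauge-- just before consuming the stream (both branches in the diagram).
    drop(queue_guard);

    for token in stream {
        println!("streaming {token}");
    }
}
```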
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (4)
lib/runtime/src/metrics/prometheus_names.rs (1)
**48-53: Clarify HTTP_QUEUE semantics (and consider the name)**

The current doc suggests "waiting in HTTP queue before receiving a response," but the guard is dropped right before stream processing begins, after engine.generate() returns. Clarify that it tracks "from handler entry until the engine stream is ready." Optionally consider `http_queue_depth` for Prometheus naming clarity; otherwise keep the name as-is but tighten the help text elsewhere.

Apply doc tweak:

```diff
-    /// Number of requests waiting in HTTP queue before receiving a response.
-    /// This can measure the engine pool exhaustion if the engine is not able to process requests fast enough.
+    /// Number of requests waiting in the frontend HTTP queue (from handler entry until the engine stream is ready).
+    /// Useful to detect engine pool exhaustion if the engine cannot accept work promptly.
     pub const HTTP_QUEUE: &str = "http_queue";
```

lib/llm/src/http/service/metrics.rs (3)
**29-34: RAII guard: document invariants and consider a Debug derive**

Add a brief note that the guard must not be cloned or moved across processes and that the model label is lowercased/sanitized upstream. Deriving `Debug` helps during tracing without impacting behavior.

```diff
-pub struct HttpQueueGuard {
+#[derive(Debug)]
+pub struct HttpQueueGuard {
```
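As a usage note, the payoff of the derive is that the guard can be recorded directly in tracing fields via the `?` (Debug) shorthand. A hypothetical call site, assuming the tracing crate:

```rust
// Hypothetical call site: `?guard` captures the guard's Debug output.
let guard = metrics.clone().create_http_queue_guard(&model);
tracing::debug!(queue_guard = ?guard, "request queued");
```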
**145-153: Gauge help string: align with actual lifecycle**

The help text says "in HTTP processing queue," but the drop occurs when the stream is about to be processed. Suggest tightening the wording to avoid confusion with the inflight metric.

```diff
-            "Number of requests in HTTP processing queue",
+            "Requests waiting from HTTP handler entry until engine stream is ready",
```
**335-339: Public factory is fine; add a tiny guard-rail for an empty model**

If upstream ever passes an empty model, we'll emit series with model="". Add a debug assert to catch this early.

```diff
 pub fn create_http_queue_guard(self: Arc<Self>, model: &str) -> HttpQueueGuard {
+    debug_assert!(!model.is_empty(), "model label should not be empty");
     HttpQueueGuard::new(self, model.to_string().to_lowercase())
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- lib/llm/src/http/service/metrics.rs (8 hunks)
- lib/llm/src/http/service/openai.rs (3 hunks)
- lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/llm/src/http/service/openai.rs (1)
- lib/llm/src/http/service/metrics.rs (3)
  - drop (351-354)
  - drop (387-408)
  - drop (512-518)

lib/llm/src/http/service/metrics.rs (2)
- lib/llm/src/http/service/service_v2.rs (1)
  - new (74-86)
- lib/runtime/src/pipeline/context.rs (1)
  - registry (257-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (9)
lib/llm/src/http/service/metrics.rs (7)

**21-21: New gauge field looks good**
Field added and used consistently.

**75-75: Deriving PartialEq on Status is fine**
Helpful for tests and comparisons.

**285-292: Inc/dec helpers are correct**
Label cardinality (per model) matches existing metrics.

**293-301: Registered with Registry**
Good; no orphaned metric vectors.

**341-348: Constructor order is good**
Increment first; state lives in the guard. No issues.

**350-355: Drop semantics correct**
Ensures decrement on all early returns.

**388-408: Micro-optimization to store duration**
Good change; avoids double clock reads. See the sketch below.
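For context, an assumed sketch of the pattern that comment refers to (not the PR's exact code): the guard stores its start Instant and reads the clock once at drop time.

```rust
use std::time::{Duration, Instant};

struct InflightGuard {
    start: Instant,
}

impl InflightGuard {
    fn new() -> Self {
        InflightGuard { start: Instant::now() }
    }
}

impl Drop for InflightGuard {
    fn drop(&mut self) {
        // Single clock read; the duration is then reused wherever needed
        // (e.g., observed in a request-duration histogram).
        let duration: Duration = self.start.elapsed();
        println!("inflight for {duration:?}");
    }
}

fn main() {
    let _guard = InflightGuard::new();
    // request processing happens while the guard is alive
}
```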
lib/llm/src/http/service/openai.rs (2)

**550-550: Drop point for streaming path is appropriate**
Ends the "queue wait" just before arming SSE. Works with the revised creation point.

**566-566: Drop point for non-streaming path is appropriate**
Symmetric with the streaming branch.
nnshah1 left a comment:
LGTM - a couple of thoughts:

- Is there another spot for the response functions? Maybe response_handler.rs; having them in metrics.rs seems a little unintuitive.
- Could we make the names shorter? (Not sure what would work here.) Maybe a single method with flags for the behavior? Not sure that makes sense.
Signed-off-by: Keiven Chang <[email protected]>
- Move http_queue_guard creation to beginning of request handlers
- Move inflight_guard creation after engine.generate() call
- Rename processing functions for clarity (process_streaming_with_*)

Signed-off-by: Keiven Chang <[email protected]>
- Move InflightGuard creation before engine.generate() calls for proper error counting
- Move EventConverter and related functions from openai.rs to metrics.rs
- Rename functions:
  - process_streaming_with_metrics -> process_response_and_observe_metrics
  - process_response_using_streaming_event_converter -> process_response_using_event_converter_and_observe_metrics
- Consolidate get_parsing_options function in metrics.rs
- Update grpc/service/openai.rs to use shared metrics functions
- Add documentation for HTTP queue vs inflight metrics

Signed-off-by: Keiven Chang <[email protected]>
Ensures consistent model labels across all metrics for the same request.

Signed-off-by: Keiven Chang <[email protected]>
- Move get_parsing_options from grpc service to http metrics module
- Rename HTTP_QUEUE to HTTP_QUEUED_REQUESTS for clarity
- Update comment to specify 'first response' timing

Signed-off-by: Keiven Chang <[email protected]>
- Resolved merge conflict in lib/llm/src/http/service/metrics.rs by combining client_disconnect_gauge and http_queue_gauge fields
- Updated metric names to use consistent naming convention (added _total suffix)
- Moved detailed metrics documentation to deploy/metrics/README.md
- Added comprehensive documentation for HTTP queue vs inflight metrics distinction
- Updated prometheus_names.rs with centralized metric name constants

Signed-off-by: Keiven Chang <[email protected]>
Force-pushed from 1244cd3 to 2d02d3c
- Move get_parsing_options function from lib/llm/src/http/service/metrics.rs to lib/llm/src/discovery/model_manager.rs as a method
- Update all call sites in openai.rs and kserve.rs to use the new ModelManager method
- Remove the old function from metrics.rs
- Improves code organization by placing parsing options functionality with ModelManager where it belongs
- All tests pass, functionality remains unchanged

Signed-off-by: Keiven Chang <[email protected]>
- Add comprehensive naming convention guidelines with prefix/name/suffix structure
- Include common transformations from vague suffixes to specific ones
- Provide clear examples of correct vs incorrect naming patterns
- Document avoidance of vague suffixes like _counter, _gauge, _time, _size, _rate
- Add visual indicators (❌/✅) for easy reference
- Ensure consistency with Prometheus best practices

Signed-off-by: Keiven Chang <[email protected]>
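To illustrate the kind of transformation such guidelines typically prescribe (hypothetical examples following Prometheus best practices, not necessarily the exact ones added in this commit):
- ❌ request_time → ✅ request_duration_seconds
- ❌ payload_size → ✅ payload_size_bytes
- ❌ queue_gauge → ✅ queued_requests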
Signed-off-by: Keiven Chang <[email protected]>
Signed-off-by: Kristen Kelleher <[email protected]>
Overview:
This PR implements HTTP queue metrics for frontend request tracking to monitor request processing pipeline performance and detect engine pool exhaustion.
Details:
Where should the reviewer start?
- lib/llm/src/http/service/metrics.rs - New HTTP queue gauge and HttpQueueGuard implementation
- lib/llm/src/http/service/openai.rs - Integration of HTTP queue tracking in chat completions
- lib/runtime/src/metrics/prometheus_names.rs - New metric name constant

Related Issues:
Relates to DIS-564, DIS-563
Summary by CodeRabbit
New Features
Documentation