-
Notifications
You must be signed in to change notification settings - Fork 668
feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint #2008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint #2008
Conversation
…mespace now has its own metrics registry Others: Added NamespaceMetrics struct for namespace-level metrics management Refactored HTTP server - removed RuntimeMetrics, simplified to HttpServerState with direct registry access Updated metric naming - changed from single underscore (test_metric) to double underscore (test__metric) for Prometheus compatibility Enhanced hierarchical registries - improved parent-child registry relationships with create_child_registry() method Updated system stats example - aligned component/endpoint names and registry usage patterns
WalkthroughThis update introduces a hierarchical metrics registry framework to the runtime, enabling Prometheus metrics collection and registration at multiple levels (runtime, namespace, component, endpoint). New example projects and a system stats endpoint are added, along with trait implementations to support hierarchical metric naming and registration. HTTP server metrics and uptime logic are refactored to use the new registry. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as system_stats_client
participant Runtime as DistributedRuntime
participant Namespace
participant Component
participant Endpoint
participant Server as system_stats_server
Client->>Runtime: Initialize from config
Client->>Namespace: Access namespace "system"
Client->>Component: Access component
Client->>Endpoint: Get client for endpoint
Client->>Endpoint: Wait for endpoint instances
Client->>Endpoint: Send "hello world" via PushRouter
Endpoint-->>Client: Stream responses (Annotated<String>)
Client->>Component: Scrape stats periodically
Component-->>Client: Return MyStats
Client->>Runtime: Shutdown
Server->>Runtime: Initialize from config
Server->>Namespace: Create namespace
Server->>Component: Create component
Server->>Endpoint: Create endpoint, attach ingress handler
Endpoint->>Server: Handle requests, collect metrics
Endpoint->>Server: Serve stats via handler
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (12)
lib/runtime/examples/system_stats_endpoint/README.md (1)
1-3: README is only a stub – please add minimal usage & build instructionsA single-line heading is not sufficient for an example crate that ships in the repo. At minimum, outline:
• how to run the server & client binaries
• the metrics/Prometheus endpoints they expose
• any required env-vars (e.g. DYN_NAMESPACE)Completing this now saves future devs from hunting through source code.
components/metrics/src/main.rs (2)
34-36:PushRouterandAnnotatedimports are unused – will triggerdead_codewarningsUnless follow-up code uses them, remove these imports (or
#[allow(unused_imports)]behind a feature flag) to keepcargo checkclean.-use dynamo_runtime::{ - error, logging, - pipeline::PushRouter, - protocols::annotated::Annotated, +use dynamo_runtime::{ + error, logging,
178-179: Clarify comment vs. implementationThe added comment states KV hit-rate events are “not used in production”, yet the subscription is compiled unconditionally.
Either guard the block behind a feature (e.g.#[cfg(feature = "metrics-dev")]) or drop the subscription to avoid unnecessary NATS traffic in prod.lib/runtime/examples/Cargo.toml (1)
36-37: Duplicateprometheusversion declarations across workspaces
prometheusis added here and also at the rootCargo.toml(line 69). Keep a single declaration to avoid version drift; using only the root-level[workspace.dependencies]is sufficient.Cargo.toml (2)
23-27: Development-only example crates are committed as active workspace membersThe inline note says “Do not add this for production”, yet the paths are uncommented.
If crates.io publishing is planned, they’ll be bundled and increase compile time. Consider:[workspace] exclude = ["lib/runtime/examples/*"] # or members = ["lib/runtime", "..."] # omit examplesAlternatively wrap with
cfg(workspace_example)and enable only in CI.
68-70: Prometheus dependency: follow-up TODO should have an issue linkMoving Prometheus into
lib/runtimewill be a breaking change across multiple crates. Create a tracking issue and reference it here to avoid this TODO becoming stale.lib/runtime/src/distributed.rs (1)
219-231: Consider using tracing instead of println! for debug output.Debug methods in library code should use the logging framework rather than printing directly to stdout.
-/// Debug function to print out all the keys of prometheus_registries_by_prefix -pub fn debug_prometheus_registry_keys(&self) { +/// Debug function to log all the keys of prometheus_registries_by_prefix +pub fn debug_prometheus_registry_keys(&self) { let registries = self.prometheus_registries_by_prefix.lock().unwrap(); - println!("=== Prometheus Registry Keys ==="); + tracing::debug!("=== Prometheus Registry Keys ==="); if registries.is_empty() { - println!("No registries found"); + tracing::debug!("No registries found"); } else { for (key, _registry) in registries.iter() { - println!("Registry key: '{}'", key); + tracing::debug!("Registry key: '{}'", key); } } - println!("=== End Prometheus Registry Keys ==="); + tracing::debug!("=== End Prometheus Registry Keys ==="); }Alternatively, consider returning the keys as a
Vec<String>for programmatic access:pub fn get_prometheus_registry_keys(&self) -> Vec<String> { let registries = self.prometheus_registries_by_prefix.lock().unwrap(); registries.keys().cloned().collect() }lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)
126-133: Consider using tracing instead of println! in the stats handler.For consistency with logging practices, the stats handler should use the tracing framework.
.stats_handler(|stats| { - println!("stats: {:?}", stats); + tracing::debug!("stats: {:?}", stats); let stats = MyStats { val: 10 }; serde_json::to_value(stats).unwrap() })lib/runtime/src/profiling.rs (2)
192-204: Consider reducing mutex lock scope for better performance.The current implementation holds the mutex lock while cloning the registry. Consider releasing the lock earlier to reduce contention.
fn prometheus_metrics_fmt(&self) -> anyhow::Result<String> { - let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap(); - let prometheus_registry = registry - .entry(self.prefix()) - .or_insert(prometheus::Registry::new()) - .clone(); + let prometheus_registry = { + let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap(); + registry + .entry(self.prefix()) + .or_insert(prometheus::Registry::new()) + .clone() + }; // Lock is released here let metric_families = prometheus_registry.gather(); let encoder = prometheus::TextEncoder::new(); let mut buffer = Vec::new(); encoder.encode(&metric_families, &mut buffer)?; Ok(String::from_utf8(buffer)?) }
269-288: Consider reducing println! usage in tests.While println! statements can be helpful during test development, they can clutter test output in CI. Consider using them only when tests fail or behind a debug flag.
-// Print only the histogram section for inspection -println!("Histogram section:"); -for line in prometheus_output.lines() { - if line.contains("my_histogram") - || line.contains("# HELP testprefix_my_histogram") - || line.contains("# TYPE testprefix_my_histogram") - { - println!("{}", line); - } -} +// Optionally print histogram section for debugging +if std::env::var("DEBUG_TESTS").is_ok() { + println!("Histogram section:"); + for line in prometheus_output.lines() { + if line.contains("my_histogram") + || line.contains("# HELP testprefix_my_histogram") + || line.contains("# TYPE testprefix_my_histogram") + { + println!("{}", line); + } + } +}lib/runtime/src/http_server.rs (2)
107-107: Remove extra forward slash in commentThe comment has an extra forward slash that should be removed.
-// // Initialize the start time +// Initialize the start time
178-178: Remove commented code referencing old functionalityThis commented line references
DB_errorswhich appears to be from the old implementation and should be removed.- //let health = state.DB_errors <= 0;
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locklib/runtime/examples/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (16)
Cargo.toml(2 hunks)components/metrics/src/main.rs(2 hunks)lib/runtime/examples/Cargo.toml(2 hunks)lib/runtime/examples/system_stats_endpoint/Cargo.toml(1 hunks)lib/runtime/examples/system_stats_endpoint/README.md(1 hunks)lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs(1 hunks)lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs(1 hunks)lib/runtime/examples/system_stats_endpoint/src/lib.rs(1 hunks)lib/runtime/src/component.rs(3 hunks)lib/runtime/src/component/endpoint.rs(1 hunks)lib/runtime/src/component/namespace.rs(4 hunks)lib/runtime/src/distributed.rs(5 hunks)lib/runtime/src/http_server.rs(3 hunks)lib/runtime/src/lib.rs(2 hunks)lib/runtime/src/profiling.rs(1 hunks)lib/runtime/src/traits.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (10)
lib/runtime/src/component/endpoint.rs (3)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/src/traits.rs (2)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
components/metrics/src/main.rs (9)
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
lib/runtime/examples/system_stats_endpoint/src/lib.rs (3)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
lib/runtime/src/component/namespace.rs (5)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
lib/runtime/src/distributed.rs (4)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/src/lib.rs (3)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/runtime/src/http_server.rs (4)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
🧬 Code Graph Analysis (3)
lib/runtime/src/component.rs (2)
lib/runtime/src/component/namespace.rs (2)
basename(86-88)parent_hierarchy(90-92)lib/runtime/src/profiling.rs (9)
basename(105-105)basename(226-228)basename(310-312)basename(330-332)parent_hierarchy(121-121)parent_hierarchy(229-231)parent_hierarchy(313-315)parent_hierarchy(333-335)prefix(109-118)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (6)
lib/runtime/src/distributed.rs (4)
new(45-129)from_settings(131-134)from_settings(242-248)runtime(142-144)lib/runtime/src/service.rs (1)
new(37-39)lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (1)
app(30-54)lib/runtime/src/protocols/annotated.rs (1)
from_data(56-63)lib/runtime/src/profiling.rs (3)
drt(220-222)drt(304-306)drt(324-326)lib/runtime/src/pipeline/network.rs (1)
for_engine(308-319)
lib/runtime/src/http_server.rs (8)
lib/runtime/src/traits.rs (4)
drt(26-26)drt(41-43)rt(21-21)rt(30-32)lib/runtime/src/profiling.rs (11)
drt(220-222)drt(304-306)drt(324-326)basename(105-105)basename(226-228)basename(310-312)basename(330-332)parent_hierarchy(121-121)parent_hierarchy(229-231)parent_hierarchy(313-315)parent_hierarchy(333-335)lib/runtime/src/component/namespace.rs (2)
basename(86-88)parent_hierarchy(90-92)lib/runtime/src/distributed.rs (4)
basename(35-37)parent_hierarchy(39-41)new(45-129)from_settings_without_discovery(137-140)lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)
new(40-57)new(76-78)lib/runtime/src/service.rs (1)
new(37-39)lib/runtime/src/runtime.rs (1)
handle(123-128)lib/runtime/src/pipeline/context.rs (1)
registry(251-253)
🔇 Additional comments (27)
lib/runtime/examples/Cargo.toml (1)
18-21: Example crates are part of the default workspace build – contradicting the guidance commentIncluding
system_stats_endpointinmembersmeans it will be built/published with every workspace action, which the comment in the root Cargo.toml suggests should not happen for production.
Confirm this is intentional; otherwise gate examples behind an env flag or move them to[workspace.exclude].lib/runtime/examples/system_stats_endpoint/Cargo.toml (1)
1-34: LGTM! Well-structured Cargo.toml for the system stats endpoint example.The manifest follows standard Rust conventions with proper workspace inheritance and appropriate dependencies for a system stats endpoint demonstration.
lib/runtime/src/traits.rs (1)
35-44: LGTM! Clear trait implementation with excellent documentation.The implementation correctly provides the distributed runtime instance reference, and the detailed comments explain the necessity for satisfying the
MetricsRegistrytrait bounds. This follows the expected pattern for provider traits.lib/runtime/examples/system_stats_endpoint/src/lib.rs (1)
16-24: LGTM! Simple and appropriate example structure.The constant and struct are well-defined for demonstration purposes. The serde derives are correctly applied for serialization support.
lib/runtime/src/lib.rs (2)
43-43: LGTM! Profiling module addition supports the new metrics framework.The addition of the profiling module is appropriate for the hierarchical metrics registry functionality being introduced.
103-108: LGTM! Well-structured metrics registry storage with proper thread safety.The
prometheus_registries_by_prefixfield uses appropriate thread-safe types for concurrent access. The commented TODO fields are properly documented for future health and liveness check implementations.lib/runtime/src/component/namespace.rs (4)
22-24: LGTM! Appropriate trait imports for the new metrics framework.The imports are correctly added to support the new trait-based approach for metrics registry and distributed runtime access.
47-52: LGTM! Consistent trait-based access pattern.The change to use
DistributedRuntimeProvider::drt(self)follows the new trait-based approach consistently with other similar changes in the codebase.
62-67: LGTM! Consistent trait-based access pattern.The change to use
DistributedRuntimeProvider::drt(self)maintains consistency with the new trait-based approach for accessing the distributed runtime.
85-93: LGTM! Appropriate MetricsRegistry implementation for hierarchical metrics.The implementation correctly provides the namespace name as the basename and establishes the hierarchical relationship by including the distributed runtime prefix in the parent hierarchy. This follows the expected pattern for the metrics registry framework.
lib/runtime/src/component.rs (3)
32-34: Import looks good.The addition of
profiling::MetricsRegistryto the imports is necessary for the trait implementations.
173-185: MetricsRegistry implementation for Component is correct.The implementation properly defines the component's place in the metrics hierarchy by returning its name as basename and constructing the parent hierarchy from the namespace's hierarchy plus the namespace's prefix.
319-331: MetricsRegistry implementation for Endpoint follows the correct pattern.The implementation maintains the hierarchical structure by returning the endpoint's name as basename and building the parent hierarchy from the component's hierarchy plus the component's prefix.
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)
1-28: Example client structure looks good.The imports and overall structure are appropriate for a distributed runtime client example.
30-54: Well-structured example demonstrating distributed runtime usage.The client properly initializes the runtime, waits for endpoint instances, sends a request, handles streamed responses, scrapes statistics, and cleanly shuts down. This provides a clear example of the distributed runtime client APIs.
lib/runtime/src/distributed.rs (3)
34-42: Correct root-level MetricsRegistry implementation.The implementation properly establishes DistributedRuntime as the root of the metrics hierarchy with an empty basename and no parent hierarchy.
79-88: Excellent handling of Rust ownership rules.The code correctly extracts the cancel_token from runtime before moving it into the struct, preventing ownership issues. The comment clearly explains why this is necessary.
97-101: Thread-safe registry storage is well-designed.The
prometheus_registries_by_prefixfield properly usesArc<std::sync::Mutex<HashMap>>to enable thread-safe access to Prometheus registries indexed by their hierarchical prefixes.lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)
33-58: Well-designed metrics struct demonstrating the MetricsRegistry trait usage.The implementation properly uses Arc for shared ownership, accepts a generic MetricsRegistry parameter for flexibility, and follows Prometheus naming conventions for metrics.
71-106: Good example of AsyncEngine with metrics instrumentation.The implementation properly demonstrates request counting, duration measurement, and stream-based response handling. The metrics are recorded at appropriate points in the request lifecycle.
lib/runtime/src/profiling.rs (2)
51-98: Excellent macro design with clear documentation.The macro elegantly solves the constraints of maintaining
dyncompatibility while handling the Box/clone requirements. The extensive comments clearly explain the technical reasoning, making the code maintainable.
103-122: Well-designed trait with proper hierarchical support.The trait correctly defines the contract for hierarchical metrics with appropriate methods for basename, parent hierarchy, and prefix calculation. The default prefix() implementation properly handles edge cases.
lib/runtime/src/http_server.rs (5)
16-21: Imports look goodThe new imports for
MetricsRegistry,DistributedRuntimeProvider,OnceLock, andInstantare appropriate for the refactored metrics implementation.
26-44: Well-structured metrics registry implementationThe
HttpMetricsRegistrycorrectly implements the required traits and follows the hierarchical naming pattern established in the codebase.
200-208: Good test utility functionThe
make_test_drt()helper properly creates a test instance of DistributedRuntime for use in tests.
284-335: Comprehensive test coverage for metrics functionalityThe tests thoroughly verify:
- Start time initialization and restrictions
- Uptime calculation
- Metrics formatting with correct namespace prefixes
- Gauge value updates
Great job on the test coverage!
354-363: Test confirms panic behaviorThis test verifies the current panic behavior when
uptime()is called without initialization. If you implement my suggestion to return aResultinstead, this test would need to be updated to verify the error case.
|
Addressed the concern-- I removed the hierarchical prefix, and added auto-labeling. Currently working addressing the nitpick items, but do please start the review while I fix them. Thanks. |
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Overview:
Added hierarchical Metrics trait for DistributedRuntime, Namespace, Component, and Endpoint.
Details:
Where should the reviewer start?
lib/runtime/src/metrics.rs
lib/runtime/examples/system_metrics/src/bin/system_server.rs
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
DIS-216
Summary by CodeRabbit
New Features
MetricsRegistry, and provided trait implementations for core entities.Improvements
Documentation
Tests
Chores