
Commit 112f343

More updates
1 parent caaabfd commit 112f343


7 files changed: +9, -200 lines


components/metrics/README.md

Lines changed: 3 additions & 7 deletions
@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
 To run a more realistic deployment to gathering metrics from,
 see the examples in [examples/llm](../../examples/llm).
 
-For example, for a VLLM + KV Routing based deployment that
-exposes statistics on an endpoint labeled
-`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
-with any other example such as examples/vllm_v0, vllm_v1, ...):
 ```bash
-cd deploy/examples/llm
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+python -m dynamo.frontend &
+python -m dynamo.vllm --model-path <your-model-checkout>
 ```
 
 Then, to monitor the metrics of these VllmWorkers, run:
 ```bash
-metrics --component VllmWorker --endpoint load_metrics
+metrics --component backend --endpoint load_metrics
 ```
 
 **NOTE**: `load_metrics` is currently a
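
Read together, the added lines in this hunk describe the updated workflow. A minimal end-to-end sketch, assuming a local frontend, a vLLM-backed worker, and a real model path substituted for the `<your-model-checkout>` placeholder:

```bash
# Terminal 1: start the frontend in the background, then launch a
# vLLM-backed worker (both commands taken from the updated README).
python -m dynamo.frontend &
python -m dynamo.vllm --model-path <your-model-checkout>

# Terminal 2: watch the load metrics now published by the `backend` component.
metrics --component backend --endpoint load_metrics
```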

docs/architecture/disagg_serving.md

Lines changed: 0 additions & 87 deletions
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 
 # Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
 - Add prefill worker: no explicit action needed.
 - Delete prefill worker: flush engine.
 
-### How this works under the hood
-
-#### Auto-Discovery for new workers
-
-In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
-
-You can watch this happen live by running the following:
-
-```bash
-# in terminal 1 - run the disaggregated serving example
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-
-```bash
-# in terminal 2 - watch the namespace in etcd
-watch -cd etcdctl get --prefix <namespace>
-```
-
-You should see something like this show up as the disaggregated serving example starts up:
-
-```bash
-# worker information
-dynamo/components/PrefillWorker/mock:694d967da694ea1e
-{
-  "component": "PrefillWorker",
-  "endpoint": "mock",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009310,
-  "transport": {
-    "nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
-  }
-}
-dynamo/components/Processor/chat/completions:694d967da694ea16
-{
-  "component": "Processor",
-  "endpoint": "chat/completions",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009302,
-  "transport": {
-    "nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
-  }
-}
-dynamo/components/VllmWorker/generate:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "generate",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
-  }
-}
-dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "load_metrics",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
-  }
-}
-
-# nixl metadata
-dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
-```
-
-#### Graceful worker shutdown
-
-Since worker information is stored in etcd, we can shutdown workers by simply revoking their etcd leases. After a lease is revoked:
-
-- Decode/aggregated worker endpoints are immediately removed from etcd so that they would not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully
-- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote kv cache writes finish
-
-You can also visualize this by revoking a workers etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
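
The deleted walkthrough above still conveys the mechanism: component registrations live under an etcd prefix and are tied to leases, and graceful shutdown is driven by revoking a worker's lease. A minimal sketch using stock `etcdctl` commands, assuming the default `dynamo` namespace and a placeholder lease ID taken from the registration output:

```bash
# Watch worker registrations (and NIXL metadata) appear and disappear
# under the dynamo namespace prefix in etcd.
watch -cd etcdctl get --prefix dynamo/

# Trigger a graceful shutdown of one worker by revoking its etcd lease;
# the hex lease ID below is a placeholder copied from the registration key above.
etcdctl lease revoke 694d967da694ea1a
```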

docs/architecture/kv_cache_routing.md

Lines changed: 2 additions & 90 deletions
@@ -3,7 +3,8 @@ SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
 SPDX-License-Identifier: Apache-2.0
 -->
 
-## NEW
+# KV Cache Routing
+This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
 
 To enable KV cache aware routing start the frontend node like this:
 ```
@@ -22,26 +23,8 @@ The KV-aware routing arguments:
 
 - `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
 
----
-
-## OLD - NEEDS MERGING WITH ABOVE
-
->[!NOTE]
->This information is temporary and will change soon.
-
-# KV Cache Routing
-This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
 
 ## Architecture
-Dynamo's architecture consists of three key concepts:
-
-- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
-- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker `, `Router`, `Processor` for the components. Components can be created in Python or Rust.
-- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
-
-A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
-
-A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
 
 Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
 
@@ -161,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
 
 The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
 
-Example:
-```python
-from dynamo.llm import KvIndexer
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-
-indexer = KvIndexer(kv_listener, block_size=16)
-indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
-```
-
-Sample Output:
-```
-{
-  123456789: 10,
-  987654321: 3,
-  543219876: 7,
-}
-```
-
-```{note}
-This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
-```
-
 ### WorkerMetricsPublisher
 We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
 - num_requests_waiting
@@ -202,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
 ### KvMetricsAggregator
 The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
 
-Example:
-```python
-from dynamo.llm import KvMetricsAggregator
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-metrics_aggregator = KvMetricsAggregator(kv_listener)
-
-for endpoint in metrics_aggregator.get_metrics().endpoints:
-    print("Worker ID: ", endpoint.worker_id)
-    print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
-    print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
-    print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
-    print("***")
-```
-
-Sample Output:
-```
-Worker ID: 123456789
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 2
-GPU Prefix Cache Hit Rate: 0.1
-***
-Worker ID: 987654321
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 1
-GPU Prefix Cache Hit Rate: 0.1
-***
-```
-
-```{note}
-This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
-```
-
-### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
-The Router component makes intelligent worker selection decisions
-1. Receives incoming requests as tokens
-2. Queries the KVIndexer to find potential cache hits across workers
-3. Collects performance metrics from workers (via KvMetricsAggregator)
-4. Uses a cost function to determine the optimal worker for each request
-5. Returns chosen worker
-
-The processor manages tokenizing the request, sending it to the KV Router and then once it receives a response, directs the request to the selected worker using direct() routing.

docs/guides/planner_benchmark/README.md

Lines changed: 4 additions & 15 deletions
@@ -1,18 +1,6 @@
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 
 # Planner Benchmark Example
@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
 To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
 
 ```bash
-cd examples/llm
-dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
+# Start Kubernetes with one frontend node, one prefill and one decode worker
+# TODO
 
 # in terminal 2
 genai-perf profile \
@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
 
 ```bash
 # in terminal 1
-dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
+# Start Kubernetes with one frontend node, two prefill and two decode workers
+# TODO
 
 # in terminal 2
 genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl

lib/llm/src/discovery/watcher.rs

Lines changed: 0 additions & 1 deletion
@@ -178,7 +178,6 @@ impl ModelWatcher {
             Some(card)
         }
         Err(err) => {
-            // `dynamo serve` isn't using MDC yet so can't be an error
             tracing::info!(%err, "load_mdc did not complete");
             None
         }
