
Commit 112f343

More updates
1 parent caaabfd commit 112f343


7 files changed: +9, -200 lines


components/metrics/README.md

Lines changed: 3 additions & 7 deletions
@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
 To run a more realistic deployment to gathering metrics from,
 see the examples in [examples/llm](../../examples/llm).
 
-For example, for a VLLM + KV Routing based deployment that
-exposes statistics on an endpoint labeled
-`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
-with any other example such as examples/vllm_v0, vllm_v1, ...):
 ```bash
-cd deploy/examples/llm
-dynamo serve graphs.agg:Frontend -f configs/agg.yaml
+python -m dynamo.frontend &
+python -m dynamo.vllm --model-path <your-model-checkout>
 ```
 
 Then, to monitor the metrics of these VllmWorkers, run:
 ```bash
-metrics --component VllmWorker --endpoint load_metrics
+metrics --component backend --endpoint load_metrics
 ```
 
 **NOTE**: `load_metrics` is currently a
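
Read together, the added lines in this hunk describe the updated workflow. A minimal end-to-end sketch, assuming a local frontend, a vLLM-backed worker, and a real model path substituted for the `<your-model-checkout>` placeholder:

```bash
# Terminal 1: start the frontend in the background, then launch a
# vLLM-backed worker (both commands taken from the updated README).
python -m dynamo.frontend &
python -m dynamo.vllm --model-path <your-model-checkout>

# Terminal 2: watch the load metrics now published by the `backend` component.
metrics --component backend --endpoint load_metrics
```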

docs/architecture/disagg_serving.md

Lines changed: 0 additions & 87 deletions
@@ -2,18 +2,6 @@
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
 All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 
 # Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
 - Add prefill worker: no explicit action needed.
 - Delete prefill worker: flush engine.
 
-### How this works under the hood
-
-#### Auto-Discovery for new workers
-
-In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
-
-You can watch this happen live by running the following:
-
-```bash
-# in terminal 1 - run the disaggregated serving example
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
-
-```bash
-# in terminal 2 - watch the namespace in etcd
-watch -cd etcdctl get --prefix <namespace>
-```
-
-You should see something like this show up as the disaggregated serving example starts up:
-
-```bash
-# worker information
-dynamo/components/PrefillWorker/mock:694d967da694ea1e
-{
-  "component": "PrefillWorker",
-  "endpoint": "mock",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009310,
-  "transport": {
-    "nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
-  }
-}
-dynamo/components/Processor/chat/completions:694d967da694ea16
-{
-  "component": "Processor",
-  "endpoint": "chat/completions",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009302,
-  "transport": {
-    "nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
-  }
-}
-dynamo/components/VllmWorker/generate:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "generate",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
-  }
-}
-dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
-{
-  "component": "VllmWorker",
-  "endpoint": "load_metrics",
-  "namespace": "dynamo",
-  "lease_id": 7587886413599009306,
-  "transport": {
-    "nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
-  }
-}
-
-# nixl metadata
-dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
-```
-
-#### Graceful worker shutdown
-
-Since worker information is stored in etcd, we can shutdown workers by simply revoking their etcd leases. After a lease is revoked:
-
-- Decode/aggregated worker endpoints are immediately removed from etcd so that they would not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully
-- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote kv cache writes finish
-
-You can also visualize this by revoking a workers etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
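
The deleted walkthrough above still conveys the mechanism: component registrations live under an etcd prefix and are tied to leases, and graceful shutdown is driven by revoking a worker's lease. A minimal sketch using stock `etcdctl` commands, assuming the default `dynamo` namespace and a placeholder lease ID taken from the registration output:

```bash
# Watch worker registrations (and NIXL metadata) appear and disappear
# under the dynamo namespace prefix in etcd.
watch -cd etcdctl get --prefix dynamo/

# Trigger a graceful shutdown of one worker by revoking its etcd lease;
# the hex lease ID below is a placeholder copied from the registration key above.
etcdctl lease revoke 694d967da694ea1a
```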

docs/architecture/kv_cache_routing.md

Lines changed: 2 additions & 90 deletions
@@ -3,7 +3,8 @@ SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
 SPDX-License-Identifier: Apache-2.0
 -->
 
-## NEW
+# KV Cache Routing
+This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
 
 To enable KV cache aware routing start the frontend node like this:
 ```
@@ -22,26 +23,8 @@ The KV-aware routing arguments:
 
 - `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
 
----
-
-## OLD - NEEDS MERGING WITH ABOVE
-
->[!NOTE]
->This information is temporary and will change soon.
-
-# KV Cache Routing
-This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
 
 ## Architecture
-Dynamo's architecture consists of three key concepts:
-
-- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
-- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker `, `Router`, `Processor` for the components. Components can be created in Python or Rust.
-- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
-
-A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
-
-A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
 
 Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
 
@@ -161,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
 
 The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
 
-Example:
-```python
-from dynamo.llm import KvIndexer
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-
-indexer = KvIndexer(kv_listener, block_size=16)
-indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
-```
-
-Sample Output:
-```
-{
-  123456789: 10,
-  987654321: 3,
-  543219876: 7,
-}
-```
-
-```{note}
-This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
-```
-
 ### WorkerMetricsPublisher
 We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
 - num_requests_waiting
@@ -202,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
 ### KvMetricsAggregator
 The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
 
-Example:
-```python
-from dynamo.llm import KvMetricsAggregator
-from dynamo.sdk import dynamo_context
-
-runtime = dynamo_context["runtime"]
-kv_listener = runtime.namespace("dynamo").component("VllmWorker")
-await kv_listener.create_service()
-metrics_aggregator = KvMetricsAggregator(kv_listener)
-
-for endpoint in metrics_aggregator.get_metrics().endpoints:
-    print("Worker ID: ", endpoint.worker_id)
-    print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
-    print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
-    print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
-    print("***")
-```
-
-Sample Output:
-```
-Worker ID: 123456789
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 2
-GPU Prefix Cache Hit Rate: 0.1
-***
-Worker ID: 987654321
-GPU Cache Usage: 0.5
-Number of Requests Waiting: 1
-GPU Prefix Cache Hit Rate: 0.1
-***
-```
-
-```{note}
-This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
-```
-
-### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
-The Router component makes intelligent worker selection decisions
-1. Receives incoming requests as tokens
-2. Queries the KVIndexer to find potential cache hits across workers
-3. Collects performance metrics from workers (via KvMetricsAggregator)
-4. Uses a cost function to determine the optimal worker for each request
-5. Returns chosen worker
-
-The processor manages tokenizing the request, sending it to the KV Router and then once it receives a response, directs the request to the selected worker using direct() routing.

docs/guides/planner_benchmark/README.md

Lines changed: 4 additions & 15 deletions
@@ -1,18 +1,6 @@
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 
 # Planner Benchmark Example
@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
 To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
 
 ```bash
-cd examples/llm
-dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
+# Start Kubernetes with one frontend node, one prefill and one decode worker
+# TODO
 
 # in terminal 2
 genai-perf profile \
@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
 
 ```bash
 # in terminal 1
-dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
+# Start Kubernetes with one frontend node, two prefill and two decode workers
+# TODO
 
 # in terminal 2
 genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl

lib/llm/src/discovery/watcher.rs

Lines changed: 0 additions & 1 deletion
@@ -178,7 +178,6 @@ impl ModelWatcher {
             Some(card)
         }
         Err(err) => {
-            // `dynamo serve` isn't using MDC yet so can't be an error
             tracing::info!(%err, "load_mdc did not complete");
             None
         }
