Commit 05c3e27

draft
1 parent 41d7085 commit 05c3e27

2 files changed

+283
-0
lines changed

Lines changed: 283 additions & 0 deletions
@@ -0,0 +1,283 @@
1+
# Dynamo integration with Inference Gateway
2+
3+
**Status**: Draft
4+
5+
**Authors**: [Biswa Panda](https://github.com/biswapanda)
6+
7+
**Category**: Architecture
8+
9+
**Replaces**: [Link of previous proposal if applicable]
10+
11+
**Replaced By**: [Link of previous proposal if applicable]
12+
13+
**Sponsor**: [Name of code owner or maintainer to shepherd process]
14+
15+
**Required Reviewers**: [Names of technical leads that are required for acceptance]
16+
17+
**Review Date**: [Date for review]
18+
19+
**Pull Request**: [Link to Pull Request of the Proposal itself]
20+
21+
**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]
22+
23+
# Summary
24+
25+
This proposal outlines the integration of Dynamo components with the Gateway API Inference Extension.
26+
27+
The current Inference Gateway is tightly coupled to the model's tokenizer. However, several use cases require:
28+
1. **External Tokenization**: Preprocessing requests outside the gateway for specialized tokenization logic
29+
2. **KV-Aware Routing**: Intelligent routing based on prefix cache status and token analysis
30+
3. **Flexible side channel to offload tokens**: Support for both external cache and direct token passing strategies. This helps when transferring large blobs of tokens for VLMs (image, audio, or video tokens)
31+
4. **Unified Dynamo Architecture**: Consolidated deployment model for all processing components
32+
33+
## Terminology & Definitions
34+
35+
| Term | Definition |
36+
| :---- | :---- |
37+
| **Dynamo EPP** | Enhanced Endpoint Picker Protocol service with Dynamo integration |
38+
| **Dynamo Processor** | Dynamo component responsible for request tokenization and preprocessing |
39+
| **Dynamo Router** | Dynamo component responsible for the KV-aware routing strategy |
40+
| **Token Cache / Side Channel** | External storage system for tokenized requests |
41+
42+
## Acronyms & Abbreviations
43+
44+
**EPP:** Endpoint Picker Protocol
45+
**IGW:** Inference Gateway
46+
47+
## Goals
48+
49+
* Integrate Dynamo Processor for request preprocessing and tokenization
50+
* Enable KV-aware routing through Dynamo Router Service
51+
* Support flexible token management (cache keys vs direct values)
52+
* Provide unified deployment architecture for all Dynamo components
53+
* Maintain backward compatibility with existing EPP functionality
54+
55+
### Non Goals
56+
57+
* Replace existing EPP internal scheduling completely
58+
* Modify core Gateway API specifications
59+
* Change existing worker pod interfaces significantly
60+
61+
## Requirements
62+
63+
### REQ 1 External Processing Integration
64+
65+
Dynamo EPP (Endpoint picker) **MUST** support calling LLM processors for request preprocessing and tokenization while maintaining the existing ext-proc interface.
66+
67+
### REQ 2 Flexible Routing Strategies
68+
69+
The system **SHOULD** support both external routing (via Dynamo Router) and internal EPP scheduling based on request configuration.
70+
71+
### REQ 3 Token offloading capability
72+
73+
The system **SHOULD** support both external cache-based token storage and direct token value passing to worker pods.
74+
75+
### REQ 4 Unified Dynamo Architecture
76+
77+
Dynamo EPP and components (Processor, Router, Workers) **MUST** be deployable as a unified Dynamo graph within Kubernetes.
78+
79+
### REQ 5 Maintain compatibility with Inference Gateway protocols
80+
81+
Dynamo EPP **MUST** be compatible with the Inference Gateway protocols.
82+
83+
# Proposal
84+
85+
## Architecture Overview
86+
87+
88+
The updated architecture unifies Inference Gateway with Dynamo Graph deployment. See [Architecture Diagram](./arch1.png) for detailed component interactions.
89+
90+
## Request Flow Sequence
91+
92+
```mermaid
93+
sequenceDiagram
94+
participant Client
95+
participant IGW as Inference Gateway<br/>(Envoy/kGateway)
96+
participant EPP as EPP Service<br/>(ext-proc/Endpoint Picker)
97+
participant ExtProcessor as (Dynamo) External LLM<br/>Processor
98+
participant Router as (Dynamo) Router
99+
participant TokenCache as External Token<br/>Cache/Side-channel
100+
participant Worker as (Dynamo) Worker<br/>Pod
101+
102+
Note over Client,Worker: Token Handling & Routing Strategies
103+
104+
%% Client Request
105+
Client->>IGW: POST /v1/chat/completions<br/>{"model": "llama-instruct",<br/> "messages": [...]<br/> }
106+
107+
IGW->>EPP: ext-proc: RequestHeaders
108+
EPP->>EPP: Parse model name from request<br/>Set X-Gateway-Model-Name header
109+
IGW->>EPP: ext-proc: RequestBody
110+
111+
%% Scenario 1: route=true (External routing via Router Service)
112+
alt route=true
113+
EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...],<br/> "route": true<br/> },<br/> "headers": {"x-request-id": "req-123"}}
114+
115+
ExtProcessor->>ExtProcessor: Tokenize prompt (always)<br/>Generate token_ids: [1, 15043, 29892, ...]
116+
117+
ExtProcessor->>Router: POST /route<br/>{"token_ids": [1, 15043, 29892, ...]}
118+
119+
Router->>Router: Apply KV aware routing:<br/>- Check prefix cache<br/>- Apply custom routing strategy<br/>- Select optimal worker
120+
121+
Router-->>ExtProcessor: Worker selection:<br/>{"worker_address": "worker-3:8080"}
122+
123+
%% Token storage decision
124+
alt Using External Cache
125+
ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_abc123"<br/>Value: [1, 15043, 29892, ...]
126+
127+
ExtProcessor-->>EPP: Response with token_key:<br/>{"worker_address": "worker-3:8080",<br/> "token_key": "cache_key_abc123"}
128+
129+
EPP->>EPP: Set x-gateway-destination-endpoint: "worker-3:8080"<br/>Set routing metadata
130+
EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_abc123"
131+
132+
else Direct Token Values
133+
ExtProcessor-->>EPP: Response with token_value:<br/>{"worker_address": "worker-2:8080",<br/> "token_value": "[1,15043,29892,...]"}
134+
135+
EPP->>EPP: Set x-gateway-destination-endpoint: "worker-2:8080"<br/>Set routing metadata
136+
EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]
137+
end
138+
139+
%% Scenario 2: route not specified (Internal routing)
140+
else route not specified
141+
Note over EPP: EPP schedules worker pods<br/>using internal logic
142+
143+
EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...]<br/> },<br/> "headers": {"x-request-id": "req-123"}}
144+
145+
ExtProcessor->>ExtProcessor: Tokenize prompt (always)
146+
147+
%% Allow external cache for internal routing too
148+
alt Store in External Cache
149+
ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_xyz789"<br/>Value: [1, 15043, 29892, ...]
150+
151+
ExtProcessor-->>EPP: Response with token_key:<br/>{"token_key": "cache_key_xyz789"}
152+
153+
EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Check pod availability<br/>- Select: worker-pool-1:8080
154+
155+
EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata
156+
EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_xyz789"
157+
158+
else Direct Token Values
159+
ExtProcessor-->>EPP: Response with tokens:<br/>{"token_value": "[1,15043,29892,...]"}
160+
161+
EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Select: worker-pool-1:8080
162+
163+
EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata
164+
EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]
165+
end
166+
end
167+
168+
EPP-->>IGW: ext-proc Response<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if using cache)<br/>Modified request body (if direct tokens)
169+
170+
%% Request forwarding
171+
IGW->>Worker: HTTP Request to selected worker<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if applicable)<br/>Body: includes token_ids (if direct)
172+
173+
alt Worker receives token_key in header
174+
Worker->>TokenCache: Fetch tokens<br/>Key from header: x-req-tokens-key
175+
TokenCache-->>Worker: Token array: [1,15043,29892,...]
176+
else Worker receives token_ids in body
177+
Worker->>Worker: Use token_ids directly<br/>from request body
178+
end
179+
180+
Worker->>Worker: LLM Inference with tokens
181+
Worker-->>IGW: Response<br/>{"choices": [...], "usage": {...}}
182+
183+
IGW-->>Client: Final Response
184+
```
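
The EPP's handling of the Processor response in the diagram above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the function name `build_mutation` and the `fallback_worker` parameter (standing in for the result of the EPP's internal scheduler) are hypothetical, and the response fields follow the example payloads in the diagram.

```python
import json
from typing import Any

DEST_HEADER = "x-gateway-destination-endpoint"
TOKENS_KEY_HEADER = "x-req-tokens-key"


def build_mutation(
    request_body: dict[str, Any],
    processor_response: dict[str, Any],
    fallback_worker: str,
) -> tuple[dict[str, str], dict[str, Any]]:
    """Return (headers to set, request body to forward) for the ext-proc response.

    Hypothetical helper: `fallback_worker` stands in for the worker chosen by
    the EPP's internal scheduling when the Router was not consulted.
    """
    # With route=true the Processor relays the Router's worker selection.
    worker = processor_response.get("worker_address", fallback_worker)
    headers = {DEST_HEADER: worker}
    body = dict(request_body)  # shallow copy so the caller's body is untouched

    if "token_key" in processor_response:
        # External cache: the worker fetches the tokens using this key.
        headers[TOKENS_KEY_HEADER] = processor_response["token_key"]
    elif "token_value" in processor_response:
        # Direct values: the diagram shows a JSON-encoded array string.
        value = processor_response["token_value"]
        body["token_ids"] = json.loads(value) if isinstance(value, str) else list(value)
    return headers, body
```

Keeping the cache and direct paths in one function makes it visible that exactly one of the two token transports is applied per request.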
185+
186+
# Implementation Details
187+
188+
## Key Components
189+
190+
### Dynamo EPP (ext-proc)
191+
- Integrates with Gateway via ext-proc protocol
192+
- Parses model names and sets `X-Gateway-Model-Name` header
193+
- Calls External LLM Processor for tokenization
194+
- Handles both external and internal routing strategies
195+
- Manages token key/value header and body modifications
196+
197+
### Dynamo Processor
198+
- Performs request tokenization
199+
- Supports both routing modes (external via Router, internal via EPP)
200+
- Manages token transfer strategies (cache vs direct)
201+
- Returns the worker selection and a request that is agnostic to the Dynamo backend framework (vLLM, TensorRT-LLM, SGLang)
202+
203+
### Dynamo Router Service
204+
- Implements KV-aware routing algorithms
205+
- Analyzes token_ids for optimal worker selection based on prefix cache
206+
- Called only when `route=true` is specified
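
The proposal does not specify the routing algorithm, so the following is only one simple way to picture "optimal worker selection based on prefix cache": pick the worker whose cached prefixes share the longest leading run of tokens with the request. The function names and the `cached_prefixes` structure are assumptions made for illustration; a real router would likely also weigh load and other signals.

```python
from collections.abc import Mapping, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_worker(
    token_ids: Sequence[int],
    cached_prefixes: Mapping[str, Sequence[Sequence[int]]],
) -> str:
    """Pick the worker with the longest cached prefix match (hypothetical scoring).

    `cached_prefixes` maps a worker address to the token prefixes assumed to be
    in its KV cache. Ties go to the lexicographically first address.
    """
    if not cached_prefixes:
        raise ValueError("no workers available")
    best_worker, best_score = "", -1
    for worker in sorted(cached_prefixes):
        score = max(
            (shared_prefix_len(token_ids, p) for p in cached_prefixes[worker]),
            default=0,
        )
        if score > best_score:
            best_worker, best_score = worker, score
    return best_worker
```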
207+
208+
### Dynamo Worker Pods
209+
- Perform LLM inference with preprocessed tokens
210+
- Support both token retrieval methods (cache keys, direct values)
211+
- Maintain compatibility with existing worker interfaces
212+
- Expose an HTTP endpoint for direct integration with the Inference Gateway
213+
214+
### Token Cache / Side channel
215+
- External storage system that provides a key/value store interface to transfer token_ids from the Processor to the worker
216+
- Stores tokenized data with generated keys
217+
- Enables efficient token sharing between components
218+
- Optional component (direct token passing also supported)
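
As a rough picture of the key/value interface described above, the sketch below uses an in-memory dictionary as a stand-in for the external store, whose concrete implementation the proposal defers. The class name, method names, and the `cache_key_` prefix (borrowed from the diagram's example keys) are assumptions.

```python
import uuid
from collections.abc import Sequence


class InMemoryTokenCache:
    """Illustrative stand-in for the external token cache (not a real backend)."""

    def __init__(self) -> None:
        self._store: dict[str, list[int]] = {}

    def put(self, token_ids: Sequence[int]) -> str:
        """Store tokens under a freshly generated key and return that key."""
        key = f"cache_key_{uuid.uuid4().hex}"
        self._store[key] = list(token_ids)
        return key

    def get(self, key: str) -> list[int]:
        """Return the tokens stored under `key`; raises KeyError if absent."""
        return self._store[key]
```

In this picture the Processor would call `put` and hand the returned key to the EPP, and the worker would later call `get` with the key it receives in the `x-req-tokens-key` header.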
219+
220+
## Configuration
221+
222+
### Environment Variables
223+
- `EXTERNAL_LLM_PROCESSOR_ENDPOINT`: Dynamo External LLM Processor URL
224+
- `USE_EXTERNAL_LLM_PROCESSOR`: Enable/disable external pre-processing (apply prompt templates/tokenization)
225+
- `USE_EXTERNAL_LLM_ROUTER`: Enable/disable external routing (in this case it's Dynamo Router)
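
A minimal sketch of how these variables might be read at startup follows. Only the three variable names come from this proposal; the `EppConfig` type, the accepted boolean spellings, the default of "disabled", and the validation rule are all assumptions.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings for "enabled"


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Interpret an environment variable as a boolean; unset means False."""
    return env.get(name, "").strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class EppConfig:
    processor_endpoint: str
    use_external_processor: bool
    use_external_router: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> EppConfig:
    """Build the config from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    cfg = EppConfig(
        processor_endpoint=env.get("EXTERNAL_LLM_PROCESSOR_ENDPOINT", ""),
        use_external_processor=_flag(env, "USE_EXTERNAL_LLM_PROCESSOR"),
        use_external_router=_flag(env, "USE_EXTERNAL_LLM_ROUTER"),
    )
    # Assumed sanity check: enabling the processor requires knowing its URL.
    if cfg.use_external_processor and not cfg.processor_endpoint:
        raise ValueError("EXTERNAL_LLM_PROCESSOR_ENDPOINT must be set")
    return cfg
```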
226+
227+
### Headers
228+
- `X-Gateway-Model-Name`: Set by EPP from parsed model name in user request's body
229+
- `x-req-tokens-key`: Token cache key (when using external cache)
230+
- `x-req-tokens-value`: Direct token values (alternative to cache)
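
On the worker side, the two token transports shown in the request flow could be resolved roughly as below. This is a sketch under assumptions: `resolve_token_ids` and the `fetch_from_cache` callable are hypothetical, and the direct path reads `token_ids` from the request body as the sequence diagram does, so the `x-req-tokens-value` header is not handled here.

```python
from collections.abc import Callable, Mapping
from typing import Any


def resolve_token_ids(
    headers: Mapping[str, str],
    body: Mapping[str, Any],
    fetch_from_cache: Callable[[str], list[int]],
) -> list[int]:
    """Return the request's token IDs from the cache key header or the body.

    `fetch_from_cache` is a hypothetical lookup into the token cache.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = lowered.get("x-req-tokens-key")
    if key is not None:
        return fetch_from_cache(key)
    if "token_ids" in body:
        return list(body["token_ids"])
    raise ValueError("request carries neither a token cache key nor token_ids")
```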
231+
232+
## Deferred to Implementation
233+
234+
- Specific token cache implementation details (Redis vs alternatives)
235+
- Fallback mechanisms for external service failures
236+
- Metrics and observability integration
237+
238+
# Implementation Phases
239+
240+
## Phase 1 Core Integration
241+
**Supported API / Behavior:**
242+
- External tokenization via Dynamo Processor
243+
- External scheduling/routing using Dynamo Router
244+
- Direct token value passing to workers
245+
246+
**Not Supported:**
247+
- External cache-based token passing
248+
249+
## Phase 2 Token transfer through side channel/cache
250+
**Supported API / Behavior:**
251+
- External cache-based token passing
252+
253+
# Related Proposals
254+
* Gateway API Inference Extension Architecture
255+
* EPP Architecture Proposal
256+
* Model Server Protocol
257+
258+
# Alternate Solutions
259+
260+
## Alt 1 Direct Tokenizer Integration in EPP (current EPP architecture)
261+
262+
**Pros:**
263+
- Simpler architecture without additional layer
264+
- Lower latency for request processing
265+
- Fewer network hops
266+
267+
**Cons:**
268+
- Less flexible for different models
269+
- Harder to maintain separation of concerns
270+
271+
**Reason Rejected:**
272+
- Violates Gateway API integration principles
273+
- Reduces portability across models
274+
- Increases complexity/TCO by using a Go-based tokenizer
275+
276+
## Alt 2 Sidecar Pattern
277+
- TODO
278+
279+
## References
280+
281+
* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/)
282+
* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)
283+
* [Gateway API Specification](https://gateway-api.sigs.k8s.io/)