
Commit 6cce469

Author: Pradyun Ramadorai
Improve MANTLE_EXTENSIONS.md documentation
Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)

2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <[email protected]>
1 parent b461e02 commit 6cce469


MANTLE_EXTENSIONS.md

Lines changed: 71 additions & 14 deletions
@@ -52,14 +52,14 @@ Disable all extensions: `export MANTLE_EXTENSIONS_ENABLED=0`
**Patch File**: `patches/serving_patches.py`
**Target**: `vllm.entrypoints.openai.serving_chat.OpenAIServingChat`
**Status**: ✅ Active
-**Category**: REQUIRED
-**Upstreamable**: No (internal infrastructure)
+**Category**: OPTIONAL (has fallback in harmony streaming patch)
+**Upstreamable**: Yes (minor utility addition)

#### Issue
-Inconsistent usage information creation across streaming events, especially with cached_tokens support for prompt_tokens_details.
+Inconsistent usage information creation across streaming events, especially with cached_tokens support for prompt_tokens_details. Without a centralized helper, every streaming implementation duplicates the same usage creation logic.

#### Solution
-Centralized static helper method for consistent `UsageInfo` creation:
+Centralized static helper method added to `OpenAIServingChat` for consistent `UsageInfo` creation:

```python
@staticmethod
@@ -81,9 +81,13 @@ def _create_usage_info(
```

#### Benefits
-- Consistent usage format across all streaming and non-streaming responses
-- Proper cached_tokens support in prompt_tokens_details
-- Centralized logic reduces duplication
+- ✅ Consistent usage format across all streaming and non-streaming responses
+- ✅ Proper cached_tokens support in prompt_tokens_details
+- ✅ Centralized logic reduces code duplication
+- ✅ Used by harmony streaming patch (with fallback if not present)
+
+#### Note
+The harmony streaming patch (`patched_stream_method.py`) attempts to import this helper from `OpenAIServingChat`, but includes a fallback implementation if it does not exist. This patch provides the "official" version on the class for cleanliness and consistency, but is not strictly required for functionality.
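
For reference, a minimal sketch of the try/fallback pattern the Note describes, assuming vLLM's `UsageInfo`/`PromptTokenUsageInfo` protocol models and a hypothetical helper signature (the real signature in `serving_patches.py` is not fully shown in this hunk):

```python
from typing import Optional

from vllm.entrypoints.openai.protocol import PromptTokenUsageInfo, UsageInfo
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat

# Prefer the helper installed on the class by this patch; otherwise fall
# back to a local implementation with equivalent behavior.
if hasattr(OpenAIServingChat, "_create_usage_info"):
    _create_usage_info = OpenAIServingChat._create_usage_info
else:
    def _create_usage_info(prompt_tokens: int,
                           completion_tokens: int,
                           num_cached_tokens: Optional[int] = None) -> UsageInfo:
        # cached_tokens only populates prompt_tokens_details when available.
        details = (PromptTokenUsageInfo(cached_tokens=num_cached_tokens)
                   if num_cached_tokens is not None else None)
        return UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
            prompt_tokens_details=details,
        )
```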

---

@@ -129,22 +133,75 @@ def _should_include_continuous_usage(stream_options) -> bool:

### 3. Harmony Token-by-Token Streaming

-**Patch File**: `patches/harmony_streaming_patch.py`
+**Patch File**: `patches/harmony_streaming_patch.py` + `patches/patched_stream_method.py`
**Target**: `vllm.entrypoints.openai.serving_chat.OpenAIServingChat.chat_completion_stream_generator`
**Status**: ✅ Active
**Category**: REQUIRED
-**Upstreamable**: Yes
+**Upstreamable**: Yes - blocked on upstream PR #26291
+**Related**: PR #26291 (Eagle3 Multi-Channel Streaming Fix)

#### Issue
-Harmony streaming needs token-by-token processing with proper tool calling support. Upstream has batch processing that can lose tool call details.
+**Primary Problem**: Upstream harmony streaming uses batch processing that groups multiple tokens together before processing. When used with **speculative decoding (Eagle/Eagle3)**, the draft model generates multiple candidate tokens per step, and these need careful per-token processing to maintain correct streaming behavior.
+
+**Specific Failures with Batch Processing**:
+1. **Tool Calling Broken**: Tools generate responses immediately without waiting for execution in streaming mode
+2. **Multi-Channel Content Lost**: When Eagle3 switches between channels (final answer vs reasoning vs tool calls) mid-batch, only the last channel's content is preserved
+3. **Token Truncation**: Intermediate spec tokens are lost during channel transitions, causing incomplete streaming output
+
+**Why This Happens with Speculative Decoding**:
+- Eagle draft model generates 5-10 tokens per step
+- Each token might belong to a different channel (reasoning/answer/tools)
+- Batch processing only examines the final state after processing ALL tokens
+- Intermediate channel transitions are lost → truncated output
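
To make the failure mode concrete, here is a self-contained toy illustration (hypothetical data; harmony channel names simplified) of why inspecting only the end-of-batch state drops earlier channels, while per-token processing keeps every transition:

```python
# One Eagle spec-decode step: several draft tokens spanning two channels.
spec_step = [
    ("analysis", "Checking the"),
    ("analysis", " tool result."),
    ("final", "The weather"),
    ("final", " is sunny."),
]

# Batch-style: only the state after the whole step is inspected, so the
# earlier "analysis" content never reaches the streaming client.
last_channel = spec_step[-1][0]
batch_delta = "".join(text for chan, text in spec_step if chan == last_channel)
print(batch_delta)             # "The weather is sunny."  (analysis text lost)

# Token-by-token: every channel transition is observed and can be emitted.
for chan, text in spec_step:
    print(f"[{chan}] {text}")  # both analysis and final deltas survive
```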

#### Solution
-Full method replacement (~888 lines) with token-by-token processing. Fixes tool calling in streaming mode - tools now wait for execution properly.
+Full method replacement (~888 lines) with **token-by-token processing** instead of batch grouping:
+
+```python
+# OLD (upstream batch processing):
+for chunk in result_generator:
+    for output in chunk.outputs:
+        delta_text = output.text[len(previous_texts[index]):]
+        # Process entire delta_text batch
+        # ❌ Loses intermediate channel transitions
+
+# NEW (Mantle token-by-token):
+for chunk in result_generator:
+    for output in chunk.outputs:
+        delta_text = output.text[len(previous_texts[index]):]
+        # ✅ Process each token individually
+        for token in tokenize(delta_text):
+            # Track per-token channel state
+            # Preserve all channel transitions
+```
+
+**Key Changes**:
+1. **Per-Token State Tracking**: Track a `(channel, recipient, delta)` tuple for EACH token
+2. **Grouped Message Construction**: Group consecutive same-channel tokens into a single DeltaMessage
+3. **Preserved Transitions**: All channel switches preserved, no truncation
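
The "Key Changes" above can be read as the following sketch (hypothetical example data and plain dicts; the real patch emits vLLM DeltaMessage objects): consecutive tokens sharing the same `(channel, recipient)` pair are merged into one delta, so every channel switch yields its own message and nothing is dropped.

```python
from itertools import groupby

# Per-token state produced by the token-by-token loop: (channel, recipient, delta).
# Hypothetical data; "commentary" plus a functions.* recipient is how harmony
# typically represents an in-progress tool call.
token_states = [
    ("analysis", None, "Need the"),
    ("analysis", None, " forecast."),
    ("commentary", "functions.get_weather", '{"city": '),
    ("commentary", "functions.get_weather", '"Paris"}'),
    ("final", None, "It is sunny."),
]

# Group consecutive tokens that share (channel, recipient) into one delta.
grouped_deltas = [
    {
        "channel": channel,
        "recipient": recipient,
        "content": "".join(delta for _, _, delta in group),
    }
    for (channel, recipient), group in groupby(token_states, key=lambda s: (s[0], s[1]))
]

for message in grouped_deltas:
    print(message)   # three messages: reasoning, tool call, final answer
```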

#### Benefits
-- Proper token-by-token streaming for Harmony models
-- Fixes tool calling in streaming mode
-- Can be removed once upstream implements proper token-by-token streaming
+- ✅ Proper token-by-token streaming for Harmony models
+- ✅ **Fixes tool calling in streaming mode** - tools wait for execution properly
+- ✅ **Fixes Eagle3 speculative decoding** - no token truncation during channel transitions
+- ✅ Enables multi-channel content (final answer + reasoning + tool calls)
+- ✅ Compatible with continuous usage statistics
+
+#### Upstream Status & Removal Plan
+
+**PR #26291**: https://github.com/vllm-project/vllm/pull/26291
+- **Status**: WIP upstream (attempted cherry-pick removed Oct 19, 2025 - didn't work correctly)
+- **Purpose**: Same goal - fix Eagle3 multi-channel streaming truncation
+- **Approach**: Similar token-by-token processing
+
+**When to Remove This Patch**:
+Once PR #26291 is properly merged upstream and verified working:
+1. Disable the patch in `patch_config.json`: `"harmony_streaming_patch": {"enabled": false}`
+2. Test with Eagle3 + tool calling + streaming
+3. If working correctly, remove the patch entirely
+4. Update documentation noting that upstream now handles this
+
+**Until then**: Keep this patch active, as it is the only solution that properly handles Eagle speculative decoding with tool calling in streaming mode.
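
Step 1 of the removal plan gates the patch via `patch_config.json`. A purely illustrative sketch of such a gate (hypothetical loader; the repository's actual extension loader is not shown here):

```python
import json
from pathlib import Path

# Hypothetical gate: read patch_config.json and only apply the patch when enabled.
config = json.loads(Path("patch_config.json").read_text())

if config.get("harmony_streaming_patch", {}).get("enabled", True):
    # Here the loader would monkeypatch
    # OpenAIServingChat.chat_completion_stream_generator with the Mantle version.
    pass
```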

---
