
Commit 6cce469

Author: Pradyun Ramadorai
Improve MANTLE_EXTENSIONS.md documentation
Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)

2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <[email protected]>
1 parent b461e02 commit 6cce469


MANTLE_EXTENSIONS.md

Lines changed: 71 additions & 14 deletions
@@ -52,14 +52,14 @@ Disable all extensions: `export MANTLE_EXTENSIONS_ENABLED=0`
**Patch File**: `patches/serving_patches.py`
**Target**: `vllm.entrypoints.openai.serving_chat.OpenAIServingChat`
**Status**: ✅ Active
-**Category**: REQUIRED
-**Upstreamable**: No (internal infrastructure)
+**Category**: OPTIONAL (has fallback in harmony streaming patch)
+**Upstreamable**: Yes (minor utility addition)

#### Issue
-Inconsistent usage information creation across streaming events, especially with cached_tokens support for prompt_tokens_details.
+Inconsistent usage information creation across streaming events, especially with cached_tokens support for prompt_tokens_details. Without a centralized helper, every streaming implementation duplicates the same usage creation logic.

#### Solution
-Centralized static helper method for consistent `UsageInfo` creation:
+Centralized static helper method added to `OpenAIServingChat` for consistent `UsageInfo` creation:

```python
@staticmethod
@@ -81,9 +81,13 @@ def _create_usage_info(
```

#### Benefits
-- Consistent usage format across all streaming and non-streaming responses
-- Proper cached_tokens support in prompt_tokens_details
-- Centralized logic reduces duplication
+- ✅ Consistent usage format across all streaming and non-streaming responses
+- ✅ Proper cached_tokens support in prompt_tokens_details
+- ✅ Centralized logic reduces code duplication
+- ✅ Used by harmony streaming patch (with fallback if not present)
+
+#### Note
+The harmony streaming patch (`patched_stream_method.py`) attempts to import this helper from `OpenAIServingChat`, but includes a fallback implementation if it does not exist. This patch provides the "official" version on the class for cleanliness and consistency, but is not strictly required for functionality.
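
For reference, a minimal sketch of the try/fallback pattern the Note describes, assuming vLLM's `UsageInfo`/`PromptTokenUsageInfo` protocol models and a hypothetical helper signature (the real signature in `serving_patches.py` is not fully shown in this hunk):

```python
from typing import Optional

from vllm.entrypoints.openai.protocol import PromptTokenUsageInfo, UsageInfo
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat

# Prefer the helper installed on the class by this patch; otherwise fall
# back to a local implementation with equivalent behavior.
if hasattr(OpenAIServingChat, "_create_usage_info"):
    _create_usage_info = OpenAIServingChat._create_usage_info
else:
    def _create_usage_info(prompt_tokens: int,
                           completion_tokens: int,
                           num_cached_tokens: Optional[int] = None) -> UsageInfo:
        # cached_tokens only populates prompt_tokens_details when available.
        details = (PromptTokenUsageInfo(cached_tokens=num_cached_tokens)
                   if num_cached_tokens is not None else None)
        return UsageInfo(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
            prompt_tokens_details=details,
        )
```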

---

@@ -129,22 +133,75 @@ def _should_include_continuous_usage(stream_options) -> bool:

### 3. Harmony Token-by-Token Streaming

-**Patch File**: `patches/harmony_streaming_patch.py`
+**Patch File**: `patches/harmony_streaming_patch.py` + `patches/patched_stream_method.py`
**Target**: `vllm.entrypoints.openai.serving_chat.OpenAIServingChat.chat_completion_stream_generator`
**Status**: ✅ Active
**Category**: REQUIRED
-**Upstreamable**: Yes
+**Upstreamable**: Yes - blocked on upstream PR #26291
+**Related**: PR #26291 (Eagle3 Multi-Channel Streaming Fix)

#### Issue
-Harmony streaming needs token-by-token processing with proper tool calling support. Upstream has batch processing that can lose tool call details.
+**Primary Problem**: Upstream harmony streaming uses batch processing that groups multiple tokens together before processing. When used with **speculative decoding (Eagle/Eagle3)**, the draft model generates multiple candidate tokens per step, and these need careful per-token processing to maintain correct streaming behavior.
+
+**Specific Failures with Batch Processing**:
+1. **Tool Calling Broken**: Tools generate responses immediately without waiting for execution in streaming mode
+2. **Multi-Channel Content Lost**: When Eagle3 switches between channels (final answer vs reasoning vs tool calls) mid-batch, only the last channel's content is preserved
+3. **Token Truncation**: Intermediate spec tokens are lost during channel transitions, causing incomplete streaming output
+
+**Why This Happens with Speculative Decoding**:
+- Eagle draft model generates 5-10 tokens per step
+- Each token might belong to a different channel (reasoning/answer/tools)
+- Batch processing only examines the final state after processing ALL tokens
+- Intermediate channel transitions are lost → truncated output
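
To make the failure mode concrete, here is a self-contained toy illustration (hypothetical data; harmony channel names simplified) of why inspecting only the end-of-batch state drops earlier channels, while per-token processing keeps every transition:

```python
# One Eagle spec-decode step: several draft tokens spanning two channels.
spec_step = [
    ("analysis", "Checking the"),
    ("analysis", " tool result."),
    ("final", "The weather"),
    ("final", " is sunny."),
]

# Batch-style: only the state after the whole step is inspected, so the
# earlier "analysis" content never reaches the streaming client.
last_channel = spec_step[-1][0]
batch_delta = "".join(text for chan, text in spec_step if chan == last_channel)
print(batch_delta)             # "The weather is sunny."  (analysis text lost)

# Token-by-token: every channel transition is observed and can be emitted.
for chan, text in spec_step:
    print(f"[{chan}] {text}")  # both analysis and final deltas survive
```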

#### Solution
-Full method replacement (~888 lines) with token-by-token processing. Fixes tool calling in streaming mode - tools now wait for execution properly.
+Full method replacement (~888 lines) with **token-by-token processing** instead of batch grouping:
+
+```python
+# OLD (upstream batch processing):
+for chunk in result_generator:
+    for output in chunk.outputs:
+        delta_text = output.text[len(previous_texts[index]):]
+        # Process entire delta_text batch
+        # ❌ Loses intermediate channel transitions
+
+# NEW (Mantle token-by-token):
+for chunk in result_generator:
+    for output in chunk.outputs:
+        delta_text = output.text[len(previous_texts[index]):]
+        # ✅ Process each token individually
+        for token in tokenize(delta_text):
+            # Track per-token channel state
+            # Preserve all channel transitions
+```
+
+**Key Changes**:
+1. **Per-Token State Tracking**: Track a `(channel, recipient, delta)` tuple for EACH token
+2. **Grouped Message Construction**: Group consecutive same-channel tokens into a single DeltaMessage
+3. **Preserved Transitions**: All channel switches preserved, no truncation
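
The "Key Changes" above can be read as the following sketch (hypothetical example data and plain dicts; the real patch emits vLLM DeltaMessage objects): consecutive tokens sharing the same `(channel, recipient)` pair are merged into one delta, so every channel switch yields its own message and nothing is dropped.

```python
from itertools import groupby

# Per-token state produced by the token-by-token loop: (channel, recipient, delta).
# Hypothetical data; "commentary" plus a functions.* recipient is how harmony
# typically represents an in-progress tool call.
token_states = [
    ("analysis", None, "Need the"),
    ("analysis", None, " forecast."),
    ("commentary", "functions.get_weather", '{"city": '),
    ("commentary", "functions.get_weather", '"Paris"}'),
    ("final", None, "It is sunny."),
]

# Group consecutive tokens that share (channel, recipient) into one delta.
grouped_deltas = [
    {
        "channel": channel,
        "recipient": recipient,
        "content": "".join(delta for _, _, delta in group),
    }
    for (channel, recipient), group in groupby(token_states, key=lambda s: (s[0], s[1]))
]

for message in grouped_deltas:
    print(message)   # three messages: reasoning, tool call, final answer
```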

#### Benefits
-- Proper token-by-token streaming for Harmony models
-- Fixes tool calling in streaming mode
-- Can be removed once upstream implements proper token-by-token streaming
+- ✅ Proper token-by-token streaming for Harmony models
+- ✅ **Fixes tool calling in streaming mode** - tools wait for execution properly
+- ✅ **Fixes Eagle3 speculative decoding** - no token truncation during channel transitions
+- ✅ Enables multi-channel content (final answer + reasoning + tool calls)
+- ✅ Compatible with continuous usage statistics
+
+#### Upstream Status & Removal Plan
+
+**PR #26291**: https://github.com/vllm-project/vllm/pull/26291
+- **Status**: WIP upstream (attempted cherry-pick removed Oct 19, 2025 - didn't work correctly)
+- **Purpose**: Same goal - fix Eagle3 multi-channel streaming truncation
+- **Approach**: Similar token-by-token processing
+
+**When to Remove This Patch**:
+Once PR #26291 is properly merged upstream and verified working:
+1. Disable the patch in `patch_config.json`: `"harmony_streaming_patch": {"enabled": false}`
+2. Test with Eagle3 + tool calling + streaming
+3. If working correctly, remove the patch entirely
+4. Update documentation noting that upstream now handles this
+
+**Until then**: Keep this patch active, as it is the only solution that properly handles Eagle speculative decoding with tool calling in streaming mode.
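
Step 1 of the removal plan gates the patch via `patch_config.json`. A purely illustrative sketch of such a gate (hypothetical loader; the repository's actual extension loader is not shown here):

```python
import json
from pathlib import Path

# Hypothetical gate: read patch_config.json and only apply the patch when enabled.
config = json.loads(Path("patch_config.json").read_text())

if config.get("harmony_streaming_patch", {}).get("enabled", True):
    # Here the loader would monkeypatch
    # OpenAIServingChat.chat_completion_stream_generator with the Mantle version.
    pass
```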

---
