**Category**: OPTIONAL (has fallback in harmony streaming patch)
**Upstreamable**: Yes (minor utility addition)
#### Issue
Inconsistent usage information creation across streaming events, especially with cached_tokens support for prompt_tokens_details. Without a centralized helper, every streaming implementation duplicates the same usage creation logic.
#### Solution
Centralized static helper method added to `OpenAIServingChat` for consistent `UsageInfo` creation:
```python
@staticmethod
def _create_usage_info(
    ...
```
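Only the decorator and signature are shown above. As a minimal sketch of what such a helper can look like, assuming vLLM's `UsageInfo` and `PromptTokenUsageInfo` protocol models (the parameter names are illustrative assumptions, not the patch's actual signature):

```python
# Illustrative sketch only -- parameter names are assumptions, not the
# actual patched signature. Shown as it would sit inside OpenAIServingChat.
from typing import Optional

from vllm.entrypoints.openai.protocol import PromptTokenUsageInfo, UsageInfo


@staticmethod
def _create_usage_info(
    prompt_tokens: int,
    completion_tokens: int,
    cached_tokens: Optional[int] = None,
) -> UsageInfo:
    # Attach prompt_tokens_details only when the engine actually reported
    # a cached-token count, so non-caching deployments keep the field null.
    details = (PromptTokenUsageInfo(cached_tokens=cached_tokens)
               if cached_tokens is not None else None)
    return UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        prompt_tokens_details=details,
    )
```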
#### Benefits
- ✅ Consistent usage format across all streaming and non-streaming responses
- ✅ Proper cached_tokens support in prompt_tokens_details
- ✅ Centralized logic reduces code duplication
- ✅ Used by harmony streaming patch (with fallback if not present)
#### Note
The harmony streaming patch (`patched_stream_method.py`) attempts to import this helper from `OpenAIServingChat`, but includes a fallback implementation if it doesn't exist. This patch provides the "official" version on the class for cleanliness and consistency, but is not strictly required for functionality.
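A minimal sketch of that import-with-fallback pattern (the exact guard in `patched_stream_method.py` may differ):

```python
# Sketch of the fallback pattern described above; the patch's actual
# fallback body may differ.
try:
    from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
    _create_usage_info = OpenAIServingChat._create_usage_info
except (ImportError, AttributeError):
    # Helper not present on this vLLM build: use a local reimplementation.
    def _create_usage_info(prompt_tokens, completion_tokens, cached_tokens=None):
        ...  # same logic, duplicated locally
```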
### Harmony Streaming Patch (`patched_stream_method.py`)

#### Issue

Harmony streaming needs token-by-token processing with proper tool calling support. Upstream's batch processing can lose tool call details.
**Primary Problem**: Upstream harmony streaming uses batch processing that groups multiple tokens together before processing. When used with **speculative decoding (Eagle/Eagle3)**, the draft model generates multiple candidate tokens per step, and those tokens need careful per-token processing to maintain correct streaming behavior.

**Specific Failures with Batch Processing**:

1. **Tool Calling Broken**: In streaming mode, tools generate responses immediately without waiting for execution
2. **Multi-Channel Content Lost**: When Eagle3 switches between channels (final answer vs reasoning vs tool calls) mid-batch, only the last channel's content is preserved
3. **Token Truncation**: Intermediate speculative tokens are lost during channel transitions, causing incomplete streaming output

**Why This Happens with Speculative Decoding** (see the toy sketch after this list):

- The Eagle draft model generates 5-10 tokens per step
- Each token might belong to a different channel (reasoning/answer/tools)
- Batch processing only examines the final state after processing ALL tokens
- Intermediate channel transitions are lost → truncated output
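To make the failure concrete, here is a toy illustration; `ToyChannelParser` and the `<|channel:...|>` markers are stand-ins invented for this sketch, not the real harmony parser API:

```python
class ToyChannelParser:
    """Toy stand-in for a harmony-style channel parser (invented for this
    sketch, not the real openai_harmony API)."""

    def __init__(self):
        self.channel = None    # e.g. "analysis" (reasoning) vs "final" (answer)
        self.content = ""      # content accumulated for the current channel
        self.last_delta = ""   # content produced by the latest token

    def process(self, token: str) -> None:
        if token.startswith("<|channel:"):
            self.channel = token[len("<|channel:"):-len("|>")]
            self.content = ""      # channel switch: buffer starts over
            self.last_delta = ""
        else:
            self.content += token
            self.last_delta = token


# One speculative step: the draft tokens span a channel transition.
step_tokens = ["<|channel:analysis|>", "think...", "<|channel:final|>", "Hi!"]

# Batch processing: consume all tokens, inspect state once at the end.
# The "analysis" content is gone -- only the last channel survives.
batch = ToyChannelParser()
for tok in step_tokens:
    batch.process(tok)
print("batch sees:", batch.channel, repr(batch.content))  # final 'Hi!'

# Token-by-token processing: a delta is emitted after EVERY token, so the
# reasoning content is streamed before the channel switches.
stream = ToyChannelParser()
for tok in step_tokens:
    stream.process(tok)
    if stream.last_delta:
        print("delta on", stream.channel, ":", repr(stream.last_delta))
```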
#### Solution
Full method replacement (~888 lines) with **token-by-token processing** instead of batch grouping. In streaming mode, tool calls now properly wait for execution.
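The replaced method is far too long to reproduce here. As a toy sketch of just the tool-call aspect (the `<|tool_call|>` markers and function name are invented for illustration, not the patch's actual code), per-token processing lets tool-call argument tokens be buffered until the call is complete instead of leaking into the text stream:

```python
# Toy sketch of the tool-call fix; invented names, not the actual patch code.
from typing import Iterator

def stream_deltas(tokens: list[str]) -> Iterator[str]:
    """Yield streaming deltas token by token, buffering tool-call argument
    tokens until the call is complete instead of streaming them as text."""
    tool_buffer: list[str] = []
    in_tool_call = False
    for tok in tokens:
        if tok == "<|tool_call|>":
            in_tool_call = True
            continue
        if tok == "<|end_tool_call|>":
            in_tool_call = False
            # Emit the finished call as one unit so the client can execute
            # the tool before generation continues.
            yield "TOOL_CALL:" + "".join(tool_buffer)
            tool_buffer.clear()
            continue
        if in_tool_call:
            tool_buffer.append(tok)   # wait: don't stream partial arguments
        else:
            yield tok                 # normal content streams immediately

print(list(stream_deltas(
    ["Hello ", "<|tool_call|>", '{"city":', ' "Paris"}', "<|end_tool_call|>"])))
# ['Hello ', 'TOOL_CALL:{"city": "Paris"}']
```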