
Conversation

@ServeurpersoCom (Collaborator) commented Oct 2, 2025

Refactored try_parse_reasoning() to handle incremental parsing during streaming:
Parser improvements (a minimal sketch follows the list below):

Detects partial tags (e.g., when the stream cuts mid-tag, like <thi...)
Handles multiple consecutive reasoning segments within a single response
Preserves leading whitespace while detecting reasoning blocks
Continues processing content after </think> closes instead of stopping early
Works correctly with thinking_forced_open flag for grammar-constrained generation
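
For intuition, here is a minimal sketch of the incremental parsing idea, assuming a simplified standalone API (parse_result, partial_tag_len, and this try_parse_reasoning signature are illustrative stand-ins, not the actual llama.cpp interfaces):

```cpp
#include <algorithm>
#include <string>

struct parse_result {
    std::string reasoning;  // text inside <think>...</think>
    std::string content;    // everything outside the tags
    std::string held_back;  // partial tag withheld until the next chunk
    bool        in_reasoning = false;
};

// Longest suffix of s that is a proper prefix of tag, i.e. a tag that was
// cut mid-stream (like "<thi") and must be held back for the next chunk.
static size_t partial_tag_len(const std::string & s, const std::string & tag) {
    size_t max_n = std::min(s.size(), tag.size() - 1);
    for (size_t n = max_n; n > 0; --n) {
        if (s.compare(s.size() - n, n, tag, 0, n) == 0) {
            return n;
        }
    }
    return 0;
}

parse_result try_parse_reasoning(const std::string & text, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";

    parse_result res;
    res.in_reasoning = thinking_forced_open;  // block is open even with no tag on the wire
    size_t pos = 0;

    while (pos < text.size()) {
        const std::string & tag = res.in_reasoning ? close_tag : open_tag;
        std::string &       out = res.in_reasoning ? res.reasoning : res.content;

        size_t hit = text.find(tag, pos);
        if (hit == std::string::npos) {
            // No full tag: emit the rest (leading whitespace preserved),
            // but withhold a trailing partial tag such as "<thi".
            std::string rest    = text.substr(pos);
            size_t      partial = partial_tag_len(rest, tag);
            out += rest.substr(0, rest.size() - partial);
            res.held_back = rest.substr(rest.size() - partial);
            return res;
        }
        out += text.substr(pos, hit - pos);
        pos  = hit + tag.size();
        // Flip state and keep going: this supports multiple consecutive
        // reasoning segments and content that continues after </think>.
        res.in_reasoning = !res.in_reasoning;
    }
    return res;
}
```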

Integration changes:

Modified common_chat_parse_content_only() and common_chat_parse_llama_3_1() to invoke reasoning parsing before tool call handling
Changed default reasoning_format from auto to deepseek for consistent behavior
Added deepseek-legacy option for backwards compatibility (inlines tags in content)

Benefits

Clients no longer need custom CoT parsing logic for streaming mode
Consistent API behavior: reasoning_content and content properly separated in both streaming and non-streaming modes
Simplifies webui and SDK implementations
Universal: works across all reasoning formats, not just DeepSeek

When generation is launched from a template that ends the system prompt with <think> and thinking is enabled, the template sets thinking_forced_open = true; the parser then consumes all text preceding </think> as reasoning, even if the opening tag is never emitted by the model, as validated by the REASONINGCONTENT test.
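
With the hypothetical sketch above, the forced-open case reads like this:

```cpp
// thinking_forced_open: the opening <think> never appears on the wire,
// yet everything before </think> is still routed to reasoning.
parse_result r = try_parse_reasoning(
    "I should check the docs first.</think>Here is the answer.",
    /*thinking_forced_open=*/true);
// r.reasoning == "I should check the docs first."
// r.content   == "Here is the answer."
```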

Without --jinja, the server prohibits the use of tools/tool_choice, sets inputs.use_jinja = false, and computes enable_thinking = false. Templates therefore immediately close the <think> tag and no longer force it open, which disables the reasoning/content separation required by our new scenarios (the parser is still called on the stream, but it will only see regular content unless the model itself emits complete tags).

Testing
Added parser tests covering:

Inline <think>...</think> segments in CONTENT_ONLY format
Inline reasoning in LLAMA_3_X format
Multiple reasoning blocks in single response
Partial tag detection during streaming
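
Rendered against the hypothetical sketch above (not the real test harness), two of those scenarios look like:

```cpp
#include <cassert>

int main() {
    // multiple reasoning blocks in a single response
    parse_result multi = try_parse_reasoning("<think>a</think>x<think>b</think>y", false);
    assert(multi.reasoning == "ab");
    assert(multi.content   == "xy");

    // partial tag detection: a truncated "<thi" is held back, not leaked into content
    parse_result cut = try_parse_reasoning("hello <thi", false);
    assert(cut.content   == "hello ");
    assert(cut.held_back == "<thi");
    return 0;
}
```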

@ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch 2 times, most recently from 84c5532 to 928af2e on October 3, 2025 at 14:31
@ServeurpersoCom (Collaborator, Author) commented Oct 4, 2025

Sure, you can squint at curl -N chunk dumps, but this integrated UI turns that pain into a proper workflow: it shows the raw wire stream (no backend parsing, i.e. reasoning_format=none, and no frontend Markdown, just a raw HTML <pre>), so you can actually inspect behavior across models in real time, all in one click, with a tiny commit.

(Screenshot: spying on the raw GPT-OSS stream)

@ServeurpersoCom (Collaborator, Author) commented Oct 4, 2025

I have tested all the reasoning (and non-reasoning) models in my collection multiple times, including some tool call testing, without encountering any parsing bugs. I’d like to test more models. :)

unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/Mistral-Small-3.2-24B-Instruct-2506-Q6_K.gguf
unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q6_K.gguf
bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
mradermacher/BlackSheep-24B-i1-GGUF/BlackSheep-24B.Q8_0.gguf
mradermacher/XortronCriminalComputingConfig-i1-GGUF/XortronCriminalComputingConfig.Q8_0.gguf
bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF/TheDrummer_Cydonia-24B-v4.1-Q8_0.gguf
unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-Q6_K.gguf
mradermacher/Codestral-22B-v0.1-i1-GGUF/Codestral-22B-v0.1.Q8_0.gguf
unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q6_K.gguf
bartowski/TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF/TheDrummer_Big-Tiger-Gemma-27B-v3-Q6_K.gguf
unsloth/Seed-OSS-36B-Instruct-GGUF/Seed-OSS-36B-Instruct-Q5_K_M.gguf
mradermacher/deepseek-coder-33b-instruct-i1-GGUF/deepseek-coder-33b-instruct.i1-Q6_K.gguf
unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
mradermacher/aya-expanse-32b-i1-GGUF/aya-expanse-32b.i1-Q6_K.gguf
unsloth/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf
unsloth/GLM-Z1-32B-0414-GGUF/GLM-Z1-32B-0414-Q6_K.gguf (NOK: needs #16426, which requires another approach: multiple model-specific cases inside the generic fallback to detect trailing tags in Jinja templates)
unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf
bartowski/TheDrummer_GLM-Steam-106B-A12B-v1-GGUF/TheDrummer_GLM-Steam-106B-A12B-v1-Q4_K_M-00001-of-00002.gguf
mradermacher/EXAONE-4.0.1-32B-i1-GGUF/EXAONE-4.0.1-32B.i1-Q6_K.gguf
unsloth/QwQ-32B-GGUF/QwQ-32B-Q6_K.gguf
mradermacher/Qwen3-32B-i1-GGUF/Qwen3-32B.i1-Q6_K.gguf
unsloth/Qwen2.5-VL-32B-Instruct-GGUF/Qwen2.5-VL-32B-Instruct-Q5_K_M.gguf
mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
mradermacher/Qwen3-30B-A3B-Thinking-2507-i1-GGUF/Qwen3-30B-A3B-Thinking-2507.i1-Q6_K.gguf
lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
lmstudio-community/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf
unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf
unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF/Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
unsloth/OpenReasoning-Nemotron-32B-GGUF/OpenReasoning-Nemotron-32B-Q6_K.gguf
bartowski/TheDrummer_Valkyrie-49B-v2-GGUF/TheDrummer_Valkyrie-49B-v2-IQ4_NL.gguf
mradermacher/K2-Think-i1-GGUF/K2-Think.i1-Q6_K.gguf

ServeurpersoCom and others added 7 commits October 6, 2025 13:34
…p frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>
…ll tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
@ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch from 32212ff to 1e6beb6 on October 6, 2025 at 11:35
…ormat toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
@allozaur (Collaborator) commented Oct 6, 2025

@ServeurpersoCom unfortunately applying GH code suggestions bypasses the pre-commit hooks that format the code, so please run npm run format and push formatted code. Thank you! 🙏

@ggerganov (Member) left a comment

I did some testing and haven't spotted any issues. Let's wait for @ngxson approval and merge.

@ngxson (Collaborator) left a comment

Looks good overall, just some small comments

@allozaur (Collaborator) commented Oct 7, 2025

@ServeurpersoCom I've solved one conflict that appeared after merging #16282. Let's have some final testing round, build the fresh static output and after the manual tests and CI pass, let's ship this 🚀😄

@ggerganov (Member) commented

@allozaur Good to merge?

@ServeurpersoCom (Collaborator, Author) commented

Yes for this one!

@ServeurpersoCom (Collaborator, Author) commented

This PR reuses the existing logic from the current non-streaming implementation for reasoning content separation.
The risk is low: in the worst case, a closing tag might get escaped inside the content.
Clients will be easy to adapt, and their code will actually become cleaner.
It’s a good practice to establish as early as possible.
At worst, older clients will simply omit the reasoning content without breaking.

In contrast, PR #16426 requires a different approach.
I marked it as draft because it's riskier: I'll need to implement proper fallback paths to ensure that one model doesn't break another.

@allozaur (Collaborator) commented Oct 8, 2025

Okay, in this case let's ship it!

@ggerganov (Member) commented

I think we need to build the static output after the last merge?

@allozaur (Collaborator) commented Oct 8, 2025

Ah right, it still hasn't been updated, @ServeurpersoCom if u'd be so kind :)

@ServeurpersoCom (Collaborator, Author) commented

I’m ready to re-check master alone as soon as it’s merged ✅

@ggerganov ggerganov merged commit 12bbc3f into ggml-org:master Oct 8, 2025
68 checks passed
Labels: examples, server, testing