
Conversation

@ServeurpersoCom (Collaborator) commented Oct 2, 2025

Refactored try_parse_reasoning() to handle incremental parsing during streaming:
Parser improvements (a minimal sketch follows the list below):

Detects partial tags (e.g., when the stream cuts mid-tag, like <thi...)
Handles multiple consecutive reasoning segments within a single response
Preserves leading whitespace while detecting reasoning blocks
Continues processing content after </think> closes instead of stopping early
Works correctly with thinking_forced_open flag for grammar-constrained generation
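
For intuition, here is a minimal sketch of the incremental parsing idea, assuming a simplified standalone API (parse_result, partial_tag_len, and this try_parse_reasoning signature are illustrative stand-ins, not the actual llama.cpp interfaces):

```cpp
#include <algorithm>
#include <string>

struct parse_result {
    std::string reasoning;  // text inside <think>...</think>
    std::string content;    // everything outside the tags
    std::string held_back;  // partial tag withheld until the next chunk
    bool        in_reasoning = false;
};

// Longest suffix of s that is a proper prefix of tag, i.e. a tag that was
// cut mid-stream (like "<thi") and must be held back for the next chunk.
static size_t partial_tag_len(const std::string & s, const std::string & tag) {
    size_t max_n = std::min(s.size(), tag.size() - 1);
    for (size_t n = max_n; n > 0; --n) {
        if (s.compare(s.size() - n, n, tag, 0, n) == 0) {
            return n;
        }
    }
    return 0;
}

parse_result try_parse_reasoning(const std::string & text, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";

    parse_result res;
    res.in_reasoning = thinking_forced_open;  // block is open even with no tag on the wire
    size_t pos = 0;

    while (pos < text.size()) {
        const std::string & tag = res.in_reasoning ? close_tag : open_tag;
        std::string &       out = res.in_reasoning ? res.reasoning : res.content;

        size_t hit = text.find(tag, pos);
        if (hit == std::string::npos) {
            // No full tag: emit the rest (leading whitespace preserved),
            // but withhold a trailing partial tag such as "<thi".
            std::string rest    = text.substr(pos);
            size_t      partial = partial_tag_len(rest, tag);
            out += rest.substr(0, rest.size() - partial);
            res.held_back = rest.substr(rest.size() - partial);
            return res;
        }
        out += text.substr(pos, hit - pos);
        pos  = hit + tag.size();
        // Flip state and keep going: this supports multiple consecutive
        // reasoning segments and content that continues after </think>.
        res.in_reasoning = !res.in_reasoning;
    }
    return res;
}
```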

Integration changes:

Modified common_chat_parse_content_only() and common_chat_parse_llama_3_1() to invoke reasoning parsing before tool call handling
Changed default reasoning_format from auto to deepseek for consistent behavior
Added deepseek-legacy option for backwards compatibility (inlines tags in content)

Benefits

Clients no longer need custom CoT parsing logic for streaming mode
Consistent API behavior: reasoning_content and content properly separated in both streaming and non-streaming modes
Simplifies webui and SDK implementations
Universal: works across all reasoning formats, not just DeepSeek

When generation is launched from a template that ends the system prompt with <think> and thinking is enabled, the template sets thinking_forced_open = true; the parser then consumes all text preceding </think> as reasoning, even if the opening tag is never emitted by the model, as validated by the REASONINGCONTENT test.
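
With the hypothetical sketch above, the forced-open case reads like this:

```cpp
// thinking_forced_open: the opening <think> never appears on the wire,
// yet everything before </think> is still routed to reasoning.
parse_result r = try_parse_reasoning(
    "I should check the docs first.</think>Here is the answer.",
    /*thinking_forced_open=*/true);
// r.reasoning == "I should check the docs first."
// r.content   == "Here is the answer."
```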

Without --jinja, the server prohibits the use of tools/tool_choice, sets inputs.use_jinja = false, and computes enable_thinking = false. Templates therefore immediately close the <think> tag and no longer force it open, which disables the reasoning/content separation required by our new scenarios (the parser is still called on the stream, but it will only see regular content unless the model itself emits complete tags).

Testing
Added parser tests covering:

Inline <think>...</think> segments in CONTENT_ONLY format
Inline reasoning in LLAMA_3_X format
Multiple reasoning blocks in single response
Partial tag detection during streaming
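
Rendered against the hypothetical sketch above (not the real test harness), two of those scenarios look like:

```cpp
#include <cassert>

int main() {
    // multiple reasoning blocks in a single response
    parse_result multi = try_parse_reasoning("<think>a</think>x<think>b</think>y", false);
    assert(multi.reasoning == "ab");
    assert(multi.content   == "xy");

    // partial tag detection: a truncated "<thi" is held back, not leaked into content
    parse_result cut = try_parse_reasoning("hello <thi", false);
    assert(cut.content   == "hello ");
    assert(cut.held_back == "<thi");
    return 0;
}
```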

@ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch 2 times, most recently from 84c5532 to 928af2e on October 3, 2025 at 14:31
@ServeurpersoCom (Collaborator, Author) commented Oct 4, 2025

Sure, you can squint at curl -N chunk dumps, but this integrated UI turns that pain into a proper workflow: it shows the raw wire stream (no backend parsing, i.e. reasoning_format=none, and no frontend Markdown, just a raw HTML <pre>), so you can actually inspect behavior across models in real time, all in one click, with a tiny commit.

(Screenshot: spying on the raw GPT-OSS stream)

@ServeurpersoCom (Collaborator, Author) commented Oct 4, 2025

I have tested all the reasoning (and non-reasoning) models in my collection multiple times, including some tool call testing, without encountering any parsing bugs. I’d like to test more models. :)

unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/Mistral-Small-3.2-24B-Instruct-2506-Q6_K.gguf
unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q6_K.gguf
bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
mradermacher/BlackSheep-24B-i1-GGUF/BlackSheep-24B.Q8_0.gguf
mradermacher/XortronCriminalComputingConfig-i1-GGUF/XortronCriminalComputingConfig.Q8_0.gguf
bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF/TheDrummer_Cydonia-24B-v4.1-Q8_0.gguf
unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-Q6_K.gguf
mradermacher/Codestral-22B-v0.1-i1-GGUF/Codestral-22B-v0.1.Q8_0.gguf
unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q6_K.gguf
bartowski/TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF/TheDrummer_Big-Tiger-Gemma-27B-v3-Q6_K.gguf
unsloth/Seed-OSS-36B-Instruct-GGUF/Seed-OSS-36B-Instruct-Q5_K_M.gguf
mradermacher/deepseek-coder-33b-instruct-i1-GGUF/deepseek-coder-33b-instruct.i1-Q6_K.gguf
unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
mradermacher/aya-expanse-32b-i1-GGUF/aya-expanse-32b.i1-Q6_K.gguf
unsloth/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf
unsloth/GLM-Z1-32B-0414-GGUF/GLM-Z1-32B-0414-Q6_K.gguf (NOK: needs #16426, which requires another approach: multiple model-specific cases inside the generic fallback to detect trailing tags in Jinja templates)
unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf
bartowski/TheDrummer_GLM-Steam-106B-A12B-v1-GGUF/TheDrummer_GLM-Steam-106B-A12B-v1-Q4_K_M-00001-of-00002.gguf
mradermacher/EXAONE-4.0.1-32B-i1-GGUF/EXAONE-4.0.1-32B.i1-Q6_K.gguf
unsloth/QwQ-32B-GGUF/QwQ-32B-Q6_K.gguf
mradermacher/Qwen3-32B-i1-GGUF/Qwen3-32B.i1-Q6_K.gguf
unsloth/Qwen2.5-VL-32B-Instruct-GGUF/Qwen2.5-VL-32B-Instruct-Q5_K_M.gguf
mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
mradermacher/Qwen3-30B-A3B-Thinking-2507-i1-GGUF/Qwen3-30B-A3B-Thinking-2507.i1-Q6_K.gguf
lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
lmstudio-community/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf
unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf
unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF/Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
unsloth/OpenReasoning-Nemotron-32B-GGUF/OpenReasoning-Nemotron-32B-Q6_K.gguf
bartowski/TheDrummer_Valkyrie-49B-v2-GGUF/TheDrummer_Valkyrie-49B-v2-IQ4_NL.gguf
mradermacher/K2-Think-i1-GGUF/K2-Think.i1-Q6_K.gguf

ServeurpersoCom and others added 7 commits October 6, 2025 13:34
…p frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>
…ll tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
@ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch from 32212ff to 1e6beb6 on October 6, 2025 at 11:35
…ormat toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
@allozaur (Collaborator) commented Oct 6, 2025

@ServeurpersoCom unfortunately applying GH code suggestions bypasses the pre-commit hooks that format the code, so please run npm run format and push formatted code. Thank you! 🙏

@ggerganov (Member) left a comment

I did some testing and haven't spotted any issues. Let's wait for @ngxson approval and merge.

@ngxson (Collaborator) left a comment

Looks good overall, just some small comments

@allozaur (Collaborator) commented Oct 7, 2025

@ServeurpersoCom I've solved one conflict that appeared after merging #16282. Let's have some final testing round, build the fresh static output and after the manual tests and CI pass, let's ship this 🚀😄

@ggerganov (Member) commented

@allozaur Good to merge?

@ServeurpersoCom (Collaborator, Author) commented

Yes for this one!

@ServeurpersoCom (Collaborator, Author) commented

This PR reuses the existing logic from the current non-streaming implementation for reasoning content separation.
The risk is low: in the worst case, a closing tag might get escaped inside the content.
Clients will be easy to adapt, and their code will actually become cleaner.
It’s a good practice to establish as early as possible.
At worst, older clients will simply omit the reasoning content without breaking.

In contrast, PR #16426 requires a different approach.
I marked it as draft because it's riskier: I'll need to implement proper fallback paths to ensure that one model doesn't break another.

@allozaur (Collaborator) commented Oct 8, 2025

Okay, in this case let's ship it!

@ggerganov (Member) commented

I think we need to build the static output after the last merge?

@allozaur (Collaborator) commented Oct 8, 2025

Ah right, it still hasn't been updated, @ServeurpersoCom if u'd be so kind :)

@ServeurpersoCom (Collaborator, Author) commented

I’m ready to re-check master alone as soon as it’s merged ✅

@ggerganov ggerganov merged commit 12bbc3f into ggml-org:master Oct 8, 2025
68 checks passed
Labels: examples, server, testing