
Commit 12bbc3f

ServeurpersoCom, allozaur, and ngxson authored
refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

  - Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
  - Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
  - Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing the legacy <think> parsing logic
  - Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

  Remove the streaming-mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats.

  - Reworked try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode
  - Parse reasoning tags before tool-call handling in the content-only and Llama 3.x formats so that inline <think> blocks are captured correctly
  - Changed the default reasoning_format from 'auto' to 'deepseek' for consistent behavior
  - Added a 'deepseek-legacy' option to preserve the old inline behavior when needed
  - Updated the CLI help and documentation to reflect streaming support
  - Added parser tests for inline <think>...</think> segments

  The parser now continues processing content after </think> closes instead of stopping, enabling proper separation of message.reasoning_content and message.content in both streaming and non-streaming modes. This fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

  - Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
  - Simplified chat streaming updates by removing the unused partial-thinking handling and persisting partial responses straight from currentResponse
  - Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

  Co-authored-by: Aleksander Grygier <[email protected]>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

  - Store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
  - Inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
  - Repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

  - Adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

  Co-authored-by: Aleksander Grygier <[email protected]>

* webui: add Storybook example for raw LLM output and scope the reasoning-format toggle per story

  - Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
  - Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

  Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
1 parent 9d08828 commit 12bbc3f
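Concretely, the parser now keeps `message.reasoning_content` and `message.content` separate even when only a prefix of the response has arrived. Below is a minimal sketch of the resulting behavior, adapted from the parser tests added in this commit; the header name and the exact partial-parse expectations are assumptions for illustration, not verbatim from the commit.

```cpp
// Sketch adapted from tests/test-chat-parser.cpp in this commit; the header
// name and the partial-parse expectations below are assumptions.
#include "chat.h"

#include <cassert>

int main() {
    common_chat_syntax syntax = {
        /* .format = */ COMMON_CHAT_FORMAT_CONTENT_ONLY,
        /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
        /* .reasoning_in_content = */ false,
        /* .thinking_forced_open = */ false,
        /* .parse_tool_calls = */ false,
    };

    // Non-streaming: the whole message is available at once.
    auto msg = common_chat_parse("<think>Plan the answer</think>Hello!", /* is_partial */ false, syntax);
    assert(msg.reasoning_content == "Plan the answer");
    assert(msg.content == "Hello!");

    // Streaming: "</thi" may still grow into "</think>", so the parser holds
    // the ambiguous bytes back instead of leaking them into either field.
    auto partial = common_chat_parse("<think>Plan the ans</thi", /* is_partial */ true, syntax);
    assert(partial.reasoning_content == "Plan the ans");
    assert(partial.content.empty());
    return 0;
}
```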

File tree

14 files changed: +265 additions, -420 deletions


common/arg.cpp

Lines changed: 2 additions & 1 deletion
@@ -3432,7 +3432,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         {"--reasoning-format"}, "FORMAT",
         "controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
         "- none: leaves thoughts unparsed in `message.content`\n"
-        "- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
+        "- deepseek: puts thoughts in `message.reasoning_content`\n"
+        "- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
         "(default: auto)",
         [](common_params & params, const std::string & value) {
             params.reasoning_format = common_reasoning_format_from_name(value);
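For context, the `--reasoning-format` value is mapped onto the `common_reasoning_format` enum via `common_reasoning_format_from_name()`. A rough sketch of the assumed shape of that helper (illustrative only; the `deepseek-legacy` enum name and the error handling are assumptions, not copied from the commit):

```cpp
// Illustrative sketch of the assumed name -> enum mapping used by the option
// handler above; the real helper lives in llama.cpp's common library and its
// exact enum names and error type may differ.
common_reasoning_format common_reasoning_format_from_name(const std::string & format) {
    if (format == "none")            { return COMMON_REASONING_FORMAT_NONE; }
    if (format == "auto")            { return COMMON_REASONING_FORMAT_AUTO; }
    if (format == "deepseek")        { return COMMON_REASONING_FORMAT_DEEPSEEK; }
    if (format == "deepseek-legacy") { return COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY; }
    throw std::invalid_argument("unknown reasoning format: " + format);
}
```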

common/chat-parser.cpp

Lines changed: 125 additions & 13 deletions
@@ -3,9 +3,12 @@
 #include "log.h"
 #include "regex-partial.h"
 
+#include <algorithm>
+#include <cctype>
 #include <optional>
 #include <stdexcept>
 #include <string>
+#include <string_view>
 #include <vector>
 
 using json = nlohmann::ordered_json;
@@ -166,6 +169,27 @@ void common_chat_msg_parser::consume_literal(const std::string & literal) {
 }
 
 bool common_chat_msg_parser::try_parse_reasoning(const std::string & start_think, const std::string & end_think) {
+    std::string pending_reasoning_prefix;
+
+    if (syntax_.reasoning_format == COMMON_REASONING_FORMAT_NONE) {
+        return false;
+    }
+
+    auto set_reasoning_prefix = [&](size_t prefix_pos) {
+        if (!syntax_.thinking_forced_open || syntax_.reasoning_in_content) {
+            return;
+        }
+        if (prefix_pos + start_think.size() > input_.size()) {
+            pending_reasoning_prefix.clear();
+            return;
+        }
+        // Capture the exact literal that opened the reasoning section so we can
+        // surface it back to callers. This ensures formats that force the
+        // reasoning tag open (e.g. DeepSeek R1) retain their original prefix
+        // instead of dropping it during parsing.
+        pending_reasoning_prefix = input_.substr(prefix_pos, start_think.size());
+    };
+
     auto handle_reasoning = [&](const std::string & reasoning, bool closed) {
         auto stripped_reasoning = string_strip(reasoning);
         if (stripped_reasoning.empty()) {
@@ -178,28 +202,116 @@ bool common_chat_msg_parser::try_parse_reasoning(const std::string & start_think
                 add_content(syntax_.reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK ? "</think>" : end_think);
             }
         } else {
+            if (!pending_reasoning_prefix.empty()) {
+                add_reasoning_content(pending_reasoning_prefix);
+                pending_reasoning_prefix.clear();
+            }
             add_reasoning_content(stripped_reasoning);
         }
     };
-    if (syntax_.reasoning_format != COMMON_REASONING_FORMAT_NONE) {
-        if (syntax_.thinking_forced_open || try_consume_literal(start_think)) {
-            if (auto res = try_find_literal(end_think)) {
-                handle_reasoning(res->prelude, /* closed */ true);
-                consume_spaces();
-                return true;
-            }
-            auto rest = consume_rest();
+
+    const size_t saved_pos = pos_;
+    const size_t saved_content_size = result_.content.size();
+    const size_t saved_reasoning_size = result_.reasoning_content.size();
+
+    auto restore_state = [&]() {
+        move_to(saved_pos);
+        result_.content.resize(saved_content_size);
+        result_.reasoning_content.resize(saved_reasoning_size);
+    };
+
+    // Allow leading whitespace to be preserved as content when reasoning is present at the start
+    size_t cursor = pos_;
+    size_t whitespace_end = cursor;
+    while (whitespace_end < input_.size() && std::isspace(static_cast<unsigned char>(input_[whitespace_end]))) {
+        ++whitespace_end;
+    }
+
+    if (whitespace_end >= input_.size()) {
+        restore_state();
+        if (syntax_.thinking_forced_open) {
+            auto rest = input_.substr(saved_pos);
             if (!rest.empty()) {
                 handle_reasoning(rest, /* closed */ !is_partial());
             }
-            // Allow unclosed thinking tags, for now (https://github.com/ggml-org/llama.cpp/issues/13812, https://github.com/ggml-org/llama.cpp/issues/13877)
-            // if (!syntax_.thinking_forced_open) {
-            //     throw common_chat_msg_partial_exception(end_think);
-            // }
+            move_to(input_.size());
             return true;
         }
+        return false;
+    }
+
+    cursor = whitespace_end;
+    const size_t remaining = input_.size() - cursor;
+    const size_t start_prefix = std::min(start_think.size(), remaining);
+    const bool has_start_tag = input_.compare(cursor, start_prefix, start_think, 0, start_prefix) == 0;
+
+    if (has_start_tag && start_prefix < start_think.size()) {
+        move_to(input_.size());
+        return true;
+    }
+
+    if (has_start_tag) {
+        if (whitespace_end > pos_) {
+            add_content(input_.substr(pos_, whitespace_end - pos_));
+        }
+        set_reasoning_prefix(cursor);
+        cursor += start_think.size();
+    } else if (syntax_.thinking_forced_open) {
+        cursor = whitespace_end;
+    } else {
+        restore_state();
+        return false;
+    }
+    while (true) {
+        if (cursor >= input_.size()) {
+            move_to(input_.size());
+            return true;
+        }
+
+        size_t end_pos = input_.find(end_think, cursor);
+        if (end_pos == std::string::npos) {
+            std::string_view remaining_view(input_.data() + cursor, input_.size() - cursor);
+            size_t partial_off = string_find_partial_stop(remaining_view, end_think);
+            size_t reasoning_end = partial_off == std::string::npos ? input_.size() : cursor + partial_off;
+            if (reasoning_end > cursor) {
+                handle_reasoning(input_.substr(cursor, reasoning_end - cursor), /* closed */ partial_off == std::string::npos && !is_partial());
+            }
+            move_to(input_.size());
+            return true;
+        }
+
+        if (end_pos > cursor) {
+            handle_reasoning(input_.substr(cursor, end_pos - cursor), /* closed */ true);
+        } else {
+            handle_reasoning("", /* closed */ true);
+        }
+
+        cursor = end_pos + end_think.size();
+
+        while (cursor < input_.size() && std::isspace(static_cast<unsigned char>(input_[cursor]))) {
+            ++cursor;
+        }
+
+        const size_t next_remaining = input_.size() - cursor;
+        if (next_remaining == 0) {
+            move_to(cursor);
+            return true;
+        }
+
+        const size_t next_prefix = std::min(start_think.size(), next_remaining);
+        if (input_.compare(cursor, next_prefix, start_think, 0, next_prefix) == 0) {
+            if (next_prefix < start_think.size()) {
+                move_to(input_.size());
+                return true;
+            }
+            set_reasoning_prefix(cursor);
+            cursor += start_think.size();
+            continue;
+        }
+
+        move_to(cursor);
+        return true;
     }
-    return false;
 }
 
 std::string common_chat_msg_parser::consume_rest() {
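The streaming safety of the loop above hinges on `string_find_partial_stop()`, which reports where a trailing fragment of the remaining text could still grow into the closing tag so the parser can withhold it. Here is a self-contained illustration of the idea; this is a conceptual sketch, not the helper's actual implementation in llama.cpp's common library.

```cpp
// Conceptual sketch of partial-stop detection (not the actual llama.cpp
// implementation): return the offset of the longest suffix of `text` that is
// a prefix of `stop`, or npos when no suffix could grow into `stop`.
#include <string>
#include <string_view>

static size_t find_partial_stop(std::string_view text, std::string_view stop) {
    size_t off = text.size() > stop.size() ? text.size() - stop.size() : 0;
    for (; off < text.size(); ++off) {
        size_t len = text.size() - off;
        if (text.compare(off, len, stop.substr(0, len)) == 0) {
            return off; // text[off..] might still complete `stop`
        }
    }
    return std::string::npos;
}

// find_partial_stop("Plan the ans</thi", "</think>") == 12, so everything
// before "</thi" can be emitted as reasoning while the tag stays buffered.
```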

common/chat.cpp

Lines changed: 3 additions & 0 deletions
@@ -1408,6 +1408,8 @@ static common_chat_params common_chat_params_init_apertus(const common_chat_temp
     return data;
 }
 static void common_chat_parse_llama_3_1(common_chat_msg_parser & builder, bool with_builtin_tools = false) {
+    builder.try_parse_reasoning("<think>", "</think>");
+
     if (!builder.syntax().parse_tool_calls) {
         builder.add_content(builder.consume_rest());
         return;
@@ -2862,6 +2864,7 @@ common_chat_params common_chat_templates_apply(
 }
 
 static void common_chat_parse_content_only(common_chat_msg_parser & builder) {
+    builder.try_parse_reasoning("<think>", "</think>");
     builder.add_content(builder.consume_rest());
 }
 
common/common.h

Lines changed: 1 addition & 1 deletion
@@ -433,7 +433,7 @@ struct common_params {
     std::string chat_template = ""; // NOLINT
     bool use_jinja = false; // NOLINT
     bool enable_chat_template = true;
-    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_AUTO;
+    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
     int reasoning_budget = -1;
     bool prefill_assistant = true; // if true, any trailing assistant message will be prefilled into the response
tests/test-chat-parser.cpp

Lines changed: 28 additions & 0 deletions
@@ -106,6 +106,34 @@ static void test_reasoning() {
         assert_equals("<think>Cogito</think>", builder.result().content);
         assert_equals("Ergo sum", builder.consume_rest());
     }
+    {
+        const std::string variant("content_only_inline_think");
+        common_chat_syntax syntax = {
+            /* .format = */ COMMON_CHAT_FORMAT_CONTENT_ONLY,
+            /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
+            /* .reasoning_in_content = */ false,
+            /* .thinking_forced_open = */ false,
+            /* .parse_tool_calls = */ false,
+        };
+        const std::string input = "<think>Pense</think>Bonjour";
+        auto msg = common_chat_parse(input, false, syntax);
+        assert_equals(variant, std::string("Pense"), msg.reasoning_content);
+        assert_equals(variant, std::string("Bonjour"), msg.content);
+    }
+    {
+        const std::string variant("llama_3_inline_think");
+        common_chat_syntax syntax = {
+            /* .format = */ COMMON_CHAT_FORMAT_LLAMA_3_X,
+            /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
+            /* .reasoning_in_content = */ false,
+            /* .thinking_forced_open = */ false,
+            /* .parse_tool_calls = */ false,
+        };
+        const std::string input = "<think>Plan</think>Réponse";
+        auto msg = common_chat_parse(input, false, syntax);
+        assert_equals(variant, std::string("Plan"), msg.reasoning_content);
+        assert_equals(variant, std::string("Réponse"), msg.content);
+    }
     // Test DeepSeek V3.1 parsing - reasoning content followed by "</think>" and then regular content
     {
         common_chat_syntax syntax = {
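Since the rewritten loop keeps scanning after a closed `</think>`, multiple inline reasoning segments now accumulate into a single `reasoning_content`. A hypothetical extra case in the same style as the tests above (not part of this commit, and assuming `add_reasoning_content` appends segments back to back) would exercise that path:

```cpp
// Hypothetical case (not in this commit): two inline reasoning blocks should
// both land in reasoning_content, with only the trailing text kept as content.
{
    const std::string variant("content_only_two_think_blocks");
    common_chat_syntax syntax = {
        /* .format = */ COMMON_CHAT_FORMAT_CONTENT_ONLY,
        /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
        /* .reasoning_in_content = */ false,
        /* .thinking_forced_open = */ false,
        /* .parse_tool_calls = */ false,
    };
    const std::string input = "<think>First</think><think>Second</think>Done";
    auto msg = common_chat_parse(input, false, syntax);
    assert_equals(variant, std::string("FirstSecond"), msg.reasoning_content);
    assert_equals(variant, std::string("Done"), msg.content);
}
```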

tools/server/README.md

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--no-slots` | disables slots monitoring endpoint<br/>(env: LLAMA_ARG_NO_ENDPOINT_SLOTS) |
 | `--slot-save-path PATH` | path to save slot kv cache (default: disabled) |
 | `--jinja` | use jinja template for chat (default: disabled)<br/>(env: LLAMA_ARG_JINJA) |
-| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)<br/>(default: auto)<br/>(env: LLAMA_ARG_THINK) |
+| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: deepseek)<br/>(env: LLAMA_ARG_THINK) |
 | `--reasoning-budget N` | controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
 | `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 | `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |

tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Lines changed: 3 additions & 17 deletions
@@ -1,7 +1,6 @@
 <script lang="ts">
 	import { getDeletionInfo } from '$lib/stores/chat.svelte';
 	import { copyToClipboard } from '$lib/utils/copy';
-	import { parseThinkingContent } from '$lib/utils/thinking';
 	import ChatMessageAssistant from './ChatMessageAssistant.svelte';
 	import ChatMessageUser from './ChatMessageUser.svelte';
 
@@ -47,26 +46,13 @@
 
 	let thinkingContent = $derived.by(() => {
 		if (message.role === 'assistant') {
-			if (message.thinking) {
-				return message.thinking;
-			}
-
-			const parsed = parseThinkingContent(message.content);
+			const trimmedThinking = message.thinking?.trim();
 
-			return parsed.thinking;
+			return trimmedThinking ? trimmedThinking : null;
 		}
 		return null;
 	});
 
-	let messageContent = $derived.by(() => {
-		if (message.role === 'assistant') {
-			const parsed = parseThinkingContent(message.content);
-			return parsed.cleanContent?.replace('<|channel|>analysis', '');
-		}
-
-		return message.content?.replace('<|channel|>analysis', '');
-	});
-
 	function handleCancelEdit() {
 		isEditing = false;
 		editedContent = message.content;
@@ -165,7 +151,7 @@
 			{editedContent}
 			{isEditing}
 			{message}
-			{messageContent}
+			messageContent={message.content}
 			onCancelEdit={handleCancelEdit}
 			onConfirmDelete={handleConfirmDelete}
 			onCopy={handleCopy}

tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Lines changed: 22 additions & 1 deletion
@@ -131,7 +131,11 @@
 		</div>
 	</div>
 {:else if message.role === 'assistant'}
-	<MarkdownContent content={messageContent || ''} />
+	{#if config().disableReasoningFormat}
+		<pre class="raw-output">{messageContent || ''}</pre>
+	{:else}
+		<MarkdownContent content={messageContent || ''} />
+	{/if}
 {:else}
 	<div class="text-sm whitespace-pre-wrap">
 		{messageContent}
@@ -203,4 +207,21 @@
 		background-position: -200% 0;
 	}
 }
+
+.raw-output {
+	width: 100%;
+	max-width: 48rem;
+	margin-top: 1.5rem;
+	padding: 1rem 1.25rem;
+	border-radius: 1rem;
+	background: hsl(var(--muted) / 0.3);
+	color: var(--foreground);
+	font-family:
+		ui-monospace, SFMono-Regular, 'SF Mono', Monaco, 'Cascadia Code', 'Roboto Mono', Consolas,
+		'Liberation Mono', Menlo, monospace;
+	font-size: 0.875rem;
+	line-height: 1.6;
+	white-space: pre-wrap;
+	word-break: break-word;
+}
 </style>

tools/server/webui/src/lib/components/app/chat/ChatSettings/ChatSettingsDialog.svelte

Lines changed: 6 additions & 0 deletions
@@ -148,6 +148,12 @@
 				key: 'showThoughtInProgress',
 				label: 'Show thought in progress',
 				type: 'checkbox'
+			},
+			{
+				key: 'disableReasoningFormat',
+				label:
+					'Show raw LLM output without backend parsing and frontend Markdown rendering to inspect streaming across different models.',
+				type: 'checkbox'
 			}
 		]
 	},

tools/server/webui/src/lib/constants/settings-config.ts

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@ export const SETTING_CONFIG_DEFAULT: Record<string, string | number | boolean> =
 	theme: 'system',
 	showTokensPerSecond: false,
 	showThoughtInProgress: false,
+	disableReasoningFormat: false,
 	keepStatsVisible: false,
 	askForTitleConfirmation: false,
 	pasteLongTextToFileLen: 2500,
@@ -76,6 +77,8 @@ export const SETTING_CONFIG_INFO: Record<string, string> = {
 	custom: 'Custom JSON parameters to send to the API. Must be valid JSON format.',
 	showTokensPerSecond: 'Display generation speed in tokens per second during streaming.',
 	showThoughtInProgress: 'Expand thought process by default when generating messages.',
+	disableReasoningFormat:
+		'Show raw LLM output without backend parsing and frontend Markdown rendering to inspect streaming across different models.',
 	keepStatsVisible: 'Keep processing statistics visible after generation finishes.',
 	askForTitleConfirmation:
 		'Ask for confirmation before automatically changing conversation title when editing the first message.',
