
Conversation

@bandoti
Collaborator

@bandoti bandoti commented Oct 16, 2025

This change adds a "partial formatter" that processes partially collected messages (like the server's streaming logic does) in order to render reasoning content before the EOG token arrives.

In addition, the chat_add_and_format lambda has been moved into a functor, which now calls common_chat_templates_apply directly to allow more robust template-application options.

Logic has been put in place to suppress the system/prompt tags to clean up output.

Example output:

./build/bin/llama-cli.exe -m ./models/gpt-oss-20b-mxfp4.gguf -c 2048 -sys "You are a wizard" -p "please recite me a haiku about llamas" --jinja -co
(screenshot of example output)

@bandoti bandoti requested a review from ggerganov as a code owner October 16, 2025 01:32
@bandoti bandoti requested review from CISC and ggerganov and removed request for ggerganov October 16, 2025 01:32
Collaborator

@CISC CISC left a comment

LGTM, but could be improved.

Comment on lines 106 to 116
if (!diff.reasoning_content_delta.empty()) {
result.push_back({diff.reasoning_content_delta, REASONING});
had_reasoning_ = true;
}
if (!diff.content_delta.empty()) {
if (had_reasoning_) {
result.push_back({"\n", REASONING});
had_reasoning_ = false;
}
result.push_back({diff.content_delta, CONTENT});
}
Collaborator

Since the thinking tags are eaten it makes it really hard to separate thinking from the rest.

Would it be an idea to highlight thinking in another color? Would require some additional logging API to check status of color and/or logging with g_col.
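For illustration, a minimal sketch of the consuming side, assuming the formatter returns (text, kind) chunks like the snippet above; the type names and the escape codes are stand-ins, not the PR's actual API:

#include <cstdio>
#include <string>
#include <vector>

// Stand-in types mirroring the snippet above.
enum chunk_kind { REASONING, CONTENT };
struct chunk { std::string text; chunk_kind kind; };

// Print a batch of formatter output, coloring reasoning chunks (bright blue)
// when color is enabled and falling back to plain text otherwise.
static void print_chunks(const std::vector<chunk> & chunks, bool use_color) {
    for (const auto & c : chunks) {
        if (use_color && c.kind == REASONING) {
            printf("\033[94m%s\033[0m", c.text.c_str());
        } else {
            printf("%s", c.text.c_str());
        }
    }
    fflush(stdout);
}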

Collaborator Author

Okay sounds good! What do you think about adding something like "Thinking..." when the reasoning starts as well?

Collaborator Author

Please see my comment below re. the logging API. It doesn't make sense to tightly couple the notion of reasoning into the logging API, as there is a separation of concerns: application-specific output versus generic logging behavior.

Collaborator

Sorry, didn't notice your comments until now. The coloring works very well, but we do need some kind of separation when colors are not enabled as well, though that is hard to define so that it can't be confused with actual output.

LOG/write output will get a little jumbled now, take f.ex. the following output from a --verbose-prompt -p "..." run:

151644 -> '<|im_start|>'
   872 -> 'user'
   198 -> '
'
 36953 -> 'Pick'
   264 -> ' a'
  1967 -> ' Le'
 89260 -> 'etCode'
  8645 -> ' challenge'
   323 -> ' and'
 11625 -> ' solve'
   432 -> ' it'
   304 -> ' in'
Pick a LeetCode challenge and solve it in Python.
 13027 -> ' Python'
    13 -> '.'
151645 -> '<|im_end|>'
   198 -> '
'
151644 -> '<|im_start|>'
 77091 -> 'assistant'
   198 -> '
'
151667 -> '<think>'
   198 -> '
'

Not a major issue, but a little weird.

Collaborator Author

Hmm. We could call common_log_pause()/resume() while writing to the console. This would just require storing the log pointer in the console, which could be passed into the init procedure as an optional argument.

Collaborator Author

Or, we could use the common_log_main() singleton directly, keep it in console.cpp, and always pause the log before output if that's desired.

Collaborator

Let's hold off on this until @ggerganov has weighed in on console::write in the first place as this is disruptive behavior.

Member

I think the console could indeed hold a reference to the common_log, but instead of pause/resume, it can simply call LOG_CNT to print stuff through the existing log instance.
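Roughly the shape of that idea, sketched with hypothetical stand-in names so it compiles on its own (the real pieces would be the common_log instance and a LOG_CNT-style call mentioned above):

#include <cstdio>

// Hypothetical stand-in for the shared log instance; here it just forwards to
// stdout so the sketch is self-contained.
struct fake_log {
    void write(const char * text) { fputs(text, stdout); fflush(stdout); }
};

// The console keeps a pointer to the shared log and routes its writes through
// it, so console output is serialized with regular log messages instead of
// racing against the log's worker thread.
struct console_state {
    fake_log * log = nullptr;     // set once at console init; may stay null

    void write(const char * text) {
        if (log) {
            log->write(text);     // ordered together with other log messages
        } else {
            fputs(text, stdout);  // logging disabled: write directly
            fflush(stdout);
        }
    }
};

int main() {
    fake_log lg;
    console_state con;
    con.log = &lg;
    con.write("hello through the log\n");
}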

Collaborator Author

@bandoti bandoti Oct 27, 2025

@ggerganov Please see the latest check-in. I updated it to use the logging system when it is enabled; otherwise it writes directly to the console. From my tests with -v and -co everything seems to be in sync now (fixing the jumbled output), and when --log-disable is specified the output stays intact.

When transitioning to user input there was a bit of a race condition, so I added a flush routine: the console waits for the remaining log messages to come in before switching to the user prompt. Otherwise, colors were spilling into the log messages.

@MaggotHATE
Contributor

llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios. It is better to keep all special tags visible for testing/debugging purposes. In the case of reasoning, it should be visibly separated from the rest of the answer, as @CISC has suggested - it's hard to tell where the reasoning is in the example screenshot you've posted.

@CISC
Collaborator

CISC commented Oct 19, 2025

It is better to keep all special tags visible for testing/debugging purposes.

Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.

@MaggotHATE
Contributor

MaggotHATE commented Oct 19, 2025

Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.

If that's intended with jinja, then it's fine, but I would still suggest improving it in future. So long as LLMs can still hallucinate and have mismatched templates, it's always better to double-check.

@bandoti
Collaborator Author

bandoti commented Oct 19, 2025

llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios.

@MaggotHATE Any chance you would provide an example of the intended testing scenario? Testing, of course, provides a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.

Side note: after getting this reasoning in, I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, as that is the realm of a scripting language.

What "take two" will have is: (1) only a single toolcall.cpp/h inside the llama-cli project; (2) only support toolcalls via the stdio transport (because there are nice local nodejs proxies and so-forth).

This will add nice testability to the toolcalls.

@MaggotHATE
Contributor

Any chance you would provide an example of the intended testing scenario? Testing, of course, provides a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.

Any long, continuous dialog with a model would provide a good understanding of whether it works correctly and generates all required special tokens; this is especially important with different sampling combinations and settings. For example, old Magistral used to have problems with its thinking tags, which should be fixed in 2509 (I have only tested it briefly, as the model works better without reasoning). Moreover, the idea of "hybrid" reasoning is still in the air, which makes differentiating and outlining the reasoning portions of generated text even more important.

I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).

Side note: after getting this reasoning in, I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, as that is the realm of a scripting language.

If I understood you correctly, I would advise against introducing any network-related features into llama-cli and for making a separate tool instead. As of right now, it is fully private, with no way to connect to a network, which is a guarantee. Changing that would make llama-cli potentially less secure/private. Ah yes, that was changed with the remote downloading of models. Alas.

@bandoti
Collaborator Author

bandoti commented Oct 20, 2025

@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE, there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% there already using some nodejs apps and so forth.
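A rough POSIX-only sketch of what the stdio transport boils down to; the command line ("npx some-mcp-server") and the single request are placeholders, and a real client would run the MCP initialize handshake before calling any methods:

#include <cstdio>
#include <string>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int to_child[2];   // parent writes -> child's stdin
    int from_child[2]; // child's stdout -> parent reads
    if (pipe(to_child) != 0 || pipe(from_child) != 0) {
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        // child: wire the pipes to stdin/stdout, then exec the tool server
        dup2(to_child[0],   STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[1]);
        close(from_child[0]);
        execlp("npx", "npx", "some-mcp-server", (char *) nullptr); // placeholder command line
        _exit(127);
    }

    close(to_child[0]);
    close(from_child[1]);

    // parent: send one newline-delimited JSON-RPC message, read one line back
    const std::string req = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/list\"}\n";
    write(to_child[1], req.data(), req.size());

    FILE * in = fdopen(from_child[0], "r");
    char line[4096];
    if (in && fgets(line, sizeof(line), in)) {
        printf("server replied: %s", line);
    }

    close(to_child[1]);
    waitpid(pid, nullptr, 0);
    return 0;
}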

I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).

Do you render with the legacy templates or bypass templates altogether?

@MaggotHATE
Contributor

@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE, there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% there already using some nodejs apps and so forth.

Thanks for explaining, I don't have first-hand experience with it and clearly misunderstood it. It will be interesting to have it in llama-cli as (probably) the most straightforward way to test MCP capabilities.

Do you render with the legacy templates or bypass templates altogether?

I use legacy-style templates in my own llama-cli-based program, mostly for the convenience of controlling everything from one .json config. If I remember correctly, there is a similar idea of simple "profile" files for llama.cpp, and in that case .jinja templates would become less essential (you can store the template in that same file, along with sampling settings and model paths, for example). At the same time, ChatML, as the most popular template format, makes manual configuration almost pointless - it's too strict.

@bandoti
Collaborator Author

bandoti commented Oct 23, 2025

@CISC There is a race condition happening when the colors are changed using console::set_display(...). When this routine is called it sets the color immediately, but because llama-cli output is tightly coupled with the LOG macro, and log messages are queued using common_log_add(...), the output of the message itself is processed later on.

I think we need to separate the main output from the log output. Any existing call to LOG(...) should write immediately, as it should only ever go to stdout. If callers for whatever reason wanted to redirect this, it should be done explicitly on the command-line.

@bandoti
Collaborator Author

bandoti commented Oct 23, 2025

I fixed the issue by adding a console::write routine. Please see description for a screenshot of the new formatting with blue for the reasoning content.

One caveat is that --log-disable will no longer disable output, but this is easy to work around with llama-cli ... >/dev/null for folks who need to silence it.

Happy to discuss/make further adjustments. 😊

@CISC
Collaborator

CISC commented Oct 25, 2025

The guard against stripped reasoning is very nice, prevents crashes with several templates!

However, something is not quite right; f.ex. with Qwen3-4B-Thinking-2507 the following happens on the second prompt (after the initial -p):

[...]

151644 -> '<|im_start|>'
   872 -> 'user'
   198 -> '
'
151645 -> '<|im_end|>'
   198 -> '
'
151644 -> '<|im_start|>'
 77091 -> 'assistant'
   198 -> '
'
151667 -> '<think>'
   198 -> '
'
, the user just sent an empty message after my previous response. Hmm, I need to figure out what they want now.

@CISC
Collaborator

CISC commented Oct 28, 2025

In addition, I added "Thinking ..." prefix and "...\n\n" suffix, but I am open to changing those. Another possibility could be something like: "[Thinking: ... ]" which seems maybe easier to see, since the models tend to output ... more frequently than square brackets.

Yeah, it really needs to stand out from regular output; that's hard to accomplish, though. I was toying with the idea of perhaps just a simple < before reasoning and regular output, as opposed to the user's >.

@bandoti
Collaborator Author

bandoti commented Oct 28, 2025

Hmm. Perhaps we just leave it as Thinking ... ... for now, as it is the most "natural language" way; I would imagine folks will use color anyhow. In the future, if there's a reason for concern, we can change it. 😉

EDIT: I will create a couple screenshots we can use for comparison.

@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:

Logic/Math symbols (most thematically appropriate):

∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
∵ (U+2235) - "Because" symbol - good for premises/reasoning
⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
⇒ (U+21D2) - Double arrow - implies/entails

General delimiters (widely compatible):

§ (U+00A7) - Section sign - traditional formal marker
¶ (U+00B6) - Pilcrow - paragraph/section marker
※ (U+203B) - Reference mark - attention/note marker
⁂ (U+2042) - Asterism - decorative section break
◆ (U+25C6) - Black diamond
▸ (U+25B8) - Small triangle - often used for disclosure/expansion

@CISC CISC linked an issue Nov 8, 2025 that may be closed by this pull request
@CISC
Collaborator

CISC commented Nov 8, 2025

@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:

Logic/Math symbols (most thematically appropriate):

∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
∵ (U+2235) - "Because" symbol - good for premises/reasoning
⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
⇒ (U+21D2) - Double arrow - implies/entails

Sorry for the slow response. The double arrow is perhaps not a bad one...

@bandoti
Collaborator Author

bandoti commented Nov 8, 2025

@CISC No worries on the delay! Merge conflicts on llama-cli should be minimal :)

Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.

(screenshots comparing the candidate glyphs)

@CISC
Collaborator

CISC commented Nov 8, 2025

Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.

I think it should also be prepended to the regular output to better mark the separation, maybe even colored green to match the input prompt.

Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this, but then again we can't easily restore the thinking tokens either...

@bandoti
Collaborator Author

bandoti commented Nov 8, 2025

Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this

Hmm. I think the usual way to handle this would be to write extra delimiters to stderr, so then outputs would just be redirected in that case: llama-cli ... --single-turn 2>/dev/null > conversation.txt. Is that something we would want to support in the console as well?

EDIT: So the idea is that the user calls llama-cli with --single-turn to get one chat iteration and redirects the output to a file in a non-interactive session, while interactive mode keeps its current behavior. In other modes it wouldn't apply the reasoning partial formatter. It might be fine to keep the double-arrow in this case too, because there is no way with the current formatting to parse the trailing end of the reasoning, so it would be read by a human (or an AI agent, I guess).

I think it should also be prepended to the regular output to better mark the separation

How do you mean? Please show an example.

@bandoti
Collaborator Author

bandoti commented Nov 8, 2025

Here is a more explicit version which would allow the reasoning to be parsed after the fact while being slightly more human-readable than the raw templates. Reading it, though, I prefer the concise double-arrow format, but if we need parsing capability we'd have to consider this sort of option.

(screenshots of the more explicit format)

And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.

(screenshot of the tool-call format)

@CISC
Collaborator

CISC commented Nov 8, 2025

I think it should also be prepended to the regular output to better mark the separation

How do you mean? Please show an example.

I simply mean the following (to copy your example):

⇒ The user asks: ...

⇒ Because ...

> 

@CISC
Collaborator

CISC commented Nov 8, 2025

And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.

Yeah, at that point it would certainly make more sense to have some very explicit delimiters like that.

@bandoti
Collaborator Author

bandoti commented Nov 10, 2025

Okay so to summarize some of these ideas:

  1. Prefix the reasoning block with an "indicator" like ⇒.
  2. Prefix all reasoning content with ⇒.
  3. Wrap reasoning in "normalized" delimiters like "[Reasoning: ... ]".
  4. Prefix with "Thinking... " and possible suffix like "..." (which I suppose is the same as 3 but harder to read).

Of these methods, the ones that can be parsed, supposing a conversation is written to a file, would be (2), (3) and (4), or some variant of those. Option (1) is the most minimal for an interactive conversation, but impossible to parse from a file, and it takes a little more cognitive load to separate the reasoning from the actual response. Option (2) would probably make it easiest to see the entire block of reasoning at a glance, but it is somewhat verbose; when writing to a file, the reasoning blocks could then be parsed line-by-line, which would work well.

With color enabled none of this matters, so we're talking mainly about (a) running interactively without color; (b) running with --single-turn chat and sending the output to a file.
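A small sketch of option (2), assuming the partial formatter hands over reasoning deltas as plain strings (the struct and names here are illustrative, not the PR's code):

#include <cstdio>
#include <string>

// Prefix every reasoning line with "⇒ " as the deltas stream in, so a
// transcript redirected to a file stays parseable line-by-line.
struct reasoning_marker {
    bool at_line_start = true;

    void print_reasoning(const std::string & delta) {
        for (char ch : delta) {
            if (at_line_start) {
                fputs("⇒ ", stdout);
                at_line_start = false;
            }
            fputc(ch, stdout);
            if (ch == '\n') {
                at_line_start = true;
            }
        }
        fflush(stdout);
    }
};

int main() {
    reasoning_marker m;
    m.print_reasoning("The user asks for a haiku.\n");
    m.print_reasoning("Count the syllables");
    m.print_reasoning(" before answering.\n");
}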

@CISC
Collaborator

CISC commented Nov 10, 2025

With color enabled none of this matters, so we're talking mainly about (a) running interactively without color; (b) running with --single-turn chat and sending the output to a file.

Yep, though b) I'm not sure how common that is, and a) I think the next PR after this should be changing --color to work like --log-color, i.e. auto/on/off with auto as default.

@bandoti
Collaborator Author

bandoti commented Nov 10, 2025

I think the next PR after this should be changing --color to work like --log-color, i.e. auto/on/off with auto as default.

Yes I like that idea. 🙂 Save the user a command-line switch on every invocation!

@CISC
Collaborator

CISC commented Nov 27, 2025

@bandoti See #17524, make jinja default for llama-cli as well?

@bandoti
Collaborator Author

bandoti commented Nov 27, 2025

Yes, sounds good! Another switch saver 🙂

@ngxson
Collaborator

ngxson commented Nov 27, 2025

Don't want to be too disruptive, but I think we should hold off the current PR a little bit.

Recently, I was thinking about completely refactoring llama-cli to reuse llama-server. With the recent refactoring of the server, this should become easier and easier to do. The main benefit of this approach is that many features, including the current PR (jinja support) and even multimodal support, will be able to run directly from llama-cli. See this comment for a demo.

The current CLI code is built around the initial logic for simple text completion, so I think it may be better to preserve its simplicity and move it to a new binary, for example: llama-completion. CC @ggerganov, asking you again about this.

@bandoti
Collaborator Author

bandoti commented Nov 27, 2025

I am happy with that decision. Though, even going down that path it may still make sense to keep this reasoning functionality in the completions example. Do you mean that llama-completion would be llama-cli as it is today or something simpler?

@ggerganov
Member

ggerganov commented Nov 28, 2025

The current CLI code is built around the initial logic for simple text completion, so I think it may be better to preserve its simplicity and move it to a new binary, for example: llama-completion. CC @ggerganov, asking you again about this.

@ngxson I missed that - thanks for reminding.

I am OK with reorganizing the llama-cli tool and related code if you have specific ideas - feel free to proceed. If you make llama-completion the same as (or a simpler version of) the current llama-cli, that would be OK. It might be a bit redundant given that we have llama-simple, but we can decide later. One option is to demote llama-completion from a tool to an example and keep it around for experimentation (such as context extension, etc.).

@ngxson
Collaborator

ngxson commented Nov 28, 2025

Do you mean that llama-completion would be llama-cli as it is today or something simpler?

Yes, llama-completion will be the code base of llama-cli today minus all of the chat logic. The reason is that we don't want duplicated chat logic across the project (multiple implementations solving the same problem). In other words, what we don't want is llama-server, llama-cli, and llama-run each handling chat in a different way.

It might be a bit redundant given that we have llama-simple, but we can decide later.

llama-simple does not depend on libcommon; I think it's still worth keeping as a learning example (especially useful for people who want to build their own bindings of llama.cpp for other languages).

@ngxson
Collaborator

ngxson commented Nov 29, 2025

@bandoti This PR only handles the formatting for reasoning, but I think it doesn't actually resolve the problem where some models want to go back and delete the reasoning content from past messages.

In the current PR, llama-cli will still keep the reasoning content in memory and only add content on top. For example: https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF

The first input is:

<|User|>what is 123+456<|Assistant|>

Generated part is:

<think>
... (truncated)
</think>
\[
\boxed{579}
\]

When you now send the second message, we expect to go back and delete the <think>...</think> from the chat history, because the jinja template tells us to:

			{%- set content = message["content"] -%}
			{%- if "</think>" in content -%}
				{%- set content = content.split("</think>")[-1] -%}
			{%- endif -%}
			{{- "<|Assistant|>" + content + "<|end▁of▁sentence|>" -}}

But in reality, common_chat_format_single can't handle this case, so the reasoning will still be stored inside the chat history.
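For reference, the behavior the template expects, sketched in C++; where such a hook would live in llama-cli's history handling is left open here:

#include <cstdio>
#include <string>

// Drop everything up to and including the last "</think>" before a past
// assistant message is stored back into the chat history, mirroring the
// content.split("</think>")[-1] in the template above.
static std::string strip_reasoning(const std::string & content) {
    const std::string tag = "</think>";
    const size_t pos = content.rfind(tag);
    if (pos == std::string::npos) {
        return content; // no reasoning block, keep as-is
    }
    return content.substr(pos + tag.size());
}

int main() {
    const std::string msg = "<think>\n... (truncated)\n</think>\n\\[\n\\boxed{579}\n\\]";
    printf("%s\n", strip_reasoning(msg).c_str()); // prints only the boxed answer
}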

@bandoti
Collaborator Author

bandoti commented Nov 29, 2025

Ah, interesting, thank you for the clarification. It makes sense not to litter the context with reasoning once the model has made a decision.

@ngxson
Collaborator

ngxson commented Nov 29, 2025

It's not just about saving some tokens, but the bigger reason is that models are explicitly trained on input data which does not contain reasoning in past messages.

While at inference time leaving the reasoning there has little effect on the overall result, it does effectively change the underlying logits.

Successfully merging this pull request may close these issues.

Eval bug: Crash at second prompt