llama-cli: add support for reasoning #16603
Conversation
CISC left a comment
LGTM, but could be improved.
if (!diff.reasoning_content_delta.empty()) {
    result.push_back({diff.reasoning_content_delta, REASONING});
    had_reasoning_ = true;
}
if (!diff.content_delta.empty()) {
    if (had_reasoning_) {
        result.push_back({"\n", REASONING});
        had_reasoning_ = false;
    }
    result.push_back({diff.content_delta, CONTENT});
}
Since the thinking tags are eaten, it becomes really hard to separate thinking from the rest.
Would it be an idea to highlight thinking in another color? It would require some additional logging API to check the color status and/or logging with g_col.
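A minimal sketch of the coloring idea (not the actual patch; `colors_enabled` is a hypothetical stand-in for however llama-cli tracks the --color state):

```cpp
#include <string>

// Wrap a reasoning delta in an ANSI color so it stands out from regular content.
static std::string colorize_reasoning(const std::string & delta, bool colors_enabled) {
    if (!colors_enabled) {
        return delta; // no color requested: pass the reasoning text through untouched
    }
    // 90 = bright black ("faint"); reset afterwards so regular content is unaffected
    return "\033[90m" + delta + "\033[0m";
}
```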
Okay sounds good! What do you think about adding something like "Thinking..." when the reasoning starts as well?
Please see my comment below re. the logging API. It doesn't make sense to tightly couple the notion of reasoning into the logging API, as there is a separation of concerns: application-specific output versus generic logging behavior.
Sorry, didn't notice your comments until now. The coloring works very well, but we also need some kind of separation when colors are not enabled, though that's hard to define in a way that can't be confused with actual output.
LOG/write output will get a little jumbled now, take f.ex. the following output from a --verbose-prompt -p "..." run:
151644 -> '<|im_start|>'
872 -> 'user'
198 -> '
'
36953 -> 'Pick'
264 -> ' a'
1967 -> ' Le'
89260 -> 'etCode'
8645 -> ' challenge'
323 -> ' and'
11625 -> ' solve'
432 -> ' it'
304 -> ' in'
Pick a LeetCode challenge and solve it in Python.
13027 -> ' Python'
13 -> '.'
151645 -> '<|im_end|>'
198 -> '
'
151644 -> '<|im_start|>'
77091 -> 'assistant'
198 -> '
'
151667 -> '<think>'
198 -> '
'
Not a major issue, but a little weird.
Hmm. We could call "common_log_pause()/resume()" while writing to the console. This would just require storing the log pointer in the console, which could be passed into the init procedure and be an optional argument.
Or, we could use the common_log_main() singleton directly, keep it in console.cpp, and always pause the log before output if that's desired.
Let's hold off on this until @ggerganov has weighed in on console::write in the first place as this is disruptive behavior.
I think the console could indeed hold a reference to the common_log, but instead of pause/resume, it can simply call LOG_CNT to print stuff through the existing log instance.
@ggerganov Please see the latest check-in. I updated it to use the logging system when it is enabled; otherwise it writes directly to the console. From my tests with -v and -co everything seems to be in sync now (fixing the jumbled output), and when --log-disable is specified the output stays intact.
When transitioning to user input there was a bit of a race condition, so I added a flush routine: the console waits for the remaining log messages to come in before switching to the user prompt. Otherwise colors were spilling into the log messages.
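Roughly, the dispatch being described is something like this (a minimal sketch with illustrative names; LOG_CNT is stubbed so the snippet stands alone):

```cpp
#include <cstdio>
#include <string>

// Stand-in for the printf-style LOG_CNT macro from common/log.h, which continues
// the current log line without adding a prefix.
#define LOG_CNT(...) printf(__VA_ARGS__)

namespace console {
    // Hypothetical flag: set during init() when a common_log instance is attached.
    static bool log_attached = false;

    static void write(const std::string & text) {
        if (log_attached) {
            LOG_CNT("%s", text.c_str()); // go through the logger so ordering with log messages is preserved
        } else {
            fputs(text.c_str(), stdout); // --log-disable: write directly to the console
            fflush(stdout);
        }
    }
}
```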
Keeping the tags would be hard. I don't think it's much of an issue as long as we have visual separation; the main improvement here is enabling
If that's intended with jinja, then it's fine, but I would still suggest improving it in future. So long as LLMs can still hallucinate and have mismatched templates, it's always better to double-check.
@MaggotHATE Any chance you would provide an example of the intended testing scenario? Testing, of course, provides a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in. Side note: after getting this reasoning in, I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, and that is the realm of a scripting language. What "take two" will have is: (1) only a single toolcall.cpp/h inside the llama-cli project; (2) support for tool calls via the stdio transport only (because there are nice local nodejs proxies and so forth). This will make the tool calls nicely testable.
Any long, continuous dialog with a model would give a good understanding of whether it works correctly and generates all required special tokens; this is especially important with different sampling combinations and settings. For example, old Magistral used to have problems with its thinking tags, which should be fixed in 2509 (I have only tested it briefly, as the model works better without reasoning). Moreover, the idea of "hybrid" reasoning is still in the air, which makes differentiating and outlining the reasoning portions of generated text even more important. I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).
@MaggotHATE The MCP stdio transport basically execs a process and opens stdin/stdout channels to it. So it will amount to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE, there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% there already using some nodejs apps and so forth.
Do you render with the legacy templates or bypass templates altogether?
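For reference, a rough POSIX sketch of the stdio transport described above (illustrative only; the server command name is hypothetical and the JSON-RPC framing/error handling are simplified):

```cpp
#include <cstdio>
#include <string>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int to_child[2], from_child[2];
    if (pipe(to_child) != 0 || pipe(from_child) != 0) {
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        dup2(to_child[0],   STDIN_FILENO);   // child reads JSON-RPC requests on stdin
        dup2(from_child[1], STDOUT_FILENO);  // child writes JSON-RPC responses on stdout
        close(to_child[1]);
        close(from_child[0]);
        execlp("my-mcp-server", "my-mcp-server", (char *) nullptr); // hypothetical command line
        _exit(127);
    }
    close(to_child[0]);
    close(from_child[1]);

    // newline-delimited JSON-RPC request (framing simplified for the sketch)
    const std::string req = R"({"jsonrpc":"2.0","id":1,"method":"initialize","params":{}})" "\n";
    write(to_child[1], req.data(), req.size());

    char buf[4096];
    const ssize_t n = read(from_child[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("server replied: %s\n", buf);
    }

    close(to_child[1]);
    close(from_child[0]);
    waitpid(pid, nullptr, 0);
    return 0;
}
```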
Thanks for explaining, I don't have first-hand experience with it and clearly misunderstood it. It will be interesting to have it in
I use legacy-style templates in my own
@CISC There is a race condition happening when the colors are changed. I think we need to separate the main output from the log output. Any existing call to LOG(...) should write immediately, as it should only ever go to stdout. If callers for whatever reason want to redirect this, it should be done explicitly on the command line.
I fixed the issue by adding a … One caveat is just that the … Happy to discuss/make further adjustments. 😊
The guard against stripped reasoning is very nice, prevents crashes with several templates! However something is not quite right, f.ex. with
Yeah, it really needs to stand out from regular output; that's hard to accomplish though. I was toying with the idea of perhaps just a simple
Hmm. Perhaps we just leave it as "Thinking ..." for now, as it is the most "natural language" way; I would imagine folks will use color anyhow. In the future, if there's a reason for concern, we can change it. 😉
EDIT: I will create a couple of screenshots we can use for comparison.
@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:
Logic/Math symbols (most thematically appropriate): ∴ (U+2234) - "therefore" symbol - perfect for reasoning/conclusions
General delimiters (widely compatible): § (U+00A7) - section sign - traditional formal marker
Sorry for the slow response. The double arrow is perhaps not a bad one...
@CISC No worries on the delay! Merge conflicts on llama-cli should be minimal :) Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.
I think it should also be prepended to the regular output to better mark the separation, maybe even colored green to match the input prompt. Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this, but then again we can't easily restore the thinking tokens either...
Hmm. I think the usual way to handle this would be to write extra delimiters to stderr, so then the outputs would just be redirected in that case.
EDIT: So the idea is that the user calls llama-cli with
How do you mean? Please show an example.
I simply mean the following (to copy your example):
Yeah, at that point it would certainly make more sense to have some very explicit delimiters like that.
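For illustration, a minimal sketch of the stderr/stdout split being discussed (not the patch itself):

```cpp
#include <cstdio>
#include <string>

// Reasoning (and any delimiters around it) goes to stderr, so it stays visible
// on the terminal but is not captured by a plain stdout redirect.
static void emit_reasoning(const std::string & delta) {
    fprintf(stderr, "%s", delta.c_str());
}

// The actual response goes to stdout, so `llama-cli ... > answer.txt`
// captures only the final answer.
static void emit_content(const std::string & delta) {
    fprintf(stdout, "%s", delta.c_str());
    fflush(stdout);
}
```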
Okay so to summarize some of these ideas:
Of these methods, the ones that could be parsed if a conversation is written to a file would be (2), (3) and (4), or some variant of those. Option (1) is the most minimal for an interactive conversation, but impossible to parse from a file, and it takes a little more cognitive load to separate the reasoning from the actual response. Option (2) would probably make it easiest to see the entire block of reasoning at a glance, but it is somewhat verbose; a file written this way could be parsed for reasoning blocks line-by-line, which would work well. With color enabled none of this matters, so we're talking mainly about (a) running interactively without color; (b) running with --single-turn chat and sending the output to a file.
Yep, though b) I'm not sure how common that is, and a) I think the next PR after this should be changing
Yes I like that idea. 🙂 Save the user a command-line switch on every invocation!
Yes, sounds good! Another switch saver 🙂
Don't want to be too disruptive, but I think we should hold off on the current PR a little bit. Recently, I was thinking about completely refactoring it. The current CLI code is built around the initial logic for simple text completion, so I think it may be better to preserve its simplicity and move it to a new binary, for example:
I am happy with that decision. Though, even going down that path it may still make sense to keep this reasoning functionality in the completions example. Do you mean that
@ngxson I missed that - thanks for the reminder. I am OK with reorganizing the
Yes, the
@bandoti This PR only handles the formatting for reasoning, but I think it doesn't actually resolve the problem where some models want to go back and delete the reasoning contents from past messages. In the current PR, the … The first input is: … Generated part is: … When you now send the second message, we expect to go back and delete the … But in reality, …
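A sketch of the behavior being described, using a stand-in for common_chat_msg (which carries reasoning separately from content): before re-applying the chat template for the next turn, reasoning from earlier assistant messages would be dropped so only their final content stays in context.

```cpp
#include <string>
#include <vector>

// Stand-in for llama.cpp's common_chat_msg; only the fields relevant here.
struct chat_msg {
    std::string role;
    std::string content;
    std::string reasoning_content;
};

// Clear reasoning from past assistant turns before re-templating the history,
// matching how models are trained (no reasoning in prior messages).
static void strip_past_reasoning(std::vector<chat_msg> & messages) {
    for (auto & msg : messages) {
        if (msg.role == "assistant") {
            msg.reasoning_content.clear();
        }
    }
}
```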
Ah, interesting, thank you for the clarification. That makes sense not to litter the context with reasoning once the model makes a decision.
It's not just about saving some tokens; the bigger reason is that models are explicitly trained on input data which does not contain reasoning in past messages. While at inference time leaving the reasoning there has little effect on the overall result, it does effectively change the underlying logits.
This change adds a "partial formatter" that processes partially collected messages (like the server streaming logic) in order to render reasoning logic prior to EOG token arrival.
In addition, the chat_add_and_format lambda has been moved to a functor, and this now calls common_chat_templates_apply directly to allow more robust template-application options.
Logic has been put in place to suppress the system/prompt tags to clean up output.
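For context, a rough sketch of what the partial formatter does, using stand-in types and a toy tag parser (the actual PR goes through the common chat-parsing code, which also handles partially streamed tags):

```cpp
#include <string>
#include <utility>
#include <vector>

struct partial_msg {                 // stand-in for common_chat_msg
    std::string reasoning_content;
    std::string content;
};

// Toy parser: extracts reasoning between <think>...</think> and treats the rest
// as content. The real code uses the chat parser, so the tags are consumed.
static partial_msg parse_partial(const std::string & text) {
    partial_msg msg;
    const std::string open = "<think>", close = "</think>";
    const size_t b = text.find(open);
    if (b == std::string::npos) {
        msg.content = text;
        return msg;
    }
    const size_t e = text.find(close, b + open.size());
    if (e == std::string::npos) {
        msg.reasoning_content = text.substr(b + open.size());
    } else {
        msg.reasoning_content = text.substr(b + open.size(), e - (b + open.size()));
        msg.content = text.substr(e + close.size());
    }
    return msg;
}

struct partial_formatter {
    std::string buffer;
    partial_msg prev;

    // Feed one streamed token; returns (text, is_reasoning) pairs to display,
    // containing only what is new since the previous call.
    std::vector<std::pair<std::string, bool>> operator()(const std::string & token) {
        buffer += token;
        partial_msg cur = parse_partial(buffer);

        std::vector<std::pair<std::string, bool>> out;
        if (cur.reasoning_content.size() > prev.reasoning_content.size()) {
            out.push_back({cur.reasoning_content.substr(prev.reasoning_content.size()), true});
        }
        if (cur.content.size() > prev.content.size()) {
            out.push_back({cur.content.substr(prev.content.size()), false});
        }
        prev = std::move(cur);
        return out;
    }
};
```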
Example output: