134 changes: 117 additions & 17 deletions tools/main/main.cpp
@@ -83,6 +83,104 @@ static void sigint_handler(int signo) {
}
#endif

class partial_formatter {
public:
enum output_type {
CONTENT,
REASONING,
};

struct output {
std::string formatted;
output_type type;
};

partial_formatter(const common_chat_syntax & syntax) : syntax_(syntax), had_reasoning_(false) {}

std::vector<output> operator()(const std::string & accumulated) {
common_chat_msg next = common_chat_parse(accumulated, true, syntax_);

auto diffs = common_chat_msg_diff::compute_diffs(previous_, next);
std::vector<output> result;
for (const auto & diff : diffs) {
if (!diff.reasoning_content_delta.empty()) {
result.push_back({diff.reasoning_content_delta, REASONING});
had_reasoning_ = true;
}
if (!diff.content_delta.empty()) {
if (had_reasoning_) {
result.push_back({"\n", REASONING});
had_reasoning_ = false;
}
result.push_back({diff.content_delta, CONTENT});
}
Comment on lines 106 to 119
Collaborator

Since the thinking tags are eaten, it's really hard to separate the thinking from the rest of the output.

Would it be an idea to highlight thinking in another color? That would require some additional logging API to check the status of color and/or logging with g_col.
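A minimal sketch of what that could look like, assuming raw ANSI escapes are written directly (a real patch would go through the console/color state behind g_col rather than hard-coding escapes):

    #include <cstdio>
    #include <string>

    // Hypothetical helper: tint reasoning output, leave regular content alone.
    // A real implementation would first check whether colors are enabled.
    static void write_output(const std::string & text, bool is_reasoning) {
        if (is_reasoning) {
            fprintf(stdout, "\033[90m%s\033[0m", text.c_str()); // dim gray for thinking
        } else {
            fprintf(stdout, "%s", text.c_str());
        }
        fflush(stdout);
    }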

Collaborator Author

Okay sounds good! What do you think about adding something like "Thinking..." when the reasoning starts as well?
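For instance, a hypothetical tweak to the diff loop in partial_formatter::operator() above (the marker text is made up):

    if (!diff.reasoning_content_delta.empty()) {
        if (!had_reasoning_) {
            result.push_back({"Thinking...\n", REASONING}); // announce start of reasoning
        }
        result.push_back({diff.reasoning_content_delta, REASONING});
        had_reasoning_ = true;
    }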

Collaborator Author

Please see my comment below re. the logging API. It doesn't make sense to tightly couple the notion of reasoning into the logging API, as there is a separation of concerns: application-specific output versus generic logging behavior.

Collaborator

Sorry, I didn't notice your comments until now. The coloring works very well, but we also need some kind of separation when colors are not enabled, though that's hard to define in a way that can't be confused with actual output.

LOG/write output will get a little jumbled now. Take, for example, the following output from a --verbose-prompt -p "..." run:

151644 -> '<|im_start|>'
   872 -> 'user'
   198 -> '
'
 36953 -> 'Pick'
   264 -> ' a'
  1967 -> ' Le'
 89260 -> 'etCode'
  8645 -> ' challenge'
   323 -> ' and'
 11625 -> ' solve'
   432 -> ' it'
   304 -> ' in'
Pick a LeetCode challenge and solve it in Python.
 13027 -> ' Python'
    13 -> '.'
151645 -> '<|im_end|>'
   198 -> '
'
151644 -> '<|im_start|>'
 77091 -> 'assistant'
   198 -> '
'
151667 -> '<think>'
   198 -> '
'

Not a major issue, but a little weird.

Collaborator Author

Hmm. We could call "common_log_pause()/resume()" while writing to the console. This would just require storing the log pointer in the console, which could be passed into the init procedure and be an optional argument.

Collaborator Author

Or, could use common_log_main() singleton directly and keep it in the console.cpp and always pause the log before output if that's desired.
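A rough sketch of that approach, assuming the common_log_main()/common_log_pause()/common_log_resume() API from common/log.h (the write itself is simplified):

    #include "log.h" // common_log_main, common_log_pause, common_log_resume

    #include <cstdio>
    #include <string>

    // Hypothetical console-side write: pause the shared log so its worker
    // thread cannot interleave log lines with direct console output.
    static void console_write(const std::string & text) {
        common_log * log = common_log_main();
        common_log_pause(log);
        fwrite(text.data(), 1, text.size(), stdout);
        fflush(stdout);
        common_log_resume(log);
    }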

Collaborator

Let's hold off on this until @ggerganov has weighed in on console::write in the first place as this is disruptive behavior.

Member

I think the console could indeed hold a reference to the common_log, but instead of pause/resume, it can simply call LOG_CNT to print stuff through the existing log instance.
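Something like this, using the LOG_CNT macro from common/log.h and the partial_formatter::output type from the diff above (the color escape is illustrative):

    // Hypothetical: route everything through the existing log instance so
    // ordering is handled by the log's own queue; LOG_CNT continues the
    // current line without emitting a new log prefix.
    static void console_print(const partial_formatter::output & out) {
        if (out.type == partial_formatter::REASONING) {
            LOG_CNT("\033[90m%s\033[0m", out.formatted.c_str()); // dim reasoning
        } else {
            LOG_CNT("%s", out.formatted.c_str());
        }
    }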

Collaborator Author (@bandoti, Oct 27, 2025)

@ggerganov Please see latest check-in. I updated to use the logging system when it is enabled, otherwise it writes directly to console. From my tests with -v and -co everything seems to be in sync now (fixing the jumbled output), and when --log-disable is specified the output stays intact.

When transitioning to user input there was a bit of a race condition, so I added a flush routine: the console now waits for the remaining log messages to come in before switching to the user prompt. Otherwise colors were spilling into the log messages.
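The flush could look roughly like this (entirely hypothetical names; the real synchronization would live inside common_log):

    #include <condition_variable>
    #include <mutex>

    // Hypothetical drain: the log worker signals once its queue is empty,
    // and the console blocks here before switching to the user prompt so
    // pending colored output cannot spill into later log lines.
    struct log_drain {
        std::mutex mtx;
        std::condition_variable cv;
        bool done = false;

        void wait() {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [this] { return done; });
        }

        void signal() {
            { std::lock_guard<std::mutex> lock(mtx); done = true; }
            cv.notify_one();
        }
    };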

}
previous_ = next;
return result;
}

private:
common_chat_syntax syntax_;
common_chat_msg previous_;
bool had_reasoning_;
};

class chat_formatter {
public:
chat_formatter(
std::vector<common_chat_msg> & chat_msgs,
const common_chat_templates_ptr & chat_templates,
const common_params & params)
: chat_msgs_(chat_msgs),
chat_templates_(chat_templates),
params_(params) {}

std::string operator()(const std::string & role, const std::string & content) {
common_chat_msg new_msg;
new_msg.role = role;
new_msg.content = content;
chat_msgs_.push_back(new_msg);

common_chat_templates_inputs cinputs;
cinputs.use_jinja = params_.use_jinja;
cinputs.messages = chat_msgs_;
cinputs.add_generation_prompt = (role == "user");
cinputs.reasoning_format = params_.reasoning_format;

cinputs.enable_thinking =
params_.use_jinja &&
params_.reasoning_budget != 0 &&
common_chat_templates_support_enable_thinking(chat_templates_.get());

common_chat_params cparams = common_chat_templates_apply(chat_templates_.get(), cinputs);

if (!partial_formatter_ptr_ && params_.reasoning_format != COMMON_REASONING_FORMAT_NONE) {
common_chat_syntax chat_syntax;
chat_syntax.format = cparams.format;
chat_syntax.reasoning_format = params_.reasoning_format;
chat_syntax.thinking_forced_open = cparams.thinking_forced_open;
chat_syntax.parse_tool_calls = false;
partial_formatter_ptr_ = std::make_unique<partial_formatter>(chat_syntax);
}

std::string formatted = cparams.prompt.substr(formatted_cumulative_.size());
formatted_cumulative_ = cparams.prompt;

LOG_DBG("formatted: '%s'\n", formatted.c_str());
return formatted;
}

partial_formatter * get_partial_formatter() { return partial_formatter_ptr_.get(); }
const std::string & get_full_prompt() const { return formatted_cumulative_; }

private:
std::vector<common_chat_msg> & chat_msgs_;
const common_chat_templates_ptr & chat_templates_;
const common_params & params_;
std::unique_ptr<partial_formatter> partial_formatter_ptr_;
std::string formatted_cumulative_;
};

int main(int argc, char ** argv) {
common_params params;
g_params = &params;
@@ -265,15 +363,7 @@ int main(int argc, char ** argv) {
std::vector<llama_token> embd_inp;

bool waiting_for_first_input = false;
auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
common_chat_msg new_msg;
new_msg.role = role;
new_msg.content = content;
auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
chat_msgs.push_back(new_msg);
LOG_DBG("formatted: '%s'\n", formatted.c_str());
return formatted;
};
chat_formatter chat_add_and_format(chat_msgs, chat_templates, params);

std::string prompt;
{
@@ -291,13 +381,9 @@
}

if (!params.system_prompt.empty() || !params.prompt.empty()) {
common_chat_templates_inputs inputs;
inputs.use_jinja = g_params->use_jinja;
inputs.messages = chat_msgs;
inputs.add_generation_prompt = !params.prompt.empty();

prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;
prompt = chat_add_and_format.get_full_prompt();
}

} else {
// otherwise use the prompt as is
prompt = params.prompt;
@@ -562,6 +648,12 @@ int main(int argc, char ** argv) {
embd_inp.push_back(decoder_start_token_id);
}

if (chat_add_and_format.get_partial_formatter()) {
for (const auto & msg : chat_msgs) {
LOG("%s\n", msg.content.c_str());
}
}

while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
// predict
if (!embd.empty()) {
@@ -709,6 +801,13 @@

if (params.conversation_mode && !waiting_for_first_input && !llama_vocab_is_eog(vocab, id)) {
assistant_ss << common_token_to_piece(ctx, id, false);

if (auto * formatter = chat_add_and_format.get_partial_formatter()) {
auto outputs = (*formatter)(assistant_ss.str());
for (const auto & out : outputs) {
LOG("%s", out.formatted.c_str());
}
}
}

// echo this to console
@@ -740,8 +839,9 @@ for (auto id : embd) {
for (auto id : embd) {
const std::string token_str = common_token_to_piece(ctx, id, params.special);

// Console/Stream Output
LOG("%s", token_str.c_str());
if (!chat_add_and_format.get_partial_formatter()) {
LOG("%s", token_str.c_str());
}

// Record Displayed Tokens To Log
// Note: Generated tokens are created one by one hence this check