20 changes: 19 additions & 1 deletion tools/server/server.cpp
@@ -2188,6 +2188,24 @@ struct server_context {

metrics.init();

+ // thinking is enabled if:
+ // 1. It's not explicitly disabled (reasoning_budget == 0)
+ // 2. The chat template supports it
+ bool enable_thinking = params_base.reasoning_budget != 0;
+ if (enable_thinking) {
+ common_chat_templates_inputs dummy_inputs;
Collaborator Author
This logic could move into a free-function in chat.* (something like common_chat_supports_enable_thinking).

Also, an alternate implementation of this would be to be more explicit and do some kind of switch on the subtype of common_chat_params, but that felt hard to maintain (similar to other places where an arch enum is required). Since this is a boot-time operation, it feels ok to do the extra template expansion to avoid another place where a developer would need to poke in architecture-specific logic.
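
For context, the proposed free function would presumably have roughly this shape (the name follows the suggestion above; the declaration is a guess for illustration, not part of this PR):

// hypothetical declaration for common/chat.h; returns true if toggling
// enable_thinking actually changes the rendered prompt
bool common_chat_supports_enable_thinking(const common_chat_templates * tmpls);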

+ common_chat_msg msg;
+ msg.role = "user";
+ msg.content = "test";
+ dummy_inputs.messages = {msg};
+ dummy_inputs.enable_thinking = false;
+ const auto rendered_no_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
+ dummy_inputs.enable_thinking = true;
+ const auto rendered_with_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
+ enable_thinking = rendered_no_thinking.prompt != rendered_with_thinking.prompt;
Member
I don't understand the logic of this check - seems we would disable thinking if the 2 rendered prompts are different. Why would that be?

Member
Nvm, got it. This should be moved to a helper function.

Collaborator Author
I'll double-check in a few, but I think the logic as written is that if the two prompts are different (with and without thinking), the enable_thinking bool is set to true. If they match, the assumption is that the enable_thinking flag is ignored by the template. This is certainly a heuristic, since some templates might only honor it under certain conditions (e.g. the funky Granite 3.x XOR conditions with documents and tools).

Collaborator Author
Alright, this is now moved to common/chat.* in the function common_chat_templates_support_enable_thinking.

+ }
+ SRV_INF("Enable thinking? %d\n", enable_thinking);
+
oai_parser_opt = {
/* use_jinja */ params_base.use_jinja,
/* prefill_assistant */ params_base.prefill_assistant,
@@ -2196,7 +2214,7 @@
/* common_chat_templates */ chat_templates.get(),
/* allow_image */ mctx ? mtmd_support_vision(mctx) : false,
/* allow_audio */ mctx ? mtmd_support_audio (mctx) : false,
- /* enable_thinking */ params_base.reasoning_budget != 0,
+ /* enable_thinking */ enable_thinking,
};
}

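Following up on the review thread above: a minimal sketch of what the common_chat_templates_support_enable_thinking helper in common/chat.* could look like, assuming it simply wraps the heuristic from the server.cpp diff and takes a raw common_chat_templates pointer (the exact signature in the PR may differ):

// minimal sketch for common/chat.cpp (assumes the declarations from common/chat.h)
#include "chat.h"

// Render a dummy conversation twice, toggling enable_thinking, and report
// whether the flag actually changes the template output.
bool common_chat_templates_support_enable_thinking(const common_chat_templates * chat_templates) {
    common_chat_templates_inputs dummy_inputs;
    common_chat_msg msg;
    msg.role    = "user";
    msg.content = "test";
    dummy_inputs.messages        = {msg};
    dummy_inputs.enable_thinking = false;
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    // if toggling the flag leaves the prompt unchanged, the template ignores it
    return rendered_no_thinking.prompt != rendered_with_thinking.prompt;
}

With such a helper in place, the boot-time check in server.cpp could presumably collapse to:

bool enable_thinking = params_base.reasoning_budget != 0 && common_chat_templates_support_enable_thinking(chat_templates.get());
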
16 changes: 13 additions & 3 deletions tools/server/utils.hpp
@@ -54,8 +54,8 @@ static T json_value(const json & body, const std::string & key, const T & defaul
if (body.contains(key) && !body.at(key).is_null()) {
try {
return body.at(key);
- } catch (NLOHMANN_JSON_NAMESPACE::detail::type_error const &) {
- LOG_WRN("Wrong type supplied for parameter '%s'. Expected '%s', using default value\n", key.c_str(), json(default_value).type_name());
+ } catch (NLOHMANN_JSON_NAMESPACE::detail::type_error const & err) {
+ LOG_WRN("Wrong type supplied for parameter '%s'. Expected '%s', using default value: %s\n", key.c_str(), json(default_value).type_name(), err.what());
return default_value;
}
} else {
@@ -708,6 +708,16 @@ static json oaicompat_chat_params_parse(
inputs.chat_template_kwargs[item.key()] = item.value().dump();
}

+ // parse the "enable_thinking" kwarg to override the default value
+ auto enable_thinking_kwarg = json_value(inputs.chat_template_kwargs, "enable_thinking", std::string(""));
+ if (enable_thinking_kwarg == "true") {
+ inputs.enable_thinking = true;
+ } else if (enable_thinking_kwarg == "false") {
+ inputs.enable_thinking = false;
+ } else if (!enable_thinking_kwarg.empty() && enable_thinking_kwarg[0] == '"') {
+ throw std::runtime_error("invalid type for \"enable_thinking\" (expected boolean, got string)");
+ }
+
// if the assistant message appears at the end of list, we do not add end-of-turn token
// for ex. this can be useful to modify the reasoning process in reasoning models
bool prefill_assistant_message = !inputs.messages.empty() && inputs.messages.back().role == "assistant" && opt.prefill_assistant;
@@ -724,7 +734,7 @@
/* TODO: test this properly */
inputs.reasoning_format = COMMON_REASONING_FORMAT_NONE;

- if ( (!inputs.enable_thinking) || inputs.chat_template_kwargs.find("enable_thinking") != inputs.chat_template_kwargs.end()) {
+ if ( inputs.enable_thinking ) {
throw std::runtime_error("Assistant response prefill is incompatible with enable_thinking.");
}

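A note on the "enable_thinking" kwarg parsing added above: because the chat_template_kwargs loop stores item.value().dump() rather than the raw JSON value, a JSON boolean arrives as the bare text true/false (matching the string comparisons), while a JSON string arrives with surrounding quotes, which is what the leading-'"' check rejects. A minimal standalone illustration, assuming nlohmann::json as the server already uses:

#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    // a boolean kwarg dumps to the bare text false, so the comparison above matches
    std::cout << nlohmann::json(false).dump() << "\n";   // prints: false
    // a string kwarg dumps with surrounding quotes, so it starts with '"' and is rejected
    std::cout << nlohmann::json("false").dump() << "\n"; // prints: "false"
}

This is also why the later prefill-assistant check can test inputs.enable_thinking directly: by the time it runs, any kwarg override has already been folded into the flag.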