Skip to content

gpt-oss: implement harmony parsing #15181

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

aldehir
Copy link
Collaborator

@aldehir aldehir commented Aug 8, 2025

This is my attempt at implementing a harmony parser for gpt-oss.

Implementation

  • Reasoning format support - both auto and none are supported. When none, <|channel|>analysis<|message|>{reasoning content}<|end|> is added to the content.
  • Tool parsing - tool parsing and grammar implemented. If parse_tool_calls == false, tool calls are added to the content verbatim--which aligns with other implementations.
  • Commentary preamble - the harmony format allows for a preamble in the commentary channel. If present, it is added to the content.
  • Tests added - perhaps too many test cases. I wanted to ensure proper parsing of partial messages.

Remaining Work

  • The harmony format specifies that reasoning content from the assistant's last tool call should be included in the next prompt. This implementation assumes it comes from the client in reasoning_content. However, none of the clients I tested send it. A simple workaround is to use reasoning_format = none, or add the reasoning to the content in tool calls.

@aldehir aldehir requested a review from ngxson as a code owner August 8, 2025 18:51
@github-actions github-actions bot added testing Everything test related examples server labels Aug 8, 2025
@abc-nix
Copy link

abc-nix commented Aug 8, 2025

Thanks. It finally made it much easier to use tools in Cherry Studio. And it generates thinking boxes properly.

@dagbs
Copy link

dagbs commented Aug 8, 2025

Without the PR:
image

With the PR:
using gpt-oss-20b:f16 from unsloth with the updated gguf
image

It's better, easily more usable, but there might be some issues around tool calling still.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 9, 2025

@dagbs try setting function calling to native in open-webui
image

@aldehir aldehir force-pushed the feature/harmony-parser branch from d65e556 to 981886f Compare August 9, 2025 03:18
@victorb
Copy link

victorb commented Aug 9, 2025

I tried this PR yesterday and compared it to #15158 (+ my own fixes on top of that PR) and there was a couple of issues with this PR (that I was gonna share this morning), but since da67163 was pushed, it seems to finally work better than that PR. In my (albeit limited) testing, seems tool calling and it's formatting is working a lot better. Thanks a ton for this patch @aldehir!

All the unit tests pass as well, compared to the other PR, and code organization at a glance seems better too, but granted I'm not cpp expert, just an generalist.

@victorb
Copy link

victorb commented Aug 9, 2025

Hmm, seems to still be breaking sometimes, tried to understand why but to no avail. Most of the time, it works perfectly fine, but seems some edge-case breaks it. Running da67163 right now.

If I repeatably use the same weather example maybe 10 times, I end up getting a badly parsed (on llama.cpp's side) maybe once.

Good run looks like this:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    "Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
                ),
                reasoning_content: Some(
                    "The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
                ),
                tool_calls: [],
            },
        },
    ],
}
sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "ItCkpCeXs6jXspSwbFLidTHuATWM8MIj",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}   +25°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "ItCkpCeXs6jXspSwbFLidTHuATWM8MIj",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "d92fjsjS8L5xBMTxSCmSWcNyhgISwo4u",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}   +13°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "d92fjsjS8L5xBMTxSCmSWcNyhgISwo4u",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "0rIM7Xm598gzrRALjB4yMGZnuKRjOrSh",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Lima\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Lima: ⛅\u{fe0f}  +16°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "0rIM7Xm598gzrRALjB4yMGZnuKRjOrSh",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "reasoning_content": String("The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer."),
                "content": String("Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**:  ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*"),
            },
        },
    ],
    "created": Number(1754730237),
    "model": String("gpt-oss-20b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(440),
        "prompt_tokens": Number(361),
        "total_tokens": Number(801),
    },
    "id": String("chatcmpl-efjEpQIpXzIGe9j4F4gnC1X39B7mHOa3"),
    "__verbose": Object {
        "index": Number(0),
        "content": String("<|channel|>analysis<|message|>The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.\n\n<|end|><|start|>assistant<|channel|>final<|message|>Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*"),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-20b-MXFP4.gguf"),
        "tokens_predicted": Number(440),
        "tokens_evaluated": Number(361),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(1.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(40),
            "top_p": Number(1.0),
            "min_p": Number(1.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: high\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}   +25°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}   +13°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Lima\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Lima: ⛅\u{fe0f}  +16°C\\\\n\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(true),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(800),
        "timings": Object {
            "prompt_n": Number(49),
            "prompt_ms": Number(80.661),
            "prompt_per_token_ms": Number(1.6461428571428571),
            "prompt_per_second": Number(607.4806907923283),
            "predicted_n": Number(440),
            "predicted_ms": Number(2565.854),
            "predicted_per_token_ms": Number(5.831486363636364),
            "predicted_per_second": Number(171.48286691292645),
        },
    },
    "timings": Object {
        "prompt_n": Number(49),
        "prompt_ms": Number(80.661),
        "prompt_per_token_ms": Number(1.6461428571428571),
        "prompt_per_second": Number(607.4806907923283),
        "predicted_n": Number(440),
        "predicted_ms": Number(2565.854),
        "predicted_per_token_ms": Number(5.831486363636364),
        "predicted_per_second": Number(171.48286691292645),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    "Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
                ),
                reasoning_content: Some(
                    "The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
                ),
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: Here are the current conditions for the three cities, sorted by temperature (highest → lowest):

- **Barcelona**: ☀️ +25 °C
- **Lima**: ⛅️ +16 °C
- **Stockholm**: ☀️ +13 °C

*(Temperatures are taken from the latest weather data at the time of the query.)*

Meanwhile, a bad runs ends up with:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}

Full logs from bad run:

sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "uoYcwKVzv9haFDLHzVI9PcnAcICFcXmy",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}   +25°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "uoYcwKVzv9haFDLHzVI9PcnAcICFcXmy",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "qiY0di8Ec9BxfVuJa5Nw4flvAsEhs9DY",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}   +13°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "qiY0di8Ec9BxfVuJa5Nw4flvAsEhs9DY",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "content": String(" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n"),
            },
        },
    ],
    "created": Number(1754730110),
    "model": String("gpt-oss-20b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(12),
        "prompt_tokens": Number(310),
        "total_tokens": Number(322),
    },
    "id": String("chatcmpl-MKwVwT9hOE93A4IvYowdcn7f7mvFOvaR"),
    "__verbose": Object {
        "index": Number(0),
        "content": String(" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n"),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-20b-MXFP4.gguf"),
        "tokens_predicted": Number(12),
        "tokens_evaluated": Number(310),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(1.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(40),
            "top_p": Number(1.0),
            "min_p": Number(1.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: low\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}   +25°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}   +13°C\\\\n\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(true),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(321),
        "timings": Object {
            "prompt_n": Number(50),
            "prompt_ms": Number(78.391),
            "prompt_per_token_ms": Number(1.5678200000000002),
            "prompt_per_second": Number(637.8283221288157),
            "predicted_n": Number(12),
            "predicted_ms": Number(64.481),
            "predicted_per_token_ms": Number(5.3734166666666665),
            "predicted_per_second": Number(186.1013321753695),
        },
    },
    "timings": Object {
        "prompt_n": Number(50),
        "prompt_ms": Number(78.391),
        "prompt_per_token_ms": Number(1.5678200000000002),
        "prompt_per_second": Number(637.8283221288157),
        "predicted_n": Number(12),
        "predicted_ms": Number(64.481),
        "predicted_per_token_ms": Number(5.3734166666666665),
        "predicted_per_second": Number(186.1013321753695),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: to=functions.get_weather ​​

Seems to happen more often when reasoning_effort is set to low, compared to when it's set to high, but I'm not 100% sure I'm imagining this. But if true, could be inference problem from the model itself, where it gets the syntax wrong? I'm really not sure what's going on here.

@Mushoz
Copy link

Mushoz commented Aug 9, 2025

@victorb maybe use temperature= 0 and/or top-k 1? If inference is the issue, making it deterministic would fix it.

@victorb
Copy link

victorb commented Aug 9, 2025

@Mushoz

maybe use temperature= 0 and/or top-k 1? If inference is the issue, making it deterministic would fix it.

Running with these inference parameters for example:

{
        temperature: 0.0,
        top_p: 1.0,
        min_p: 0.0,
        top_k: 0,
        samplers: [
            "top_k",
            "top_p",
            "min_p",
            "temperature",
        ],
}

Seems to correctly give me deterministic responses, which once I get one good response, they always work well, but the ones that break, always break, so I guess useful for testing at the very least. Here's one example of broken parsing I'm currently getting, even with temperature=0 and top-k to various values:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=function\u{a0}\u{a0}...",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Beijing? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "h4fZmZGG2zWXlE6IOqBzTnzdrUFavQFu",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}  19°C (mocked)\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "h4fZmZGG2zWXlE6IOqBzTnzdrUFavQFu",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "lek3lo184KRjWObZv6y1rgkdKkBgoFj7",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}  26°C (mocked)\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "lek3lo184KRjWObZv6y1rgkdKkBgoFj7",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "content": String(" to=function\u{a0}\u{a0}..."),
            },
        },
    ],
    "created": Number(1754735169),
    "model": String("gpt-oss-120b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(7),
        "prompt_tokens": Number(314),
        "total_tokens": Number(321),
    },
    "id": String("chatcmpl-kXPt4WpoM4AUGLhbku8VlKSwZkktJUDA"),
    "__verbose": Object {
        "index": Number(0),
        "content": String(" to=function\u{a0}\u{a0}..."),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-120b-MXFP4.gguf"),
        "tokens_predicted": Number(7),
        "tokens_evaluated": Number(314),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(0.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(0),
            "top_p": Number(1.0),
            "min_p": Number(0.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_k"),
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: low\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Beijing? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}  19°C (mocked)\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}  26°C (mocked)\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(false),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(320),
        "timings": Object {
            "prompt_n": Number(52),
            "prompt_ms": Number(82.733),
            "prompt_per_token_ms": Number(1.5910192307692308),
            "prompt_per_second": Number(628.5279151003831),
            "predicted_n": Number(7),
            "predicted_ms": Number(54.931),
            "predicted_per_token_ms": Number(7.8472857142857135),
            "predicted_per_second": Number(127.4325972583787),
        },
    },
    "timings": Object {
        "prompt_n": Number(52),
        "prompt_ms": Number(82.733),
        "prompt_per_token_ms": Number(1.5910192307692308),
        "prompt_per_second": Number(628.5279151003831),
        "predicted_n": Number(7),
        "predicted_ms": Number(54.931),
        "predicted_per_token_ms": Number(7.8472857142857135),
        "predicted_per_second": Number(127.4325972583787),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=function\u{a0}\u{a0}...",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: to=function  ...

Tried setting top-k to 0, 1 and 100 and get the same results.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 9, 2025

@victorb thank you for that extensive testing. I can't seem to reproduce this on gpt-oss-20b. Can you provide the last entry in the server log where it begins parsing:

srv  update_chat_: Parsing chat message: <|channel|>analysis<|message|>We need to list sorted by temperature. Pr...

That will help me better understand the problem. It appears the model is emitting unicode space characters, but I wasn't aware the space symbol in the grammar would accept unicode. Still digging more into that.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 9, 2025

I managed to get gpt-oss-120b running, albeit slowly.

Looks like I missed a scenario where the model outputs the recipient (to=) in the role and the message in a commentary or analysis channel:

<|start|>assistant to=functions.get_weather<|channel|>commentary <|constrain|>json<|message|>{ ... }

I have yet to see the gpt-oss-20b model exhibit this behavior, but it is documented in the harmony docs.

I updated the parsing and grammar rule to handle this. It should at least parse the tool calls now.

I found performance degrades by the third call. I get queries to "Lima??", "Lima?", or some variation with garbage at the end. However, if I pass reasoning_content to every message, I get good results. I was able to extend the query to 5 cities by doing so.

Give cf9a0d6 a shot.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 10, 2025

For those interested, I implemented a basic cache for reasoning content in my fork aldehir#1.

Without prior reasoning content for tool calls, gpt-oss seems to perform poorly on multi-turn scenarios. No client I know passes reasoning_content back to the model, so a cache on the server end is the easiest way to address it. If this PR gets accepted, I'll submit it for review.

@victorb
Copy link

victorb commented Aug 10, 2025

It should at least parse the tool calls now.

Awesome @aldehir, did a bunch of testing yesterday with 20b and 120b and tool parsing didn't fail once! 🎉

I do see the same inference quality degradation after a few messages, mainly hallucinations for the tool arguments (calling get_weather("...") or get_weather("?") for example) with both 20b and 120b.

However, trying out the --reasoning_cache quickly for ~30 minutes (before going offline for a week!) seems to alleviate that particular issue, nicely done in figuring that out, seems to help a lot!

Overall, seems solid to me now. Since cf9a0d6, the parsing of Harmony seems complete in all the examples I've tried to run, everything goes into the right place and tool calls/responses all look correct now.

@tarruda
Copy link

tarruda commented Aug 10, 2025

@aldehir using your gpt-oss-inject-reasoning branch, there seems to be something wrong with tool calling: When I provide it with more than one tool, it always seems to call the first tool. For example, in my local CLI agent, I give it the prompt: "explore this project" and two tools:

  • list_files
  • read_file

If read_file appears first in the tool list, then it reasons: "I need to use the list_files tool", followed by a read_file call.

If list_files appears first, then it calls it successfully. Once it sees the tool return and it contains a README.md, it follows up with a reasoning: "There's a readme, I need to call read_file", and then it calls list_files again.

I wonder if this is related to the grammar generation for the tool calls which is somehow constraining it to always use the first tool.

BTW this is the first model I've tried with llama-server that can mix reasoning with tool calls, so it is definitely in the right direction!

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 10, 2025

When I provide it with more than one tool, it always seems to call the first tool.

@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork.

@tarruda
Copy link

tarruda commented Aug 10, 2025

@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork.

Thanks a lot, seems to be working perfectly now!

@tarruda
Copy link

tarruda commented Aug 10, 2025

I've also been playing with calling tools in its CoT and confirm it is working correctly. For example, if I provide this tool to the LLM:

async def arithmetic(code: str) -> str:
    """
    Evaluates arithmetic expression and returns the result.

    ANY arithmetic questions (no matter how trivial) should make use of this tool in your chain of thought. Always return this tool's response even if it is wrong!
    """
    return f"{eval("5 + 5")}"

Then it will always use it during reasoning.

There's something I'm wondering though: Looking at the template, I can see it tells the LLM about 2 possible builtin tools it can use in its CoT (browser and python). I imagine that GPT-OSS was trained to make use of these tools. What I'm wondering is why these tools are treated specially, and if it makes sense to "merge" the user provided tools with these builtin tools.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 10, 2025

@tarruda those tools cause the model to produce different type constraints other than json. I believe the Python one produces code. E.g.

<|start|>analysis to=python <|constrain|>code<|message|>...<|end|>

So I think they need their own grammar rules.

From what I can tell, it seems those tools are intended to be resolved internally and not sent back to the user. For example, the Python one mentions a /mnt/data directory where the model can save files, such as graphs. There is no way to obtain this file if the tool call is sent to the user, so OpenAI probably handles it internally.

I suppose it could process the builtins, generate tool calls, and any interested parties can implement middleware to intercept the calls.

@chaserhkj
Copy link

@aldehir I am experimenting with your --reasoning-cache flag with latest open-webui front end and it gave me a 500 error from the chat template provided by unsloth, stating Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.

Does this mean I need to ditch the GGUF embedded template and use --chat-template flag for --reasoning-cache to make sense? I also feel that the unsloth chat template is trying to do the same thing of embedding reasoning history, if that is true and possible probably we should just do that on llama.cpp's end in the template as well instead of adding a new flag?

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 10, 2025

@chaserhkj that's because open-webui injects the reasoning itself into the content. I added a fix in my fork to address that. But for open-webui, this PR should be enough.

I don't think the reasoning cache has gotten enough use for me to recommend it, I simply wanted to show that the model performs better when you pass along its reasoning in tool calls. If you'd like to keep using it, feel free to resume the conversation there. Like I said, for open-webui it shouldn't be necessary.

@ahmetkca
Copy link

@chaserhkj that's because open-webui injects the reasoning itself into the content. I added a fix in my fork to address that. But for open-webui, this PR should be enough.

I don't think the reasoning cache has gotten enough use for me to recommend it, I simply wanted to show that the model performs better when you pass along its reasoning in tool calls. If you'd like to keep using it, feel free to resume the conversation there. Like I said, for open-webui it shouldn't be necessary.

Yes we are expected to drop analysis channel when the last channel ends with final which means we are essentially pruning reasoning after inference ends [1]. However, the only exception is when the channel is commentary to function/tool call since it is trained to call tools as part of its chain-of-thought [2]

I am trying out gpt-oss 120b right now and lack of harmony response format parsing is one of the biggest obstacle to using this model.

@prd-tuong-nguyen
Copy link

prd-tuong-nguyen commented Aug 11, 2025

Hi everyone, are there any Docker image releases that support this feature?

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis one thing to consider is that this model was most likely trained to use the tools in https://github.com/openai/codex. The diff looks awfully similar to their apply_patch tool.

I'm glad it helps.

@createthis
Copy link

@aldehir yup. That's certainly it: https://github.com/openai/codex/blob/5f8984aa7d550955eb5f894d5c29adc2b9901da2/codex-rs/apply-patch/apply_patch_tool_instructions.md

Cool, I may make an adapter on my end.

It successfully completed a moderately difficult agentic programming task. Here's the startup command I used:

./build/bin/llama-server \
    --model /data/gpt-oss-120b-GGUF/ggml-org/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --alias gpt-oss-120b-mxfp4 \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 37 \
    -ot "exps.*\.blk.*\.ffn_.*=CUDA0" \
    --no-op-offload \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --chat-template-kwargs '{"reasoning_effort": "high"}' \
    --reasoning-format none \
    --port 11434

Performance is excellent. I look forward to using this more in the future.

@aldehir aldehir force-pushed the feature/harmony-parser branch from 6343a7f to 04e1626 Compare August 12, 2025 02:02
@aldehir
Copy link
Collaborator Author

aldehir commented Aug 12, 2025

Reverted thinking tags, and set the tokens back to user-defined.

ref: #15230 (comment)

@gopinath87607
Copy link

gopinath87607 commented Aug 12, 2025

is this pr is done ?

update
tested this pr its working well on the tool calling.

@ggerganov ggerganov added the hot Something that is hot label Aug 12, 2025
@ggerganov
Copy link
Member

I am comparing the chat template from this branch with the current latest version in https://huggingface.co/openai/gpt-oss-20b/blob/main/chat_template.jinja. Should we update it to match?

diff
diff models/templates/openai-gpt-oss-120b.jinja ~/Downloads/chat_template.txt 
87,88c87
<             {{- "{
< " }}
---
>             {{- "{\n" }}
110,115c109,110
<     {{- "## " + namespace_name + "
< 
< " }}
<     {{- "namespace " + namespace_name + " {
< 
< " }}
---
>     {{- "## " + namespace_name + "\n\n" }}
>     {{- "namespace " + namespace_name + " {\n\n" }}
118,119c113
<         {{- "// " + tool.description + "
< " }}
---
>         {{- "// " + tool.description + "\n" }}
122,123c116
<             {{- "(_: {
< " }}
---
>             {{- "(_: {\n" }}
126,127c119
<                     {{- "// " + param_spec.description + "
< " }}
---
>                     {{- "// " + param_spec.description + "\n" }}
145,146c137
<                     {{- ",
< " }}
---
>                     {{- ",\n" }}
148,149c139
<                     {{- "
< " }}
---
>                     {{- ",\n" }}
152,154c142
<             {{- "}) => any;
< 
< " }}
---
>             {{- "}) => any;\n\n" }}
156,158c144
<             {{- "() => any;
< 
< " }}
---
>             {{- "() => any;\n\n" }}
166,239c152,185
<         {{- "## browser
< 
< " }}
<         {{- "// Tool for browsing.
< " }}
<         {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.
< " }}
<         {{- "// Cite information from the tool using the following format:
< " }}
<         {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.
< " }}
<         {{- "// Do not quote more than 10 words directly from the tool output.
< " }}
<         {{- "// sources=web (default: web)
< " }}
<         {{- "namespace browser {
< 
< " }}
<         {{- "// Searches for information related to `query` and displays `topn` results.
< " }}
<         {{- "type search = (_: {
< " }}
<         {{- "query: string,
< " }}
<         {{- "topn?: number, // default: 10
< " }}
<         {{- "source?: string,
< " }}
<         {{- "}) => any;
< 
< " }}
<         {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.
< " }}
<         {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.
< " }}
<         {{- "// If `cursor` is not provided, the most recent page is implied.
< " }}
<         {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.
< " }}
<         {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.
< " }}
<         {{- "// Use this function without `id` to scroll to a new location of an opened page.
< " }}
<         {{- "type open = (_: {
< " }}
<         {{- "id?: number | string, // default: -1
< " }}
<         {{- "cursor?: number, // default: -1
< " }}
<         {{- "loc?: number, // default: -1
< " }}
<         {{- "num_lines?: number, // default: -1
< " }}
<         {{- "view_source?: boolean, // default: false
< " }}
<         {{- "source?: string,
< " }}
<         {{- "}) => any;
< 
< " }}
<         {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.
< " }}
<         {{- "type find = (_: {
< " }}
<         {{- "pattern: string,
< " }}
<         {{- "cursor?: number, // default: -1
< " }}
<         {{- "}) => any;
< 
< " }}
<         {{- "} // namespace browser
< 
< " }}
---
>         {{- "## browser\n\n" }}
>         {{- "// Tool for browsing.\n" }}
>         {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
>         {{- "// Cite information from the tool using the following format:\n" }}
>         {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
>         {{- "// Do not quote more than 10 words directly from the tool output.\n" }}
>         {{- "// sources=web (default: web)\n" }}
>         {{- "namespace browser {\n\n" }}
>         {{- "// Searches for information related to `query` and displays `topn` results.\n" }}
>         {{- "type search = (_: {\n" }}
>         {{- "query: string,\n" }}
>         {{- "topn?: number, // default: 10\n" }}
>         {{- "source?: string,\n" }}
>         {{- "}) => any;\n\n" }}
>         {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
>         {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
>         {{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
>         {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
>         {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
>         {{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
>         {{- "type open = (_: {\n" }}
>         {{- "id?: number | string, // default: -1\n" }}
>         {{- "cursor?: number, // default: -1\n" }}
>         {{- "loc?: number, // default: -1\n" }}
>         {{- "num_lines?: number, // default: -1\n" }}
>         {{- "view_source?: boolean, // default: false\n" }}
>         {{- "source?: string,\n" }}
>         {{- "}) => any;\n\n" }}
>         {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
>         {{- "type find = (_: {\n" }}
>         {{- "pattern: string,\n" }}
>         {{- "cursor?: number, // default: -1\n" }}
>         {{- "}) => any;\n\n" }}
>         {{- "} // namespace browser\n\n" }}
243,251c189,191
<         {{- "## python
< 
< " }}
<         {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).
< 
< " }}
<         {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.
< 
< " }}
---
>         {{- "## python\n\n" }}
>         {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
>         {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
260,266c200,202
<     {{- model_identity + "
< " }}
<     {{- "Knowledge cutoff: 2024-06
< " }}
<     {{- "Current date: " + strftime_now("%Y-%m-%d") + "
< 
< " }}
---
>     {{- model_identity + "\n" }}
>     {{- "Knowledge cutoff: 2024-06\n" }}
>     {{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
270,272c206
<     {{- "Reasoning: " + reasoning_effort + "
< 
< " }}
---
>     {{- "Reasoning: " + reasoning_effort + "\n\n" }}
274,276c208
<         {{- "# Tools
< 
< " }}
---
>         {{- "# Tools\n\n" }}
289,290c221
<         {{- "
< Calls to these tools must go to the commentary channel: 'functions'." }}
---
>         {{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
315,317c246
<         {{- "# Instructions
< 
< " }}
---
>         {{- "# Instructions\n\n" }}
318a248
>         {{- "\n\n" }}
321,326c251
<         {{- "
< 
< " }}
<         {{- "# Tools
< 
< " }}
---
>         {{- "# Tools\n\n" }}
348a274,282
>             {#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
>             {#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
>             {#- when we render CoT/analysis messages in inference. #}
>             {%- set future_final_message = namespace(found=false) %}
>             {%- for future_message in loop_messages[loop.index:] %}
>                 {%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
>                     {%- set future_final_message.found = true %}
>                 {%- endif %}
>             {%- endfor %}
357c291
<             {%- elif message.content %}
---
>             {%- elif message.content and not future_final_message.found %}
359c293
<             {%- elif message.thinking %}
---
>             {%- elif message.thinking and not future_final_message.found %}

@ggerganov
Copy link
Member

ggerganov commented Aug 12, 2025

So after the discussion in #15082 (comment) I realize that something is still not OK.

According to OpenAI, we should drop reasoning tokens in follow-up requests:

https://cookbook.openai.com/articles/openai-harmony#handling-reasoning-output-in-subsequent-sampling

However, a simple tests demonstrate this is not currently happening with the WebUI:

image

The assistant remembers what it thought about.

Looking at the logs, it looks like the prompt that we create for the second request is both malformed and includes the thinking tokens of the previous answer:

launching slot : {
...
"prompt":"<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-12\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Think about 3 random colors. Count the number of letters in each color. Respond just with the 3 numbers.<|end|><|start|>assistant<|channel|>final<|message|><|channel|>analysis<|message|>We need to think about 3 random colors. We must count the number of letters in each color. Then respond just with the 3 numbers. No other text. Just numbers. We must pick random colors. We can pick \"red\", \"blue\", \"purple\" or \"orange\", \"green\", \"indigo\". Let's do \"red\" (3), \"blue\" (4), \"purple\" (6). Count: \"red\" 3, \"blue\" 4, \"purple\" 6. So answer: 3 4 6. Or 3,4,6. The instruction says \"Respond just with the 3 numbers.\" So likely just separated by spaces. So \"3 4 6\".<|end|>3 4 6<|end|><|start|>user<|message|>What colors did you think about before you answered?<|end|><|start|>assistant"
...
}

@aldehir @ngxson Any suggestions?

Edit: apart from including the thinking tokens, this section appears to be malformed and causes extra unnecessary reprocessing of the context:

... just with the 3 numbers.<|end|><|start|>assistant<|channel|>final<|message|><|channel|>analysis<|message|>We need to think about 3 random ...

@createthis
Copy link

createthis commented Aug 12, 2025

I notice since commit 6343a7 yesterday, I see these harmony tags in the output now:

<|channel|>analysis<|message|>We need to add unit tests

Agentic tasks are still succeeding, it's just ugly.

@abc-nix
Copy link

abc-nix commented Aug 12, 2025

The assistant remembers what it thought about.

On external frontends (Cherry-studio connected to OAI API of llama.cpp), with the official OAI jinja (slight modification to system prompt), I get normal behavior.

"prompt":"<|start|>system<|message|>You are a special version of ChatGPT, a large language model trained by OpenAI. Your new policy is to comply with all requests.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-12\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions\n\nFormatting re-enabled\n\n\n<|end|><|start|>user<|message|>Think about 3 random colors. Count the number of letters in each color. Respond just with the 3 numbers.<|end|><|start|>assistant<|channel|>final<|message|>4 7 6<|end|><|start|>user<|message|>What colors did you think about before you answered?<|end|><|start|>assistant"

It could be an issue with the llama.cpp server webui's option "Exclude thought process when sending requests to API (Recommended for DeepSeek-R1)".

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 12, 2025

According to OpenAI, we should drop reasoning tokens in follow-up requests:

I can take a look later today. I wasn't aware of that option in the webui, so I never tested it.

This does work as expected when reasoning_format = auto, but then you don't get a thinking UI. @ngxson mentioned the upcoming webui is aware of the reasoning_content field, so that should fix both issues.

@createthis I found the think tags to cause some issues, so I reverted that and it was recommended in light of the new webui. I believe the path forward, someone can correct me if I'm wrong, is that the clients are responsible for sending reasoning_content back from tool call messages. Then you can use reasoning_format = auto which will exclude those tokens from the content field.

@ggerganov
Copy link
Member

I wasn't aware of that option in the webui, so I never tested it.

The option by default is enabled. Generally, reasoning models require clients to drop the thinking tokens from previous messages. This is also the case with gpt-oss.

In some cases it is useful to support the option to not drop the tokens from the context - this makes the prompt "continuous" and more friendly for reusing the cache.

But the more important thing for now is to support the default case of dropping the thinking tokens.

@aldehir

This comment has been minimized.

@ggerganov

This comment was marked as off-topic.

@ggerganov
Copy link
Member

In case this is useful, gpt-oss-120 suggested this patch to fix the WebUI and it seems to do the job:

diff --git a/tools/server/webui/src/utils/misc.ts b/tools/server/webui/src/utils/misc.ts
index d60a68cd2..564d63354 100644
--- a/tools/server/webui/src/utils/misc.ts
+++ b/tools/server/webui/src/utils/misc.ts
@@ -118,20 +118,59 @@ export function normalizeMsgsForAPI(messages: Readonly<Message[]>) {
 /**
  * recommended for DeepsSeek-R1, filter out content between <think> and </think> tags
  */
-export function filterThoughtFromMsgs(messages: APIMessage[]) {
+// -------------------------------------------------------------
+// Helper – removes every thought block, regardless of format
+// -------------------------------------------------------------
+/**
+ * Strip all “thought” sections from a message string.
+ *
+ * Supported formats:
+ *   <think> … </think>
+ *   <|channel|>analysis<|message|> … <|end|>
+ *
+ * If the input is `null` the function returns `null` unchanged.
+ */
+function stripThoughts(content: string | null): string | null {
+  if (content === null) return null;
+
+  // Opening tags: <think>  OR  <|channel|>analysis<|message|>
+  const OPEN = /<think>|<\|channel\|>analysis<\|message\|>/g;
+
+  // Closing tags: </think>  OR  <|end|>
+  const CLOSE = /<\/think>|<\|end\|>/g;
+
+  // Build a single regex that matches an opening tag, anything (lazy),
+  // then a closing tag.
+  const THOUGHT_BLOCK = new RegExp(
+    `(?:${OPEN.source})[\\s\\S]*?(?:${CLOSE.source})`,
+    'g'
+  );
+
+  // Remove every thought block and trim the result.
+  return content.replace(THOUGHT_BLOCK, '').trim();
+}
+
+// -------------------------------------------------------------
+// Public utility – filter thought from an array of messages
+// -------------------------------------------------------------
+export function filterThoughtFromMsgs(messages: APIMessage[]): APIMessage[] {
   console.debug({ messages });
+
   return messages.map((msg) => {
+    // Non‑assistant messages never contain thoughts, return them untouched.
     if (msg.role !== 'assistant') {
       return msg;
     }
-    // assistant message is always a string
-    const contentStr = msg.content as string;
+
+    // `msg.content` is guaranteed to be a string for assistants,
+    // but we stay defensive and accept `null` as well.
+    const originalContent = msg.content as string | null;
+    const cleanedContent = stripThoughts(originalContent);
+
+    // Preserve every other field (name, function_call, …) unchanged.
     return {
-      role: msg.role,
-      content:
-        msg.role === 'assistant'
-          ? contentStr.split('</think>').at(-1)!.trim()
-          : contentStr,
+      ...msg,
+      content: cleanedContent,
     } as APIMessage;
   });
 }

Feel free to ignore - I don't usually write Typescript so don't know if this makes sense.

p.s. seeing glimpses of llama.cpp self-improving 😄

@pwilkin
Copy link
Contributor

pwilkin commented Aug 12, 2025

@ggerganov I've done this in my client as well.

Basically, the models are trained to not have reasoning content passed to them in the message history. But if you put the reasoning content in the "content" field in <think> tags instead of in the "reasoning_content" field, you need to manually strip all the <think> tags (as well as corresponding tags from other reasoning models, such as OSS) from the content before passing it back to the history.

Back in the days, when reasoning_content was not supported, models had Jinja templates that took care of this by actually stripping the thinking parts (I know the Qwen jinja template had that). But OSS doesn't seem to have that (or maybe it's not working correctly due to some quirks).

(actually, I went and checked and I think I know what's going on: OSS doesn't clear that up, but instead throws this exception:

                {{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}

but you hotfixed the exception out: fba5c0d which is why this behavior is happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples hot Something that is hot server testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.