gpt-oss: implement harmony parsing #15181
Conversation
Thanks. It finally made it much easier to use tools in Cherry Studio. And it generates thinking boxes properly. |
With the PR: It's better, easily more usable, but there might be some issues around tool calling still. |
@dagbs try setting function calling to |
Force-pushed from d65e556 to 981886f (Compare)
I tried this PR yesterday and compared it to #15158 (+ my own fixes on top of that PR), and there were a couple of issues with this PR (that I was gonna share this morning), but since da67163 was pushed, it seems to finally work better than that PR. In my (albeit limited) testing, it seems tool calling and its formatting are working a lot better. Thanks a ton for this patch @aldehir! All the unit tests pass as well, compared to the other PR, and the code organization at a glance seems better too, though granted I'm no C++ expert, just a generalist. |
Hmm, it seems to still be breaking sometimes; I tried to understand why but to no avail. Most of the time it works perfectly fine, but it seems some edge case breaks it. Running da67163 right now. If I repeatedly use the same weather example maybe 10 times, I end up getting a badly parsed response (on llama.cpp's side) maybe once. A good run looks like this: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
"Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
),
reasoning_content: Some(
"The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f} +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f} +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f} +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
),
tool_calls: [],
},
},
],
}
Meanwhile, a bad run ends up with: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
),
reasoning_content: None,
tool_calls: [],
},
},
],
} Full logs from bad run:
Seems to happen more often when |
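For context, the requests behind these runs look roughly like the one below. The get_weather schema and prompt wording are assumptions for illustration; the thread does not show the exact tool definition.
# Illustrative request only -- tool schema and prompt wording are assumed.
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "What is the current weather in Barcelona, Stockholm, and Lima? Display them in a list sorted by their temperatures, highest first."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }
  }]
}'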
@victorb maybe use temperature= 0 and/or top-k 1? If inference is the issue, making it deterministic would fix it. |
Running with these inference parameters for example: {
temperature: 0.0,
top_p: 1.0,
min_p: 0.0,
top_k: 0,
samplers: [
"top_k",
"top_p",
"min_p",
"temperature",
],
} Seems to correctly give me deterministic responses: once I get one good response, it always works well, and the ones that break always break, so it's useful for testing at the very least. Here's one example of broken parsing I'm currently getting, even with ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=function\u{a0}\u{a0}...",
),
reasoning_content: None,
tool_calls: [],
},
},
],
}
Tried setting |
@victorb thank you for that extensive testing. I can't seem to reproduce this on
That will help me better understand the problem. It appears the model is emitting unicode space characters, but I wasn't aware the |
I managed to get Looks like I missed a scenario where the model outputs the recipient (
I have yet to see the I updated the parsing and grammar rule to handle this. It should at least parse the tool calls now. I found performance degrades by the third call. I get queries to "Lima??", "Lima?", or some variation with garbage at the end. However, if I pass Give cf9a0d6 a shot. |
For those interested, I implemented a basic cache for reasoning content in my fork aldehir#1. Without prior reasoning content for tool calls, |
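As a hedged sketch of the client-side alternative (passing the prior reasoning back yourself), a follow-up request could look like the one below. The reasoning text, tool call, and arguments are invented for illustration, and whether the chat template actually consumes reasoning_content depends on the server build.
# Sketch only: send the model's earlier reasoning back together with its tool
# call on the next turn. Values below are invented for illustration.
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "What is the current weather in Lima?"},
    {"role": "assistant",
     "reasoning_content": "The user wants the weather in Lima, so I should call get_weather.",
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "get_weather", "arguments": "{\"location\": \"Lima\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Lima: ⛅ +16°C"}
  ]
}'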
Awesome @aldehir, did a bunch of testing yesterday with 20b and 120b and tool parsing didn't fail once! 🎉 I do see the same inference quality degradation after a few messages, mainly hallucinations for the tool arguments (calling get_weather("...") or get_weather("?") for example) with both 20b and 120b. However, trying out the Overall, seems solid to me now. Since cf9a0d6, the parsing of Harmony seems complete in all the examples I've tried to run, everything goes into the right place and tool calls/responses all look correct now. |
@aldehir using your
If If I wonder if this is related to the grammar generation for the tool calls which is somehow constraining it to always use the first tool. BTW this is the first model I've tried with llama-server that can mix reasoning with tool calls, so it is definitely in the right direction! |
@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork. |
Thanks a lot, seems to be working perfectly now! |
I've also been playing with calling tools in its CoT and confirm it is working correctly. For example, if I provide this tool to the LLM: async def arithmetic(code: str) -> str:
"""
Evaluates arithmetic expression and returns the result.
ANY arithmetic questions (no matter how trivial) should make use of this tool in your chain of thought. Always return this tool's response even if it is wrong!
"""
return f"{eval('5 + 5')}" Then it will always use it during reasoning. There's something I'm wondering though: looking at the template, I can see it tells the LLM about 2 possible builtin tools it can use in its CoT ( |
@tarruda those tools cause the model to produce different type constraints other than json. I believe the Python one produces
So I think they need their own grammar rules. From what I can tell, it seems those tools are intended to be resolved internally and not sent back to the user. For example, the Python one mentions a I suppose it could process the builtins, generate tool calls, and any interested parties can implement middleware to intercept the calls. |
@aldehir I am experimenting with your Does this mean I need to ditch the GGUF embedded template and use |
@chaserhkj that's because open-webui injects the reasoning itself into the content. I added a fix in my fork to address that. But for open-webui, this PR should be enough. I don't think the reasoning cache has gotten enough use for me to recommend it, I simply wanted to show that the model performs better when you pass along its reasoning in tool calls. If you'd like to keep using it, feel free to resume the conversation there. Like I said, for open-webui it shouldn't be necessary. |
Yes, we are expected to drop the analysis channel when the last channel ends with I am trying out gpt-oss 120b right now, and the lack of Harmony response format parsing is one of the biggest obstacles to using this model. |
Hi everyone, are there any Docker image releases that support this feature? |
Force-pushed from b0b16e2 to 1e595d2 (Compare)
I think this should be good to merge - did some testing and tool calls are working both with CC and the llama.vscode agent.
@ngxson Let's fix any potential problems from master.
gpt-oss works now in my MCP tests. |
@ggerganov This commit has broken llama-server for gpt-oss-20b in chat software such as Jan and RecurseChat. Both clients are treating all of the response as "Thought". I am running the server with this command: When I set reasoning-format to auto, both clients are not showing any "Thought" or thinking response at all, only the final message. Is this a bug in Jan's and RecurseChat's Harmony implementations, or did we remove a critical part, e.g. the <|channel|>final<|message|>? Sorry, I'm not sure where the problem is, but I did track it down to this commit. Any help would be appreciated, thank you! |
@isgallagher @semidark Are you sure this isn't a problem in their implementations, rather than llama.cpp? I'm asking as I'm not seeing the same in my own tests, but I'm using my own client library. Currently using llama.cpp 1fe0029, ran like this: $ ./build/bin/llama-server \
-m /mnt/nas/models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
--ctx-size 131072 \
--flash-attn \
--gpu-layers 99 \
--threads $(nproc) --threads-batch $(nproc) \
--jinja \
--verbose-prompt -v And everything seems to be parsed and returned correctly. Example:
I think the tests added in this PR confirm the correct behavior as well, as far as I can tell. |
@victorb Hi Victor, thanks for the response here. I am still having issues even with curl...
I am missing the Here is the same curl with |
@isgallagher that seems to be because of the With $ curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{ "model": "unsloth-gpt-oss-20b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Translate the following English text to French: \"Hello, world!\""} ], "max_tokens": 1024, "temperature": 1.0, "n": 1, "stop": null }' | jq .choices
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3591 100 3316 100 275 5261 436 --:--:-- --:--:-- --:--:-- 5690
[
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "<|channel|>analysis<|message|>The user asks to translate \"Hello, world!\" to French. That's straightforward: \"Bonjour, le monde!\" or \"Bonjour, monde!\" The common translation is \"Bonjour, le monde !\" with space before exclamation. Provide translation.<|end|>« Bonjour, le monde ! »"
}
}
] With $ curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{ "model": "unsloth-gpt-oss-20b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Translate the following English text to French: \"Hello, world!\""} ], "max_tokens": 1024, "temperature": 1.0, "n": 1, "stop": null }' | jq .choices
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3543 100 3268 100 275 5100 429 --:--:-- --:--:-- --:--:-- 5527
[
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to translate \"Hello, world!\" into French. Simple: \"Bonjour, le monde !\" or \"Bonjour le monde!\" Usually \"Bonjour le monde!\" or \"Salut, le monde!\". Probably \"Bonjour, le monde !\" Provide translation.",
"content": "**Bonjour, le monde !**"
}
}
] (FYI: as far as I know, temperature needs to be set to either 1.0 or 0.0 for GPT-OSS; I'm guessing the model might even screw up the Harmony syntax itself outside of those values, but I haven't tested much outside of them myself, so someone correct me if that's wrong.) Edit: Just re-read your first message:
I think this means they've either implemented it like that on purpose, or it's a bug in how they handle reasoning_content/content split. |
@isgallagher Increase the max tokens for your curl example. The analysis counts towards the token count so it never gets to the end. Also, use @semidark same deal, the Please open a new issue if you have any more problems. I understand the confusion behind the reasoning, and it's mostly because GPT-OSS decided to be different than every other model. |
@victorb Yes, the client I am using, RecurseChat, has implemented Harmony parsing. I am having an issue with this client and I tracked it to this PR. @aldehir I set max tokens to 600 and I'm still missing the |
@isgallagher the output from llama-server looks correct; the client needs to change how it handles gpt-oss if it was designed to use with A message in Harmony is defined as Let's just take two basic messages as an example, the reasoning and the final output:
So with the above in mind, we can see that when you use the option to skip reasoning parsing, the client should expect to receive:
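Concretely, a simplified illustration of that shape (the wording is made up; compare the curl output earlier in the thread):
<|channel|>analysis<|message|>The user asked for a translation, so I will translate it.<|end|>Here is the final answer shown to the user.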
Note: there is no <|start|> in the reasoning above because that is part of the input prompt applied by the template. All the special tokens should be parsed correctly, including tool use, preamble (this also goes to content), etc., so the client does not need to parse them. I looked at the PR previously and I think it should work well. |
FYI: llama.cpp master and open webui are working beautifully as far as I’m concerned. Works with Open Hands AI with native tool calling enabled too. I’m using the ggufs from this PR. |
OK thanks all, sorry for the noise on this! |
@createthis, my bad. I used the wrong binaries. I was trying many different GGUFs and compiled binaries and must have mixed something up. At least in OpenWebUI everything looks good now. I hope OpenHands will follow shortly. Also, sorry for the noise. |
It still looks to me like llama-server is not adhering to the Harmony spec since the final channel is gone now. However, when using reasoning-format auto, I'm getting
Which shows the breakout of |
Thanks a lot to everyone and especially @aldehir for all the hard work in this PR! Tool calling in Codex CLI works perfectly now with |
This is my attempt at implementing a harmony parser for gpt-oss.
Implementation
- Reasoning formats auto and none are supported. When none, <|channel|>analysis<|message|>{reasoning content}<|end|> is added to the content.
- When parse_tool_calls == false, tool calls are added to the content verbatim--which aligns with other implementations.
Remaining Work
- The model performs better when its prior reasoning for tool calls is passed back via reasoning_content. However, none of the clients I tested send it. A simple workaround is to use reasoning_format = none, or add the reasoning to the content in tool calls.
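For the first workaround, a minimal sketch of a server invocation, assuming a local GGUF path (the model path is a placeholder; --jinja and --reasoning-format are existing llama-server flags):
# Sketch: run the server with reasoning parsing disabled so the analysis
# channel stays in the returned content. The model path is a placeholder.
./build/bin/llama-server \
  -m /path/to/gpt-oss-20b-mxfp4.gguf \
  --jinja \
  --reasoning-format none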