gpt-oss: implement harmony parsing #15181
Conversation
Thanks. It finally made it much easier to use tools in Cherry Studio. And it generates thinking boxes properly. |
With the PR: It's better, easily more usable, but there might be some issues around tool calling still. |
@dagbs try setting function calling to |
Force-pushed from d65e556 to 981886f (Compare)
I tried this PR yesterday and compared it to #15158 (+ my own fixes on top of that PR), and there were a couple of issues with this PR (that I was gonna share this morning), but since da67163 was pushed, it seems to finally work better than that PR. In my (albeit limited) testing, it seems tool calling and its formatting are working a lot better. Thanks a ton for this patch @aldehir! All the unit tests pass as well, compared to the other PR, and the code organization at a glance seems better too, though granted I'm no C++ expert, just a generalist. |
Hmm, it seems to still be breaking sometimes; I tried to understand why but to no avail. Most of the time it works perfectly fine, but it seems some edge case breaks it. Running da67163 right now. If I repeatedly use the same weather example maybe 10 times, I end up getting a badly parsed response (on llama.cpp's side) maybe once. A good run looks like this: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
"Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
),
reasoning_content: Some(
"The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f} +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f} +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f} +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
),
tool_calls: [],
},
},
],
}
Meanwhile, a bad run ends up with: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
),
reasoning_content: None,
tool_calls: [],
},
},
],
} Full logs from bad run:
Seems to happen more often when |
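For context, the requests behind these runs look roughly like the one below. The get_weather schema and prompt wording are assumptions for illustration; the thread does not show the exact tool definition.
# Illustrative request only -- tool schema and prompt wording are assumed.
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "What is the current weather in Barcelona, Stockholm, and Lima? Display them in a list sorted by their temperatures, highest first."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }
  }]
}'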
@victorb maybe use temperature= 0 and/or top-k 1? If inference is the issue, making it deterministic would fix it. |
Running with these inference parameters for example: {
temperature: 0.0,
top_p: 1.0,
min_p: 0.0,
top_k: 0,
samplers: [
"top_k",
"top_p",
"min_p",
"temperature",
],
} Seems to correctly give me deterministic responses: once I get one good response, it always works well, and the ones that break always break, so it's useful for testing at the very least. Here's one example of broken parsing I'm currently getting, even with ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=function\u{a0}\u{a0}...",
),
reasoning_content: None,
tool_calls: [],
},
},
],
}
Tried setting |
@victorb thank you for that extensive testing. I can't seem to reproduce this on
That will help me better understand the problem. It appears the model is emitting unicode space characters, but I wasn't aware the |
I managed to get Looks like I missed a scenario where the model outputs the recipient (
I have yet to see the I updated the parsing and grammar rule to handle this. It should at least parse the tool calls now. I found performance degrades by the third call. I get queries to "Lima??", "Lima?", or some variation with garbage at the end. However, if I pass Give cf9a0d6 a shot. |
For those interested, I implemented a basic cache for reasoning content in my fork aldehir#1. Without prior reasoning content for tool calls, |
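As a hedged sketch of the client-side alternative (passing the prior reasoning back yourself), a follow-up request could look like the one below. The reasoning text, tool call, and arguments are invented for illustration, and whether the chat template actually consumes reasoning_content depends on the server build.
# Sketch only: send the model's earlier reasoning back together with its tool
# call on the next turn. Values below are invented for illustration.
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "What is the current weather in Lima?"},
    {"role": "assistant",
     "reasoning_content": "The user wants the weather in Lima, so I should call get_weather.",
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "get_weather", "arguments": "{\"location\": \"Lima\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Lima: ⛅ +16°C"}
  ]
}'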
Awesome @aldehir, did a bunch of testing yesterday with 20b and 120b and tool parsing didn't fail once! 🎉 I do see the same inference quality degradation after a few messages, mainly hallucinations for the tool arguments (calling get_weather("...") or get_weather("?") for example) with both 20b and 120b. However, trying out the Overall, seems solid to me now. Since cf9a0d6, the parsing of Harmony seems complete in all the examples I've tried to run, everything goes into the right place and tool calls/responses all look correct now. |
@aldehir using your
If If I wonder if this is related to the grammar generation for the tool calls which is somehow constraining it to always use the first tool. BTW this is the first model I've tried with llama-server that can mix reasoning with tool calls, so it is definitely in the right direction! |
@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork. |
Thanks a lot, seems to be working perfectly now! |
I've also been playing with calling tools in its CoT and confirm it is working correctly. For example, if I provide this tool to the LLM: async def arithmetic(code: str) -> str:
"""
Evaluates arithmetic expression and returns the result.
ANY arithmetic questions (no matter how trivial) should make use of this tool in your chain of thought. Always return this tool's response even if it is wrong!
"""
return f"{eval('5 + 5')}" Then it will always use it during reasoning. There's something I'm wondering though: looking at the template, I can see it tells the LLM about 2 possible builtin tools it can use in its CoT ( |
@tarruda those tools cause the model to produce different type constraints other than json. I believe the Python one produces
So I think they need their own grammar rules. From what I can tell, it seems those tools are intended to be resolved internally and not sent back to the user. For example, the Python one mentions a I suppose it could process the builtins, generate tool calls, and any interested parties can implement middleware to intercept the calls. |
@aldehir I am experimenting with your Does this mean I need to ditch the GGUF embedded template and use |
@chaserhkj that's because open-webui injects the reasoning itself into the content. I added a fix in my fork to address that. But for open-webui, this PR should be enough. I don't think the reasoning cache has gotten enough use for me to recommend it, I simply wanted to show that the model performs better when you pass along its reasoning in tool calls. If you'd like to keep using it, feel free to resume the conversation there. Like I said, for open-webui it shouldn't be necessary. |
Yes, we are expected to drop the analysis channel when the last channel ends with I am trying out gpt-oss 120b right now, and the lack of Harmony response format parsing is one of the biggest obstacles to using this model. |
Hi everyone, are there any Docker image releases that support this feature? |
Force-pushed from b0b16e2 to 1e595d2 (Compare)
I think this should be good to merge - did some testing and tool calls are working both with CC and the llama.vscode agent.
@ngxson Let's fix any potential problems from master.
gpt-oss works now in my MCP tests. |
@ggerganov This commit has broken llama-server for gpt-oss-20b in chat software such as Jan and RecurseChat. Both clients are treating all of the response as "Thought". I am running the server with this command: When I set reasoning-format to auto, both clients are not showing any "Thought" or thinking response at all, only the final message. Is this a bug in Jan's and RecurseChat's Harmony implementations, or did we remove a critical part, e.g. the <|channel|>final<|message|>? Sorry, I'm not sure where the problem is, but I did track it down to this commit. Any help would be appreciated, thank you! |
@isgallagher @semidark Are you sure this isn't a problem in their implementations, rather than llama.cpp? I'm asking as I'm not seeing the same in my own tests, but I'm using my own client library. Currently using llama.cpp 1fe0029, ran like this: $ ./build/bin/llama-server \
-m /mnt/nas/models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
--ctx-size 131072 \
--flash-attn \
--gpu-layers 99 \
--threads $(nproc) --threads-batch $(nproc) \
--jinja \
--verbose-prompt -v And everything seems to be parsed and returned correctly. Example:
I think the tests added in this PR confirm the correct behavior as well, as far as I can tell. |
@victorb Hi Victor, thanks for the response here. I am still having issues even with curl...
I am missing the Here is the same curl with |
@isgallagher that seems to be because of the With $ curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{ "model": "unsloth-gpt-oss-20b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Translate the following English text to French: \"Hello, world!\""} ], "max_tokens": 1024, "temperature": 1.0, "n": 1, "stop": null }' | jq .choices
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3591 100 3316 100 275 5261 436 --:--:-- --:--:-- --:--:-- 5690
[
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "<|channel|>analysis<|message|>The user asks to translate \"Hello, world!\" to French. That's straightforward: \"Bonjour, le monde!\" or \"Bonjour, monde!\" The common translation is \"Bonjour, le monde !\" with space before exclamation. Provide translation.<|end|>« Bonjour, le monde ! »"
}
}
] With $ curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" -d '{ "model": "unsloth-gpt-oss-20b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Translate the following English text to French: \"Hello, world!\""} ], "max_tokens": 1024, "temperature": 1.0, "n": 1, "stop": null }' | jq .choices
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3543 100 3268 100 275 5100 429 --:--:-- --:--:-- --:--:-- 5527
[
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to translate \"Hello, world!\" into French. Simple: \"Bonjour, le monde !\" or \"Bonjour le monde!\" Usually \"Bonjour le monde!\" or \"Salut, le monde!\". Probably \"Bonjour, le monde !\" Provide translation.",
"content": "**Bonjour, le monde !**"
}
}
] (FYI: as far as I know, temperature needs to be set to either 1.0 or 0.0 for GPT-OSS; I'm guessing the model might even screw up the Harmony syntax itself outside of those values, but I haven't tested much outside of them myself, so someone correct me if that's wrong.) Edit: Just re-read your first message:
I think this means they've either implemented it like that on purpose, or it's a bug in how they handle reasoning_content/content split. |
@isgallagher Increase the max tokens for your curl example. The analysis counts towards the token count so it never gets to the end. Also, use @semidark same deal, the Please open a new issue if you have any more problems. I understand the confusion behind the reasoning, and it's mostly because GPT-OSS decided to be different than every other model. |
@victorb Yes, the client I am using, RecurseChat, has implemented Harmony parsing. I am having an issue with this client and I tracked it to this PR. @aldehir I set max tokens to 600 and I'm still missing the |
@isgallagher the output from llama-server looks correct; the client needs to change how it handles gpt-oss if it was designed to use with A message in Harmony is defined as Let's just take two basic messages as an example, the reasoning and the final output:
So with the above in mind, we can see that when you use the option to skip reasoning parsing, the client should expect to receive:
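Concretely, a simplified illustration of that shape (the wording is made up; compare the curl output earlier in the thread):
<|channel|>analysis<|message|>The user asked for a translation, so I will translate it.<|end|>Here is the final answer shown to the user.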
Note: there is no <|start|> in the reasoning above because that is part of the input prompt applied by the template. All the special tokens should be parsed correctly, including tool use, preamble (this also goes to content), etc., so the client does not need to parse them. I looked at the PR previously and I think it should work well. |
FYI: llama.cpp master and open webui are working beautifully as far as I’m concerned. Works with Open Hands AI with native tool calling enabled too. I’m using the ggufs from this PR. |
OK thanks all, sorry for the noise on this! |
@createthis, my bad. I used the wrong binaries. I was trying many different GGUFs and compiled binaries and must have mixed something up. At least in OpenWebUI everything looks good now. I hope OpenHands will follow shortly. Also, sorry for the noise. |
It still looks to me like llama-server is not adhering to the Harmony spec since the final channel is gone now. However, when using reasoning-format auto, I'm getting
Which shows the breakout of |
Thanks a lot to everyone and especially @aldehir for all the hard work in this PR! Tool calling in Codex CLI works perfectly now with |
This is my attempt at implementing a harmony parser for gpt-oss.
Implementation
- Reasoning formats auto and none are supported. When none, <|channel|>analysis<|message|>{reasoning content}<|end|> is added to the content.
- When parse_tool_calls == false, tool calls are added to the content verbatim--which aligns with other implementations.
Remaining Work
- The model performs better when its prior reasoning for tool calls is passed back via reasoning_content. However, none of the clients I tested send it. A simple workaround is to use reasoning_format = none, or add the reasoning to the content in tool calls.
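For the first workaround, a minimal sketch of a server invocation, assuming a local GGUF path (the model path is a placeholder; --jinja and --reasoning-format are existing llama-server flags):
# Sketch: run the server with reasoning parsing disabled so the analysis
# channel stays in the returned content. The model path is a placeholder.
./build/bin/llama-server \
  -m /path/to/gpt-oss-20b-mxfp4.gguf \
  --jinja \
  --reasoning-format none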