Replies: 1 comment
-
Since llama-server is supposed to provide OpenAI-compatible endpoints, it makes a lot of sense to support this. We could add "openai" as an option for `--reasoning-format`. The operators choose to …
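A minimal sketch of what the proposed option could mean, assuming the new `--reasoning-format` value is named "openai" (the helper name and the "deepseek" default below are my assumptions, not llama-server's actual implementation):

```python
# Sketch: pick the JSON field that carries the model's reasoning,
# based on a reasoning_format setting. "openai" is the proposed value;
# "deepseek"/reasoning_content reflects current llama-server behavior.

def reasoning_field(reasoning_format: str) -> str:
    return "reasoning" if reasoning_format == "openai" else "reasoning_content"

# Build an assistant message and attach the CoT under the chosen field.
message = {"role": "assistant", "content": "4"}
message[reasoning_field("openai")] = "2 + 2 = 4"
print(message)  # the CoT now sits under the "reasoning" key
```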
-
Several clients do not support the `reasoning_content` field, and it seems like both clients and inference servers have converged on a `reasoning` field. One such example is OpenAI's own compatibility test for GPT-OSS: llama.cpp fails 30/30 tests simply because it places the reasoning in `reasoning_content` and not `reasoning`.

The agents-js library from OpenAI now includes this functionality (openai/openai-agents-js@7b437d9). I've verified that they return the reasoning within the message, which meets the `gpt-oss` spec requiring CoT output from the final tool call message.

Analysis (with `reasoning_content` field)
When modified to use `reasoning`, llama.cpp passes 30/30 tests with `reasoning_effort = low`.

Analysis (with `reasoning` field)
I'm willing to submit a pull request for this issue, but since it appears to be a feature request, I'm posting it in discussions first to gather feedback.
From what I can recall, agentic coding tools like codex and crush support `reasoning` but lack `reasoning_content` support. There are likely other tools that behave similarly.

Thoughts?