Conversation

@danbev
Member

@danbev danbev commented Aug 19, 2025

This commit addresses an inconsistency during inference by adding a new member to the templates_params struct to indicate whether the chat is in inference mode. This allows the gpt-oss-specific function common_chat_params_init_gpt_oss to check this flag together with the add_generation_prompt flag to determine whether it should replace the <|return|> token with the <|end|> token in the prompt.
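A minimal sketch of that approach, assuming simplified shapes for the struct and helper (only is_inference, add_generation_prompt, and the two tokens come from this description; the surrounding names and the exact call site are illustrative, not the actual diff):

#include <string>

// Sketch only: the real struct lives in common/chat.cpp and has many more fields.
struct templates_params {
    bool add_generation_prompt = true;
    bool is_inference          = true; // new member; would be set to false for training-style rendering
};

// Illustrative helper for the gpt-oss case: when rendering for inference without a
// generation prompt, the template ends the last assistant turn with <|return|>,
// so normalize it back to <|end|> to keep prompts consistent.
static void normalize_gpt_oss_end_tag(std::string & prompt, const templates_params & params) {
    if (params.is_inference && !params.add_generation_prompt) {
        static const std::string from = "<|return|>";
        static const std::string to   = "<|end|>";
        const auto pos = prompt.rfind(from);
        if (pos != std::string::npos) {
            prompt.replace(pos, from.size(), to);
        }
    }
}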

The motivation for this change is to ensure that the formatted prompt of past messages in common_chat_format_single matches the corresponding prefix of the formatted new message. The issue is that the gpt-oss template emits different end tags: <|return|> when add_generation_prompt is false, and <|end|> when add_generation_prompt is true. This causes the substring operation to start at an incorrect position, so tokenization starts with 'tart|>' instead of '<|start|>'.
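To make the mismatch concrete, here is a rough sketch of the diff step (the real logic lives in common_chat_format_single; the strings below are shortened, and only the two end tags and the 'tart|>' symptom are taken from the description above):

#include <cassert>
#include <string>

int main() {
    // Past messages rendered on their own: with add_generation_prompt false the
    // final assistant turn ends with <|return|> (10 characters).
    const std::string fmt_past =
        "<|start|>assistant<|channel|>final<|message|>Hi<|return|>";

    // Past messages rendered together with the new user message: the same
    // assistant turn is no longer last, so it now ends with <|end|> (7 characters).
    const std::string fmt_new =
        "<|start|>assistant<|channel|>final<|message|>Hi<|end|>"
        "<|start|>user<|message|>More?<|end|>";

    // The diff is taken as a substring starting at fmt_past.size(), which is
    // 3 characters too far (10 - 7), so "<|start|>" loses its first 3 characters.
    const std::string diff = fmt_new.substr(fmt_past.size());
    assert(diff.rfind("tart|>", 0) == 0); // starts with "tart|>" instead of "<|start|>"
    return 0;
}

With the change described above, the past-messages render also ends with <|end|>, so the two renders share a common prefix and the diff starts cleanly at <|start|>.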

Resolves: #15417

@CISC
Collaborator

CISC commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

@danbev
Member Author

danbev commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

I was thinking this would be set to false for training. Not sure if there is a place that requires this to be updated at the moment though (or maybe I'm just not aware of it).

@CISC
Collaborator

CISC commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

I was thinking this would be set to false for training. Not sure if there is a place that requires this to be updated at the moment though (or maybe I'm just not aware of it).

Ah I see, this may be useful for @gabe-l-hart's aLoRA PR?

@gabe-l-hart
Collaborator

Interesting! I think this is close to what the aLoRA activation strings need, if the return and end tokens were configurable. For the aLoRAs, we'd want to set return to "None<|end_of_turn|>" and end to "" to simply leave the last turn open (i.e. a non-assistant generation prompt). I'm not sure it's worth conflating the two in this PR though. Assuming aLoRA will take a bit longer to get right, I can always try to make this more abstract if it makes sense once it's merged.
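Purely to illustrate that idea (nothing below exists in this PR or in llama.cpp; every name is hypothetical), making the two closing tags configurable rather than flag-driven might look roughly like:

#include <string>

// Hypothetical sketch: configurable closing tags instead of a single boolean.
struct end_tag_overrides {
    std::string return_tag = "<|return|>"; // tag used when add_generation_prompt is false
    std::string end_tag    = "<|end|>";    // tag used for earlier assistant turns
};

// The aLoRA configuration described above would then be:
//   end_tag_overrides alora_tags{ "None<|end_of_turn|>", "" };
// i.e. close the turn with "None<|end_of_turn|>" and leave the last turn open.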

@ggerganov
Member

I am still trying to wrap my head around the root cause of the issue.

The issue is that the gpt-oss template returns different end tags: <|return|> when add_generation_prompt is false, and <|end|> when add_generation_prompt is true.

What is the reason for this behavior? Is it a bug in the template?

@CISC
Collaborator

CISC commented Aug 20, 2025

What is the reason for this behavior? Is it a bug in the template?

It seems to be deliberate:
https://huggingface.co/openai/gpt-oss-20b/blob/cbf31f62664d4b1360b3a78427f7b3c3ed8f0fa8/chat_template.jinja#L302-L316

@danbev
Member Author

danbev commented Aug 20, 2025

What is the reason for this behavior? Is it a bug in the template?

I don't think this is a bug in the template; it looks intentional. The output differs depending on the value of add_generation_prompt that is passed in:

{%- elif loop.last and not add_generation_prompt %}
            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
            {#- This is a situation that should only occur in training, never in inference. #}
            {%- if "thinking" in message %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}

            {#- <|return|> indicates the end of generation, but <|end|> does not #}
            {#- <|return|> should never be an input to the model, but we include it as the final token #}
            {#- when training, so the model learns to emit it. #}
---->       {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
        {%- else %}
            {#- CoT is dropped during all previous turns, so we never render it for inference #}
---->       {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
            {%- set last_tool_call.name = none %}
        {%- endif %}
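Concretely, the two marked lines render the same final assistant message with different closing tags (message content shortened here for illustration):

<|start|>assistant<|channel|>final<|message|>Hello<|return|>   <- last turn and add_generation_prompt is false (training-style render)
<|start|>assistant<|channel|>final<|message|>Hello<|end|>      <- any other assistant final turn, or add_generation_prompt is true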

@ggerganov
Member

ggerganov left a comment

I see - it has implications for training, hence the current chat template logic.

@danbev danbev merged commit 657b8a7 into ggml-org:master Aug 20, 2025
47 checks passed
@danbev danbev deleted the gpt-oss-chat-template-return branch August 21, 2025 10:19
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025
Successfully merging this pull request may close these issues.

Misc. bug: chat template diff logic causes incomplete tokenization with GPT-OSS model due to <|return|> vs <|end|> inconsistency
