Conversation

@danbev
Member

@danbev danbev commented Aug 19, 2025

This commit addresses an inconsistency during inference by adding a new member to the templates_params struct to indicate whether the chat is in inference mode. This allows the gpt-oss-specific function common_chat_params_init_gpt_oss to check this flag together with the add_generation_prompt flag to determine whether it should replace the <|return|> token with the <|end|> token in the prompt.
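A minimal sketch of that approach, assuming simplified shapes for the struct and helper (only is_inference, add_generation_prompt, and the two tokens come from this description; the surrounding names and the exact call site are illustrative, not the actual diff):

#include <string>

// Sketch only: the real struct lives in common/chat.cpp and has many more fields.
struct templates_params {
    bool add_generation_prompt = true;
    bool is_inference          = true; // new member; would be set to false for training-style rendering
};

// Illustrative helper for the gpt-oss case: when rendering for inference without a
// generation prompt, the template ends the last assistant turn with <|return|>,
// so normalize it back to <|end|> to keep prompts consistent.
static void normalize_gpt_oss_end_tag(std::string & prompt, const templates_params & params) {
    if (params.is_inference && !params.add_generation_prompt) {
        static const std::string from = "<|return|>";
        static const std::string to   = "<|end|>";
        const auto pos = prompt.rfind(from);
        if (pos != std::string::npos) {
            prompt.replace(pos, from.size(), to);
        }
    }
}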

The motivation for this change is to ensure that the formatted prompt of past messages in common_chat_format_single matches the corresponding prefix of the formatted new message. The issue is that the gpt-oss template emits different end tags: <|return|> when add_generation_prompt is false, and <|end|> when add_generation_prompt is true. This causes the substring operation to start at an incorrect position, so tokenization starts with 'tart|>' instead of '<|start|>'.
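To make the mismatch concrete, here is a rough sketch of the diff step (the real logic lives in common_chat_format_single; the strings below are shortened, and only the two end tags and the 'tart|>' symptom are taken from the description above):

#include <cassert>
#include <string>

int main() {
    // Past messages rendered on their own: with add_generation_prompt false the
    // final assistant turn ends with <|return|> (10 characters).
    const std::string fmt_past =
        "<|start|>assistant<|channel|>final<|message|>Hi<|return|>";

    // Past messages rendered together with the new user message: the same
    // assistant turn is no longer last, so it now ends with <|end|> (7 characters).
    const std::string fmt_new =
        "<|start|>assistant<|channel|>final<|message|>Hi<|end|>"
        "<|start|>user<|message|>More?<|end|>";

    // The diff is taken as a substring starting at fmt_past.size(), which is
    // 3 characters too far (10 - 7), so "<|start|>" loses its first 3 characters.
    const std::string diff = fmt_new.substr(fmt_past.size());
    assert(diff.rfind("tart|>", 0) == 0); // starts with "tart|>" instead of "<|start|>"
    return 0;
}

With the change described above, the past-messages render also ends with <|end|>, so the two renders share a common prefix and the diff starts cleanly at <|start|>.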

Resolves: #15417

@CISC
Collaborator

CISC commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

@danbev
Member Author

danbev commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

I was thinking this would be set to false for training. Not sure if there is a place that requires this to be updated at the moment though (or maybe I'm just not aware of it).

@CISC
Collaborator

CISC commented Aug 19, 2025

Something seems to be missing; when is is_inference not true?

I was thinking this would be set to false for training. Not sure if there is a place that requires this to be updated at the moment though (or maybe I'm just not aware of it).

Ah I see, this may be useful for @gabe-l-hart's aLoRA PR?

@gabe-l-hart
Collaborator

Interesting! I think this is close to what the aLoRA activation strings need, if the return and end tokens were configurable. For the aLoRAs, we'd want to set return to "None<|end_of_turn|>" and end to "" to simply leave the last turn open (i.e. a non-assistant generation prompt). I'm not sure it's worth conflating the two in this PR though. Assuming aLoRA will take a bit longer to get right, I can always try to make this more abstract if it makes sense once it's merged.
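Purely to illustrate that idea (nothing below exists in this PR or in llama.cpp; every name is hypothetical), making the two closing tags configurable rather than flag-driven might look roughly like:

#include <string>

// Hypothetical sketch: configurable closing tags instead of a single boolean.
struct end_tag_overrides {
    std::string return_tag = "<|return|>"; // tag used when add_generation_prompt is false
    std::string end_tag    = "<|end|>";    // tag used for earlier assistant turns
};

// The aLoRA configuration described above would then be:
//   end_tag_overrides alora_tags{ "None<|end_of_turn|>", "" };
// i.e. close the turn with "None<|end_of_turn|>" and leave the last turn open.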

@ggerganov
Member

I am still trying to wrap my head around the root cause of the issue.

The issue is that the gpt-oss template returns different end tags: <|return|> when add_generation_prompt is false, and <|end|> when add_generation_prompt is true.

What is the reason for this behavior? Is it a bug in the template?

@CISC
Collaborator

CISC commented Aug 20, 2025

What is the reason for this behavior? Is it a bug in the template?

It seems to be deliberate:
https://huggingface.co/openai/gpt-oss-20b/blob/cbf31f62664d4b1360b3a78427f7b3c3ed8f0fa8/chat_template.jinja#L302-L316

@danbev
Member Author

danbev commented Aug 20, 2025

What is the reason for this behavior? Is it a bug in the template?

I don't think this is a bug in the template; it looks intentional. The output differs depending on the value of add_generation_prompt that is passed in:

{%- elif loop.last and not add_generation_prompt %}
            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
            {#- This is a situation that should only occur in training, never in inference. #}
            {%- if "thinking" in message %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}

            {#- <|return|> indicates the end of generation, but <|end|> does not #}
            {#- <|return|> should never be an input to the model, but we include it as the final token #}
            {#- when training, so the model learns to emit it. #}
---->       {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
        {%- else %}
            {#- CoT is dropped during all previous turns, so we never render it for inference #}
---->       {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
            {%- set last_tool_call.name = none %}
        {%- endif %}
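Concretely, the two marked lines render the same final assistant message with different closing tags (message content shortened here for illustration):

<|start|>assistant<|channel|>final<|message|>Hello<|return|>   <- last turn and add_generation_prompt is false (training-style render)
<|start|>assistant<|channel|>final<|message|>Hello<|end|>      <- any other assistant final turn, or add_generation_prompt is true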

@ggerganov
Member

ggerganov left a comment

I see - it has implications for training, hence the current chat template logic.

@danbev danbev merged commit 657b8a7 into ggml-org:master Aug 20, 2025
47 checks passed
@danbev danbev deleted the gpt-oss-chat-template-return branch August 21, 2025 10:19
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025
Successfully merging this pull request may close these issues.

Misc. bug: chat template diff logic causes incomplete tokenization with GPT-OSS model due to <|return|> vs <|end|> inconsistency
