chat: handle gpt-oss return/end token inconsistency #15421
Conversation
This commit addresses an inconsistency during inference by adding a new member to the `templates_params` struct to indicate whether the chat is in inference mode. This allows the gpt-oss-specific function `common_chat_params_init_gpt_oss` to check this flag and the `add_generation_prompt` flag to determine if it should replace the `<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of past messages in `common_chat_format_single` matches the output of the formatted new message. The issue is that the gpt-oss template emits different end tags: `<|return|>` when `add_generation_prompt` is false, and `<|end|>` when `add_generation_prompt` is true. This causes the substring function to start at an incorrect position, resulting in tokenization starting with `tart|>` instead of `<|start|>`.

Resolves: ggml-org#15417
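To make the offset concrete, here is a small self-contained sketch; the message strings are illustrative assumptions, not code from llama.cpp. The past-messages formatting ends with `<|return|>` (10 characters) while the same turn inside the full formatting ends with `<|end|>` (7 characters), so a suffix taken at the past-formatting length starts 3 characters too late and `<|start|>` degrades to `tart|>`:

```cpp
// Standalone illustration of the described offset; hypothetical strings,
// not taken from llama.cpp.
#include <cassert>
#include <string>

int main() {
    // Past assistant turn formatted on its own (add_generation_prompt = false):
    // the gpt-oss template closes it with <|return|> (10 characters).
    const std::string past =
        "<|start|>assistant<|channel|>final<|message|>Hi<|return|>";

    // The same turn inside the full conversation is no longer the last turn,
    // so it is closed with <|end|> (7 characters) before the new user message.
    const std::string full =
        "<|start|>assistant<|channel|>final<|message|>Hi<|end|>"
        "<|start|>user<|message|>Bye<|end|>";

    // Taking the suffix of the full formatting at the length of the past
    // formatting starts 3 characters (10 - 7) too far, so "<|start|>" is
    // clipped to "tart|>".
    const std::string diff = full.substr(past.size());
    assert(diff.rfind("tart|>", 0) == 0);
    return 0;
}
```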
Something seems to be missing, when is …
I was thinking this would be set to false for training. Not sure if there is a place that requires this to be updated at the moment though (or maybe I'm just not aware of it).
Ah I see, may be useful for @gabe-l-hart's aLoRA PR?
Interesting! I think this is close to what the …
I am still trying to wrap my head around the root cause of the issue.
What is the reason for this behavior? Is it a bug in the template?
It seems to be deliberate:
I don't think this is a bug in the template as it looks intentional. The outputs are different depending on the passed-in `add_generation_prompt`:

```jinja
{%- elif loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{%- if "thinking" in message %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
----> {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
{%- else %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
----> {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
```
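Since the template behavior is deliberate, the PR compensates on the llama.cpp side, as described in the PR text above. Below is a rough sketch of that idea; the struct and field names (`gpt_oss_format_flags`, `is_inference`) and the helper are hypothetical illustrations, not the actual patch:

```cpp
#include <string>

// Hypothetical stand-in for the new templates_params member described in the
// PR; the real field name in the patch may differ.
struct gpt_oss_format_flags {
    bool is_inference          = true;
    bool add_generation_prompt = false;
};

// When formatting for inference with add_generation_prompt == false, the
// gpt-oss template ends the last assistant turn with <|return|>; replacing it
// with <|end|> keeps the "past messages" formatting consistent with how the
// same turn is rendered once further messages follow it.
static void replace_trailing_return(std::string & prompt, const gpt_oss_format_flags & flags) {
    if (!flags.is_inference || flags.add_generation_prompt) {
        return;  // <|return|> is only emitted when add_generation_prompt is false
    }
    static const std::string ret = "<|return|>";
    if (prompt.size() >= ret.size() &&
        prompt.compare(prompt.size() - ret.size(), ret.size(), ret) == 0) {
        prompt.replace(prompt.size() - ret.size(), ret.size(), "<|end|>");
    }
}
```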
ggerganov left a comment:
I see - it has implications when training and hence the current chat template logic.