
Commit 7099343

feat: add eot_tokens and train_on_eot for chat_template EOT parsing (axolotl-ai-cloud#2364)

* feat: add eot_tokens and train_on_eot for chat_template EOT parsing
* fix: comments
* chore: add some examples of tokens
* feat: add new potential errors for chat_template to faq
* feat: add examples for EOT handling
* fix: change error to warning for missing EOS
* fix: warning typo
* feat: add tests for eot token handling
* fix: remove broken caplog capture in test
* fix: chattemplate strategy with kd missing eot changes
1 parent 5000cb3 commit 7099343

7 files changed: +575 −50 lines

docs/config.qmd

Lines changed: 22 additions & 5 deletions
@@ -187,7 +187,7 @@ datasets:
 # IMPORTANT: The following fields determine which parts of the conversation to train on.
 # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
 # See examples at `docs/dataset-formats/conversation.qmd`
-# Note: If the below 4 fields are set to empty, defaults to training only on the last message.
+# Note: If the below 5 fields are empty, defaults to training only on the last message.
 
 # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
 roles_to_train: ["assistant"] # default
@@ -196,7 +196,13 @@ datasets:
 # - turn (default): train on the EOS token at the end of each trainable turn
 # - last: train on the last EOS token in the conversation
 # TIP: Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
-train_on_eos: last
+train_on_eos: turn
+# Optional[str]. Which EOT (End-of-Turn) tokens to train on in the conversation. Possible values are:
+# - all: train on all EOT tokens
+# - turn: train on the EOT token at the end of each trainable turn
+# - last: train on the last EOT token in the conversation
+# If not specified, defaults to the value of train_on_eos for backward compatibility.
+train_on_eot:
 # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
 message_field_training: training
 # The key in the message turn that contains the training details. Useful to selectively train on certain tokens in a turn.
@@ -279,8 +285,17 @@ process_reward_model:
 chat_template: tokenizer_default
 # custom jinja template for chat template. This will be only used if chat_template is set to `jinja` or `null` (in which case chat_template is automatically set to `jinja`). Default is null.
 chat_template_jinja: null
-# Changes the default system message. Currently only supports chatml.
-default_system_message: You are a helpful assistant. Please give a long and detailed answer.
+# Optional[List[str]]. Custom EOT (End-of-Turn) tokens to mask/unmask during training.
+# These tokens mark the boundaries between conversation turns.
+# For example: ["/INST", "</s>", "[/SYSTEM_PROMPT]"]
+# If not specified, defaults to just the model's eos_token.
+# This is useful for templates that use multiple delimiter tokens.
+eot_tokens:
+# - "</s>"
+# - "[/INST]"
+# - "[/SYSTEM_PROMPT]"
+# Changes the default system message
+default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
 # Axolotl attempts to save the dataset as an arrow after packing the data together so
 # subsequent training attempts load faster, relative path
 dataset_prepared_path: data/last_run_prepared
@@ -665,8 +680,10 @@ special_tokens:
 # unk_token: "<unk>"
 # pad_token: "[PAD]"
 
-# Add extra tokens.
+# Optional[list[str]]. Add extra tokens to the tokenizer.
 tokens:
+# - "<|startoftext|>"
+# - "<|endoftext|>"
 
 # Mapping token_id to new_token_string to override reserved added_tokens in the tokenizer.
 # Only works for tokens that are not part of the base vocab (aka are added_tokens).
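
For orientation only (this block is not part of the commit), a minimal sketch of how the new `eot_tokens` and `train_on_eot` fields combine with the existing options; the dataset path, template delimiters, and token strings are illustrative assumptions taken from the examples above, not values from the diff:

```yaml
# Minimal sketch (illustrative, not from this commit): a template whose turns end
# with [/INST] / [/SYSTEM_PROMPT] rather than with the EOS token itself.
datasets:
  - path: my/sft-dataset        # hypothetical dataset path
    type: chat_template
    roles_to_train: ["assistant"]
    train_on_eos: last          # EOS handling, as before
    train_on_eot: turn          # new: EOT handling per trainable turn

# New: turn delimiters to mask/unmask; each must encode to a single token.
eot_tokens:
  - "[/INST]"
  - "[/SYSTEM_PROMPT]"

special_tokens:
  eos_token: "</s>"             # keep the tokenizer EOS aligned with the template's EOS
```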

docs/dataset-formats/conversation.qmd

Lines changed: 60 additions & 15 deletions
@@ -4,18 +4,6 @@ description: Conversation format for supervised fine-tuning.
 order: 3
 ---
 
-## sharegpt
-
-::: {.callout-important}
-ShareGPT is deprecated!. Please see [chat_template](#chat_template) section below.
-:::
-
-## pygmalion
-
-```{.json filename="data.jsonl"}
-{"conversations": [{"role": "...", "value": "..."}]}
-```
-
 ## chat_template
 
 Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. Support using tokenizer's template, a supported template, or custom jinja2.
@@ -64,7 +52,7 @@ We recommend checking the below examples for other usecases.
 
 ### Examples
 
-1. Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
+1. (Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
 
 ```yaml
 datasets:
@@ -109,10 +97,55 @@ datasets:
 ```
 
 ::: {.callout-important}
-Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
+Please make sure that your `tokenizer.eos_token` is same as EOS (End-of-Sequence) token in template. Otherwise, set `eos_token` under `special_tokens: `.
+:::
+
+5. If you are using a template that has a different EOT (End-of-Turn) token from EOS token or multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: ` which defaults to turn.
+
+```yaml
+eot_tokens:
+  - "[/INST]"
+  # - "[/SYSTEM_PROMPT]"
+
+datasets:
+  - path: ...
+    type: chat_template
+
+    # optional
+    train_on_eot: turn # defaults read from train_on_eos (which defaults to turn)
+```
+
+::: {.callout-tip}
+See [config documentation](../config.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
+:::
+
+::: {.callout-note}
+Using `eot_tokens` requires each token that exists in `chat_template` to be a single token in the tokenizer. Otherwise, the tokenizer will split the token and cause unexpected behavior.
+
+You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config.qmd) for more details.
+:::
+
+6. Continuing from the previous example, if you want to train on all EOT token trainable turns but only last EOS token, set `train_on_eos: last`.
+
+```yaml
+eot_tokens:
+  - "[/INST]"
+  # ...
+
+datasets:
+  - path: ...
+    type: chat_template
+
+    train_on_eos: last
+    train_on_eot: turn
+```
+
+::: {.callout-tip}
+If EOS token only appears at the end of a prompt, `train_on_eos: last` is equivalent to `train_on_eos: turn`. Therefore, generally, you can leave them to their defaults and omit them.
 :::
 
-5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
+
+7. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
 
 For a data sample that looks like:
 
@@ -162,3 +195,15 @@ datasets:
 ::: {.callout-tip}
 It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
 :::
+
+## sharegpt
+
+::: {.callout-important}
+ShareGPT is deprecated!. Please see [chat_template](#chat_template) section.
+:::
+
+## pygmalion
+
+```{.json filename="data.jsonl"}
+{"conversations": [{"role": "...", "value": "..."}]}
+```
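
Not part of the diff above: the callout-note's single-token requirement has two remedies mentioned in passing (`tokens: ` and `added_tokens_overrides: `). A minimal sketch of each, assuming `[/SYSTEM_PROMPT]` is not already a single token in your tokenizer; the token id below is a placeholder, not a real value:

```yaml
# Option A: register the delimiter as a brand-new token
# (typically requires resizing the model's embeddings).
tokens:
  - "[/SYSTEM_PROMPT]"

# Option B (recommended by the note above): repurpose an unused reserved added_token.
# 32002 is a placeholder id -- pick an actual unused added_token id from your tokenizer.
added_tokens_overrides:
  32002: "[/SYSTEM_PROMPT]"
```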

docs/faq.qmd

Lines changed: 32 additions & 2 deletions
@@ -73,10 +73,40 @@ description: Frequently asked questions
 
 > A: This is likely an empty turn.
 
-**Q: The EOS/EOT token is incorrectly being masked or not being masked.**
+**Q: The EOS token is incorrectly being masked or not being masked / `EOS token __ not found in chat template`.**
 
-> A: This is because of the mismatch between `tokenizer.eos_token` and EOS/EOT token in template. Please make sure to set `eos_token` under `special_tokens` to the same EOS/EOT token as in template.
+> A: There can be two reasons:
+
+> 1. This is because of the mismatch between `tokenizer.eos_token` and EOS token in template. Please make sure to set `eos_token: ` under `special_tokens: ` to the same EOS token as in template.
+
+> 2. The EOS token is not in the template. Please check if your template is correct. As an example, `phi_35` template does not use its dedicated EOS token `<|endoftext|>` at the end.
 
 **Q: "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null. Please add a `chat_template` in tokenizer config"**
 
 > A: This is because the tokenizer does not have a chat template. Please add a chat template in the tokenizer config. See [chat_template](dataset-formats/conversation.qmd#chat-template) for more details.
+
+**Q: The EOT token(s) are incorrectly being masked or not being masked / `EOT token __ not found in chat template`.**
+
+> A: There can be two reasons:
+
+> 1. The EOT token is different from the EOS token and was not specified under `eot_tokens: `. Please set `eot_tokens: ` to the same EOT token(s) as in template.
+
+> 2. There is more than one EOT token per turn in the template. Please raise an issue with examples as we recognize this as an edge case.
+
+**Q: `EOT token encoding failed. Please check if the token is valid and can be encoded.`**
+
+> A: There could be some issue with the tokenizer or unicode encoding. Please raise an issue with examples with the EOT token & tokenizer causing the issue.
+
+**Q: `EOT token __ is encoded as multiple tokens.`**
+
+> A: This is because the EOT token is encoded as multiple tokens which can cause unexpected behavior. Please add it under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `.
+
+**Q: `Conflict between train_on_eos and train_on_eot. eos_token is in eot_tokens and train_on_eos != train_on_eot`**
+
+> A: This is because the EOS token is in the `eot_tokens: ` while mismatch between `train_on_eos: ` and `train_on_eot: `. This will cause one to override the other. Please ensure that `train_on_eos: ` and `train_on_eot: ` are the same or remove the EOS token from `eot_tokens: `.
+
+**Q: If `eot_tokens: ` is not provided, what happens?**
+
+> A: If `eot_tokens: ` is not provided, the default behavior is the same as before. EOS tokens used to delimit turns are masked/unmasked depending on whether the turn is trainable.
+
+> Internally, `eot_tokens: tokenizer.eos_token` and `train_on_eot: train_on_eos` (which defaults to `turn`). This transition helps clarify the naming and behavior of EOT/EOS tokens.
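
To make the `train_on_eos`/`train_on_eot` conflict above concrete, a small sketch (not part of the commit; the dataset path and token strings are illustrative). Because `"</s>"` is both the tokenizer's EOS token and an entry in `eot_tokens: `, the two settings must agree:

```yaml
eot_tokens:
  - "</s>"          # also the tokenizer's eos_token
  - "[/INST]"

datasets:
  - path: ...
    type: chat_template
    # Conflict: train_on_eos: last together with train_on_eot: turn would disagree
    # about the shared "</s>" token. Either align the two settings as below,
    # or drop "</s>" from eot_tokens so they can differ.
    train_on_eos: turn
    train_on_eot: turn
```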

src/axolotl/integrations/kd/chat_template.py

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,8 @@ def __init__(
         sequence_len,
         roles_to_train=None,
         train_on_eos=None,
+        train_on_eot=None,
+        eot_tokens=None,
         logprobs_field="logprobs",
         gen_temperature=1.0,
         kd_temperature=1.0,
@@ -50,6 +52,8 @@ def __init__(
             sequence_len,
             roles_to_train=roles_to_train,
             train_on_eos=train_on_eos,
+            train_on_eot=train_on_eot,
+            eot_tokens=eot_tokens,
         )
 
     @property
