Commit 0eaca37

update: updates documentation with chat template guide flowchart (#445)
Signed-off-by: yashasvi <[email protected]>
1 parent f22e243 commit 0eaca37

2 files changed: 12 additions, 0 deletions


README.md

Lines changed: 12 additions & 0 deletions
@@ -175,6 +175,18 @@ For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-
The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to mask the text, ensuring that the model learns only from the `assistant` responses in both single-turn and multi-turn chat.
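
To make the masking idea concrete, here is a minimal pure-Python sketch of completion-only label masking. It is not TRL's implementation: the function names, the marker sequences, and the toy token ids are all hypothetical. The only convention assumed is the `-100` ignore index that PyTorch's cross-entropy loss skips by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_all(seq, pattern):
    """Return every start index at which `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_assistant(input_ids, response_ids, instruction_ids):
    """Build labels that keep only the tokens of assistant responses.

    `response_ids` marks the start of an assistant turn and `instruction_ids`
    marks the start of a user turn (both hypothetical marker sequences).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # An assistant turn ends where the next user turn begins, or at the end.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy two-turn conversation: 1 marks a user turn, 2 marks an assistant turn.
toy_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
toy_labels = mask_non_assistant(toy_ids, response_ids=[2], instruction_ids=[1])
```

In this toy run only the tokens `20`, `21`, and `22` keep their ids as labels; every other position is set to `-100`, so the loss ignores the user turns and the turn markers.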

+Depending on the scenario, users may need to decide how to apply a chat template to their data, or which chat template to use for their use case.
+
+The flow chart below summarizes our guidelines:
+![guidelines for chat template](docs/images/chat_template_guide.jpg)
+
+Here are some scenarios addressed in the flow chart:
+1. Depending on the model, its tokenizer may or may not include a chat template.
+2. If a template is available, the `JSON object schema` of the dataset might not match the `string format` that the chat template expects.
+3. The chat template might use special tokens that the tokenizer is unaware of, for example `<|start_of_role|>`. This can cause issues during tokenization, because such a token might not be treated as a single token.
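
The three scenarios above can be checked up front with a few small helpers. This is a hedged sketch, not part of this repository: the helper names and the `role`/`content` message schema are assumptions for illustration. The tokenizer is only expected to expose a `chat_template` attribute and a `tokenize` method, as Hugging Face tokenizers do.

```python
REQUIRED_KEYS = {"role", "content"}  # assumed message schema, for illustration only


def has_chat_template(tokenizer):
    """Scenario 1: does the tokenizer ship with a chat template?"""
    return getattr(tokenizer, "chat_template", None) is not None


def schema_problems(messages):
    """Scenario 2: indices of messages missing the keys the template is assumed to read."""
    return [
        i for i, message in enumerate(messages)
        if not isinstance(message, dict) or not REQUIRED_KEYS.issubset(message)
    ]


def split_tokens(tokenizer, tokens):
    """Scenario 3: template tokens that the tokenizer breaks into several pieces."""
    return [token for token in tokens if len(tokenizer.tokenize(token)) != 1]
```

If `split_tokens` reports a token such as `<|start_of_role|>`, one common approach with Hugging Face tokenizers is to register it through `add_special_tokens` and then resize the model's embeddings; whether that is the right fix depends on the model and on the path chosen in the flow chart.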
+
+
+
### 4. Pre tokenized datasets.

Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) as the `--training_data_path` argument, e.g.