docs/advanced-data-preprocessing.md: 51 additions & 0 deletions
@@ -277,6 +277,57 @@ Note: Streaming datasets or use of `IterableDatasets` is not compatible with the
If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
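
As a rough sketch of that calculation, the snippet below computes `max_steps` for a known dataset size. The function name and the `num_epochs` parameter are illustrative assumptions and not part of the library; `batch_size` is assumed to mean the effective (global) batch size per optimizer step, and rounding up so that a final partial batch still counts is also an assumption.

```python
import math


def estimate_max_steps(num_samples: int, batch_size: int, num_epochs: int = 1) -> int:
    """Hypothetical helper: steps needed to see `num_samples` samples `num_epochs` times."""
    if num_samples <= 0 or batch_size <= 0 or num_epochs <= 0:
        raise ValueError("num_samples, batch_size and num_epochs must be positive")
    # Round up so a final, partially filled batch still counts as one step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


print(estimate_max_steps(num_samples=10_000, batch_size=32))  # 313
```

If your setup drops the last incomplete batch, floor division would be the closer match.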

### How users can specify the chat template

In the `data_config.yaml` file:

**✅ USE:**

```yaml
dataprocessor:
  chat_template: "my single line chat template"
```

The recommended way is to copy and paste the chat template from the official checkpoint's `tokenizer_config.json`: https://huggingface.co/ibm-granite/granite-3.1-8b-instruct/blob/main/tokenizer_config.json#L188
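
A minimal sketch of that copy step, using only the Python standard library. It assumes `tokenizer_config.json` stores the template as a single string under a top-level `chat_template` key; the helper name `single_line_chat_template_yaml` and the tiny inline config are hypothetical. It relies on JSON string syntax also being valid as a YAML double-quoted scalar, so `json.dumps` keeps the template on one line with newlines written as `\n`.

```python
import json


def single_line_chat_template_yaml(tokenizer_config_text: str) -> str:
    """Hypothetical helper: build a data-config fragment with a one-line chat template."""
    config = json.loads(tokenizer_config_text)
    template = config["chat_template"]  # assumed key name and string value
    # json.dumps emits one double-quoted line; real newlines become the two
    # characters \n, which YAML decodes again inside a double-quoted scalar.
    quoted = json.dumps(template)
    return f"dataprocessor:\n  chat_template: {quoted}\n"


# Stand-in for the contents of tokenizer_config.json (illustrative only).
sample = json.dumps(
    {"chat_template": "{%- for m in messages %}\n{{ m['content'] }}\n{%- endfor %}"}
)
print(single_line_chat_template_yaml(sample))
```

The printed fragment can be pasted into `data_config.yaml` next to the other `dataprocessor` settings.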

**✅ (Optional) USE:**

```yaml
dataprocessor:
  chat_template: |
    my multi-line chat template
```

Specifying a multi-line chat template requires some manual effort on the user's part to ensure that newlines are specified correctly.
This approach is mainly useful for readability, especially if you are customizing the chat template.

Example:

```yaml
dataprocessor:
  chat_template: |
    {%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content'] %}
    {%- set loop_messages = messages[1:] %}
    {%- else %}
    {%- set system_message = "Knowledge Cutoff Date: April 2024.
    Today's Date: " + strftime_now('%B %d, %Y') + ".
    You are Granite, developed by IBM." %}
    {%- if tools and documents %}
    ................
```
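
The manual effort of getting newlines right in a multi-line template can be reduced by generating the block from the single-line form. The sketch below is an illustration under stated assumptions and not part of the library: the helper name `block_scalar_chat_template_yaml` is hypothetical, the input is assumed to be the JSON-quoted one-line template, and the output uses the same `|` literal style as the examples above.

```python
import json


def block_scalar_chat_template_yaml(json_quoted_template: str) -> str:
    """Hypothetical helper: turn a JSON-quoted one-line template into a YAML literal block."""
    template = json.loads(json_quoted_template)  # decodes \n into real newlines
    if template.startswith((" ", "\t")):
        # A literal block infers its indentation from the first line;
        # that case is not handled in this sketch.
        raise ValueError("template must not start with whitespace")
    pad = " " * 4  # deeper than the `chat_template` key (assumed 2-space nesting)
    # Indent every non-empty line; empty lines may stay empty inside a literal block.
    body = "\n".join(pad + line if line else "" for line in template.split("\n"))
    # Note: with a plain `|`, a YAML parser appends one trailing newline to the value.
    return f"dataprocessor:\n  chat_template: |\n{body}\n"


print(block_scalar_chat_template_yaml('"{%- if a %}\\nA\\n{%- endif %}"'))
```

Each `\n` in the one-line form becomes a real line break in the block, so the template reads the same after the YAML file is loaded, apart from the trailing newline noted in the comment.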

**❌ DO NOT USE:**

```yaml
dataprocessor:
  chat_template: |
    my single line chat template
```

This can add extra backslashes to your chat template, causing it to become invalid.
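
A small sanity check can catch this before training. The sketch below is a heuristic under stated assumptions, not library behavior: it assumes the failure mode is that escape sequences such as `\n` or `\"` from a single-line template survive as literal backslash characters in the loaded string (a literal `|` block does not process escapes), and the function name is hypothetical.

```python
import re

# Backslash followed by a character commonly used in JSON/YAML escape sequences.
_LITERAL_ESCAPE = re.compile(r'\\[nrt"\\]')


def has_literal_escapes(template: str) -> bool:
    """Hypothetical check: True if the loaded template still contains raw escape sequences."""
    return _LITERAL_ESCAPE.search(template) is not None


print(has_literal_escapes("{{ a }}\\n{{ b }}"))  # True: backslash + n, not a newline
print(has_literal_escapes("{{ a }}\n{{ b }}"))   # False: a real newline
```

Because a template may use backslashes on purpose (for example inside a Jinja string literal), a `True` result is a prompt to inspect the template rather than proof that it is broken.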

### Example data configs.

We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)