## Data Support
Users can pass training data as either a single file or a Hugging Face dataset ID using the `--training_data_path` argument, along with other arguments required for the various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). If you choose to pass a file, it can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.

Below is the list of data use cases supported via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).

## Supported Data Formats

An EOS token is appended to the end of each data point (e.g., a sentence or paragraph within the dataset) for all data formats listed below, except for the pretokenized data format at this time. For more info, see [pretokenized](#4-pre-tokenized-datasets).

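
Conceptually, this EOS handling amounts to appending the EOS token to each text example. A minimal sketch, assuming string-level data (the `</s>` token value is a placeholder; real models define their own EOS token):

```python
# Illustrative sketch only: append an EOS token to each text example,
# skipping examples that already end with it. The "</s>" value is a
# stand-in for whatever EOS token the model's tokenizer defines.
def append_eos(examples, eos_token="</s>"):
    return [
        ex if ex.endswith(eos_token) else ex + eos_token
        for ex in examples
    ]

data = ["First training sentence.", "Second training sentence.</s>"]
print(append_eos(data))
```
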
## Supported Data File Formats

We support the following file formats via the `--training_data_path` argument:

Data Format | Tested Support
------------|---------------
ARROW       | ✅

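
For example, a minimal JSONL training file could be produced like this (the `input`/`output` field names are illustrative; the required columns depend on your chosen use case below):

```python
import json
import os
import tempfile

# Write a minimal JSONL training file. The "input"/"output" field names
# are hypothetical -- the actual columns depend on the use case.
records = [
    {"input": "What is 2+2?", "output": "4"},
    {"input": "Capital of France?", "output": "Paris"},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
# The resulting path can then be passed as --training_data_path.
```
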
As noted above, we also support passing an HF dataset ID directly via the `--training_data_path` argument.

**NOTE**: Due to the variety of supported data formats and file types, `--training_data_path` is handled as follows:

- If `--training_data_path` ends in a valid file extension (e.g., `.json`, `.csv`), it is treated as a file.
- If `--training_data_path` points to a valid folder, it is treated as a folder.
- If neither of these is true, the data preprocessor tries to load `--training_data_path` as a Hugging Face (HF) dataset ID.

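
That resolution order can be sketched as follows (a simplified illustration, not the preprocessor's actual code; the set of recognized extensions is an assumption):

```python
import os

# Simplified sketch of how --training_data_path could be resolved:
# known file extension -> file, existing folder -> folder,
# otherwise fall back to treating it as an HF dataset ID.
KNOWN_EXTENSIONS = {".json", ".jsonl", ".csv", ".parquet", ".arrow"}  # assumed list

def resolve_training_data_path(path):
    ext = os.path.splitext(path)[1].lower()
    if ext in KNOWN_EXTENSIONS:
        return "file"
    if os.path.isdir(path):
        return "folder"
    return "hf_dataset_id"

print(resolve_training_data_path("data/train.jsonl"))      # file
print(resolve_training_data_path("some-org/some-dataset")) # hf_dataset_id
```
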
## Use cases supported with `training_data_path` argument
### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to mask the text, ensuring the model learns only from the `assistant` responses in both single- and multi-turn chat.
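
The completion-only masking this collator performs can be illustrated at the token level: label positions up to and including the response template are set to `-100` so the loss ignores them. A simplified pure-Python sketch (made-up token IDs; the real collator operates on tokenizer output and handles edge cases such as a missing template differently):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_prompt_tokens(input_ids, response_template_ids):
    """Copy input_ids to labels, masking everything up to and including
    the first occurrence of the response template token sequence."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            for j in range(i + n):
                labels[j] = IGNORE_INDEX
            break
    return labels

# Made-up IDs: [prompt..., template (90, 91), assistant answer...]
print(mask_prompt_tokens([5, 6, 90, 91, 7, 8], [90, 91]))
# -> [-100, -100, -100, -100, 7, 8]
```
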

### 3. Chat templates

Depending on the scenario, users may need to decide how to use a chat template with their data, or which chat template to use for their use case.

Our guidelines are summarized in the following flow chart:

![guidelines_chat_template](docs/images/guidelines_chat_template.jpg)

Here are some scenarios addressed in the flow chart:

1. Depending on the model, its tokenizer may or may not include a chat template.

2. If a template is available, the dataset's `json object schema` might not match the chat template's `string format`.

3. The chat template may use special tokens the tokenizer is unaware of, for example `<|start_of_role|>`, which might not be treated as single tokens and can therefore cause issues during tokenization.
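
As an illustration of scenario 2, converting a flat record schema into the `messages` list shape that most chat templates consume might look like this (the `input`/`output` field names are hypothetical):

```python
def to_messages(record):
    # Convert a flat {"input": ..., "output": ...} record (illustrative
    # field names) into the messages schema most chat templates expect.
    return {
        "messages": [
            {"role": "user", "content": record["input"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }

flat = {"input": "Hello!", "output": "Hi, how can I help?"}
print(to_messages(flat)["messages"][1]["role"])  # assistant
```
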
### 4. Pre tokenized datasets.
Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.
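
For illustration, a pretokenized JSONL record could look like the following (the token IDs are arbitrary placeholders, not output of any real tokenizer; `-100` marks label positions excluded from the loss):

```python
import json

# A hypothetical pretokenized example: IDs are placeholders, and -100
# in labels marks prompt positions ignored by the loss.
record = {
    "input_ids": [101, 2054, 2003, 1037, 2944, 102],
    "labels":    [-100, -100, -100, 1037, 2944, 102],
}
line = json.dumps(record)
print(line)
```
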

At this time, the data preprocessor does not add EOS tokens to pretokenized datasets; users must ensure EOS tokens are included in their pretokenized data if needed.
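
If needed, appending EOS to pretokenized examples yourself might look like this (the EOS ID below is a placeholder; use your tokenizer's actual `eos_token_id`):

```python
def append_eos_ids(example, eos_id):
    # Append the EOS id to both input_ids and labels unless already present.
    # eos_id is a placeholder here; real values come from the tokenizer.
    if example["input_ids"][-1] != eos_id:
        example["input_ids"].append(eos_id)
        example["labels"].append(eos_id)
    return example

ex = {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]}
print(append_eos_ids(ex, eos_id=2))
```
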

For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).