- `type` (optional, str): Type of data preprocessor; `default` is currently the only supported type.
- `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
- `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy when mixing multiple datasets by weight; supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy). Defaults to `all_exhausted`.
- `seed` (optional, int): [Seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) used when interleaving datasets; use the same value across runs for reproducibility. Defaults to `42`.
- `chat_template` (optional, str): Chat template passed via the data config for multi-turn data; replaces the existing default chat template.

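Taken together, the fields above might be sketched as a single `dataprocessor` block (the enclosing `dataprocessor` key follows the `dataprocessor.streaming` reference later in this document; the values shown are illustrative assumptions, not recommendations):

```yaml
dataprocessor:
  type: default                             # only supported type today
  streaming: false                          # set true to stream via IterableDatasets
  sampling_stopping_strategy: all_exhausted # or first_exhausted
  seed: 42                                  # keep fixed for reproducible interleaving
```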
`datasets` (list):
- `rename_columns` (optional, dict[str:str]): Specifies a dictionary of columns to rename, like `{"old_name": "new_name"}`, applied at dataset load time. *Applied before `retain_columns` if both are specified*.
- `retain_columns` (optional, list[str]): Specifies a list of columns to retain, e.g. `["input_ids", "labels"]`; every other column will be dropped at dataset load time. *Applied strictly after `rename_columns` if both are specified*.
- `sampling` (optional, float): The sampling ratio (0.0 to 1.0) with which to sample a dataset in case of interleaving.
- `split` (optional, dict[str:float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
- `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.

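As one illustrative dataset entry combining the fields above (the dataset name and path are hypothetical placeholders, and the column names are assumptions for the example):

```yaml
datasets:
  - name: my_dataset              # hypothetical name
    data_paths:
      - "path/to/data.jsonl"      # placeholder path
    rename_columns:
      output: labels              # renaming is applied first
    retain_columns:               # retention runs after renaming,
      - input                     # so it must use the new name `labels`;
      - labels                    # all other columns are dropped
    sampling: 0.3                 # interleave sampling ratio
```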
Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use the [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.

Note: If a user specifies data sampling, they can expect the datasets to be mixed and individual samples in the dataset to remain unbroken, unless the `max_seq_len` argument is smaller than the length of individual samples in the dataset.

### Dataset Splitting
In addition to [sampling and mixing](#data-mixing), our library supports **dataset splitting**, which allows users to split a dataset into training and validation splits using the `split` field in the dataset config.

This is especially useful when users want to split a single dataset (or multiple datasets) internally instead of supplying separate files for training and validation.

#### How to Use

The `split` field in each dataset config allows users to internally divide a dataset into `train` and `validation` sets using fractional ratios.

To use it, specify both `train` and `validation` ratio values under the `split` key for each dataset. Example:

```yaml
datasets:
  - name: my_dataset
    split:
      train: 0.8
      validation: 0.2
    data_paths:
      - "path/to/data.jsonl"
```

### Split Support for Streaming vs Non-Streaming Datasets

- For streaming datasets, partial splits like `train: 0.8, validation: 0.2` are not supported and will raise a `NotImplementedError`.
- If no `split` is defined, the dataset is returned unchanged.
- Streaming behavior must be explicitly enabled via `dataprocessor.streaming: true`.

### Using Separate Files for Train and Validation Splits

If you want to use **separate files for training and validation**, you can define them as **separate dataset entries** in the `datasets` section of your config.

In each entry:

- Specify the corresponding file in the `data_paths` field.
- Set the `split` value to either `train: 1.0` or `validation: 1.0` as appropriate.

This allows you to fully control which file is used for which purpose, without relying on automatic or in-place splitting.

#### Example

```yaml
datasets:
  - name: my_train_set
    split:
      train: 1.0
    data_paths:
      - "path/to/train.jsonl"
  - name: my_val_set
    split:
      validation: 1.0
    data_paths:
      - "path/to/val.jsonl"
```

### **Note:**

> - While passing a validation dataset via the command line is possible using the `validation_data_path` argument, **this argument is not compatible with `data_config`**. If you're using a `data_config`, define the validation set within it using a `split: validation: 1.0` entry instead, as shown [here](#using-separate-files-for-train-and-validation-splits).
> - Dataset splitting is performed based on the `split` configuration, supporting only `"train"` and `"validation"` splits. Support for a `"test"` split is not yet available.
> - **Only the `"train"` split is sampled**, and **sampling is done after splitting**. This ensures that validation remains consistent and unbiased, while allowing training to be performed on a controlled subset if desired.
> - **⚠️ Users must explicitly set the [`eval_strategy`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_strategy) in the Trainer's arguments to a valid value (e.g., `"steps"` or `"epoch"`) for evaluation to run. Splitting the dataset alone does not trigger evaluation and will likely result in an error if `eval_strategy` is left unset.**

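To illustrate the note that sampling applies only to the `train` split and runs after splitting, a config like the following could be used (the values are illustrative assumptions):

```yaml
datasets:
  - name: my_dataset
    sampling: 0.5            # applied to the train split only, after splitting
    split:
      train: 0.8             # 80% of the data becomes train, then is sampled
      validation: 0.2        # validation is split off first and never sampled
    data_paths:
      - "path/to/data.jsonl"
```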
### Data Streaming
Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.