The data config schema is designed to define datasets and their processing strategies in a structured way.
It consists of the following top-level keys:
- `datapreprocessor`: Defines global data processing parameters, such as the type (`default` or `odm`), sampling stopping strategy (`all_exhausted` or `first_exhausted`), and sampling seed for reproducibility.
- `datasets`: A list of dataset configurations, each describing the dataset name, paths, optional builders, sampling ratios, and data handlers.
At the top level, the data config schema looks like this:
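
The sketch below is illustrative only; each key is described in detail in the sections that follow, and the dataset and handler names used here are hypothetical.

```yaml
datapreprocessor:
  type: default                      # "default" or "odm"
  sampling_stopping_strategy: all_exhausted
  seed: 42
datasets:
  - name: dataset_1                  # hypothetical dataset name
    data_handlers:
      - name: some_handler           # hypothetical handler name
        arguments: {}
```
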
Users can create a data config file in either YAML or JSON format (we provide YAML examples for ease of use). The file should follow the schema outlined above, with the following parameters:
`datapreprocessor`:
- `type` (optional, str): Type of data preprocessor; `default` and `odm` are the two supported types. Use of `odm` requires [installation](./tuning-techniques.md#fms-acceleration) of the `fms_acceleration_odm` package.
- `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
- `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy used when choosing to mix multiple datasets by weight; supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy); defaults to `all_exhausted`.
- `seed` (optional, int): [seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets; choose the same value across runs for reproducibility; defaults to 42.
- `chat_template` (optional, str): pass a `chat_template` via the data config for multi-turn data; it replaces the existing default chat template.
- `odm` (optional): required when `type` is `odm`; provides the configuration for online data mixing (see the [Online data mixing section](#online-data-mixing-section) below).
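
For example, a minimal `datapreprocessor` block using the values described above might look like the following sketch (values are illustrative; the `odm` sub-block is only needed when `type` is `odm`, as covered below):

```yaml
datapreprocessor:
  type: default
  streaming: false
  sampling_stopping_strategy: all_exhausted
  seed: 42
```
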
Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
These functions can process the dataset in any way users require, and the `list` of data handlers specified for each dataset is applied in order.
Each data handler has:
- `name`: The handler's unique identifier.
- `arguments`: A dictionary of parameters specific to the handler.
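
For example, a `data_handlers` list with two handlers, applied in the order listed, might look like this sketch (the handler names and arguments here are hypothetical; the set of available handlers and their arguments is defined by the library):

```yaml
data_handlers:
  - name: remove_empty_rows          # hypothetical handler name
    arguments:
      column_name: "text"            # hypothetical argument
  - name: truncate_long_samples      # hypothetical handler name
    arguments:
      max_length: 4096               # hypothetical argument
```
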
#### Online data mixing section
The `odm` config has the following fields and is required when the `datapreprocessor` `type` is `odm`.

`odm`:
- `update_interval` (optional, int, defaults to `1`): A Multi-Armed Bandit (MAB) is used to learn from training signals and provide mixing probabilities across datasets. `update_interval` defines how frequently the MAB is updated with training signals, in terms of step count.
- `sampling_interval` (optional, int, defaults to `1`): Defines how frequently a dataset is chosen to sample from through the MAB. The value is given in terms of sample count.
- `reward_type` (optional, str, defaults to `entropy`): Type of reward used to update the MAB. Currently supported rewards are `train_loss`, `validation_loss`, `entropy`, `entropy3_varent1`, `entropy_last_token`, and `gradnorm`. More details can be found [here](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/online-data-mixing#rewards).
- `gamma` (optional, float, defaults to `0.1`): MAB hyper-parameter, similar to an exploration factor.
- `eta` (optional, float, defaults to `0.1`): MAB hyper-parameter, similar to a learning rate.
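
Putting these together, an `odm` block with every field set explicitly to its documented default looks like this sketch:

```yaml
datapreprocessor:
  type: odm
  odm:
    update_interval: 1        # update the MAB with training signals every step
    sampling_interval: 1      # re-select the dataset to sample from every sample
    reward_type: entropy
    gamma: 0.1                # exploration factor
    eta: 0.1                  # learning rate
```
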
`datasets` (list):
- `name` (optional, str): A unique identifier for the dataset.
- `split` (optional, dict[str, float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
- `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.
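
A single dataset entry tying these fields together might look like the following sketch (the dataset name and handler are hypothetical; other dataset fields such as the data paths are omitted here, and `split` is shown separately under [Dataset Splitting](#dataset-splitting)):

```yaml
datasets:
  - name: my_dataset                 # hypothetical dataset name
    data_handlers:
      - name: remove_empty_rows      # hypothetical handler name
        arguments:
          column_name: "text"        # hypothetical argument
```
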
We provide some sample `data_configs` in [predefined_data_configs](../tests/artifacts/predefined_data_configs/).
Note: If a user specifies data sampling, they can expect the datasets to be mixed and individual samples to remain unbroken, unless the `max_seq_len` argument is smaller than the length of individual samples in the dataset.
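
As a sketch, mixing two datasets with a 30/70 ratio and a fixed seed might look like this (the dataset names are hypothetical, and the per-dataset `sampling` field name is assumed here; see the sample configs linked above for exact usage):

```yaml
datapreprocessor:
  sampling_stopping_strategy: all_exhausted
  seed: 42
datasets:
  - name: dataset_a        # hypothetical dataset name
    sampling: 0.3          # assumed field name for the mixing ratio
  - name: dataset_b        # hypothetical dataset name
    sampling: 0.7
```
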
### Dataset Splitting
In addition to [sampling and mixing](#data-mixing), our library supports **dataset splitting**, which allows users to split a dataset into training and validation splits using the `split` field in the dataset config.
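
For example, a sketch of an 80/20 train/validation split on a hypothetical dataset:

```yaml
datasets:
  - name: my_dataset       # hypothetical dataset name
    split:
      train: 0.8
      validation: 0.2
```

With this configuration, 80% of the dataset's samples are used for training and the remaining 20% are held out for validation.
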
- `--fast_moe`: trains MoE models in parallel with [Scatter MoE kernels](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/accelerated-moe#fms-acceleration-for-mixture-of-experts), increasing throughput and decreasing memory usage.
- [odm_config](./tuning/config/acceleration_configs/odm.py) (experimental): See [online data mixing](./online-data-mixing.md) and the [PyTorch conf poster](https://static.sched.com/hosted_files/pytorchconference/70/PyTorch%20Native%20Online%20Dynamic%20Reward%20Based%20Data%20Mixing%20Framework.pdf) for usage with `data_config`. This plugin allows dynamically mixing datasets online during training, adapting to training signals.

Notes:
* `quantized_lora_config` requires that it be used along with the LoRA tuning technique. See the [LoRA tuning section](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#lora-tuning-example) for the LoRA parameters to pass.
* When setting `--auto_gptq triton_v2`, note to also pass `--torch_dtype float16` and `--fp16`, or an exception will be raised. This is because these kernels only support this dtype.
* When using `fused_ops_and_kernels` together with `quantized_lora_config`,