Commit d5db867

docs: add docs

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

1 parent: dc03331

4 files changed (+71, -3 lines)

build/Dockerfile

Lines changed: 2 additions & 0 deletions

@@ -165,12 +165,14 @@ RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
 # fms_acceleration_foak = Fused LoRA and triton kernels
 # fms_acceleration_aadp = Padding-Free Flash Attention Computation
 # fms_acceleration_moe = Parallelized Mixture of Experts
+# fms_acceleration_odm = Online Data Mixing
 RUN if [[ "${ENABLE_FMS_ACCELERATION}" == "true" ]]; then \
        python -m pip install --user "$(head bdist_name)[fms-accel]"; \
        python -m fms_acceleration.cli install fms_acceleration_peft; \
        python -m fms_acceleration.cli install fms_acceleration_foak; \
        python -m fms_acceleration.cli install fms_acceleration_aadp; \
        python -m fms_acceleration.cli install fms_acceleration_moe; \
+       python -m fms_acceleration.cli install fms_acceleration_odm; \
    fi
 
 RUN if [[ "${ENABLE_AIM}" == "true" ]]; then \

build/nvcr.Dockerfile

Lines changed: 2 additions & 1 deletion

@@ -57,7 +57,8 @@ RUN if [[ "${ENABLE_FMS_ACCELERATION}" == "true" ]]; then \
        python -m fms_acceleration.cli install fms_acceleration_peft && \
        python -m fms_acceleration.cli install fms_acceleration_foak && \
        python -m fms_acceleration.cli install fms_acceleration_aadp && \
-       python -m fms_acceleration.cli install fms_acceleration_moe; \
+       python -m fms_acceleration.cli install fms_acceleration_moe && \
+       python -m fms_acceleration.cli install fms_acceleration_odm; \
    fi
 
 RUN if [[ "${ENABLE_ALORA}" == "true" ]]; then \

docs/advanced-data-preprocessing.md

Lines changed: 66 additions & 2 deletions

@@ -4,6 +4,7 @@ Our library also supports a powerful data processing backend which can be used b
 1. Creating custom data processing pipeline for the datasets.
 1. Combining multiple datasets into one, even if they have different formats.
 1. Mixing datasets as required and sampling each dataset with different weights.
+1. Dynamically mixing datasets online based on training signals, through the `fms_acceleration_odm` plugin.
 
 These things are supported via what we call a [`data_config`](#data-config) which can be passed as an argument to sft trainer.
 
@@ -34,7 +35,7 @@ process the datasets. Users can currently pass both YAML or JSON based configura
 The data config schema is designed to define datasets and their processing strategies in a structured way.
 
 It consists of the following top-level keys:
-- `datapreprocessor`: Defines global data processing parameters, such as the type (`default`), sampling stopping strategy (`all_exhausted` or `first_exhausted`), and sampling seed for reproducibility.
+- `datapreprocessor`: Defines global data processing parameters, such as the type (`default` or `odm`), sampling stopping strategy (`all_exhausted` or `first_exhausted`), and sampling seed for reproducibility.
 - `datasets`: A list of dataset configurations, each describing the dataset name, paths, optional builders, sampling ratios, and data handlers.
 
 At the top level, the data config schema looks like this:
@@ -129,11 +130,21 @@ definitions:
 Users can create a data config file in any of YAML or JSON format they choose (we provide examples of YAML for ease of use). The file should follow the schema outlined above with the following parameters:
 
 `datapreprocessor`:
-- `type` (optional, str): Type of data preprocessor, `default` is currently the only supported type.
+- `type` (optional, str): Type of data preprocessor; `default` and `odm` are the two supported types. Use of `odm` requires [installation](./tuning-techniques.md#fms-acceleration) of the `fms_acceleration_odm` package.
 - `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
 - `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy in case of choosing to mix multiple datasets by weight, supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy), defaults to `all_exhausted`.
 - `seed` (optional, int): [seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets, for reproducibility choose same value, defaults to 42.
 - `chat_template` (optional, str): pass `chat_template` via data_config for multi-turn data, replaces existing default chat template.
+- `odm` (optional): required when `type` is `odm`; provides the configuration specific to online data mixing.
+
+The `odm` config is required when the `datapreprocessor` `type` is `odm` and has the following fields:
+
+`odm`:
+- `update_interval` (optional, int, defaults to `1`): A Multi-Armed Bandit (MAB) is used to learn from the training signals and provide mixing probabilities across the datasets. `update_interval` defines how frequently, in steps, the MAB is updated with training signals.
+- `sampling_interval` (optional, int, defaults to `1`): Defines how frequently, in samples, a dataset is chosen to sample from through the MAB.
+- `reward_type` (optional, str, defaults to `entropy`): Type of reward used to update the MAB. Currently supported rewards are `train_loss`, `validation_loss`, `entropy`, `entropy3_varent1`, `entropy_last_token`, and `gradnorm`. More details can be found [here](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/online-data-mixing#rewards).
+- `gamma` (optional, float, defaults to `0.1`): MAB hyper-parameter, similar to an exploration factor.
+- `eta` (optional, float, defaults to `0.1`): MAB hyper-parameter, similar to a learning rate.
 
 `datasets` (list):
 - `name` (optional, str): A unique identifier for the dataset.
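
For intuition about how `gamma` and `eta` shape the mixing, here is a generic Exp3-style bandit sketch; it is illustrative only and not necessarily the exact formulation used by the `fms_acceleration_odm` plugin (see the plugin README linked above for the authoritative details). With `K` datasets, the sampling probabilities mix a learned weight distribution with uniform exploration, and the weights move in the direction of the observed reward:

$$p_i = (1-\gamma)\,\frac{w_i}{\sum_{j=1}^{K} w_j} + \frac{\gamma}{K}, \qquad w_i \leftarrow w_i \cdot \exp\!\left(\eta\,\hat{r}_i\right),$$

where $\hat{r}_i$ is the reward estimate for dataset $i$ derived from the configured `reward_type`. A larger `gamma` keeps sampling closer to uniform (more exploration), while a larger `eta` makes the mixing probabilities react more strongly to recent rewards.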
@@ -192,6 +203,59 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2
 
 Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset
 
+### Online Data Mixing
+Dataset mixing can be dynamic in nature, adapting online during training based on training signals. We provide this feature through the `fms_acceleration_odm` plugin; more details can be found [here](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/online-data-mixing).
+
+#### How to Use
+
+The `dataprocessor` `type` has to be set to `odm`, and the `odm` configuration should be provided in the `odm` section of the data config file. An example is shown below:
+
+```yaml
+dataprocessor:
+  type: odm
+  odm:
+    update_interval: 1 # update every step
+    sampling_interval: 1 # choose a dataset for every sample
+    reward_type: validation_loss # uses eval loss of each dataset as reward
+    gamma: 0.1 # MAB hyper-parameter
+    eta: 0.2 # MAB hyper-parameter
+```
+
+Here `update_interval` is set to `1`, so the MAB is updated on every step, and `reward_type` is set to `validation_loss`, so the validation loss across datasets is the training signal used to reward MAB decisions during training. `sampling_interval` is set to `1`, so a dataset is chosen through the MAB for every sample drawn. An example `datasets` section can look like the one below:
+
+```yaml
+datasets:
+  - name: dataset_1
+    split:
+      train: 0.8
+      validation: 0.2
+    data_paths:
+      - "FILE_PATH"
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_column_name: input
+            output_column_name: output
+  - name: dataset_2
+    split:
+      train: 0.9
+      validation: 0.1
+    data_paths:
+      - "FILE_PATH"
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_column_name: input
+            output_column_name: output
+```
+As you can see, `validation` under `split` is provided for each of the datasets; it must be provided since the `reward_type` is `validation_loss`, which requires validation datasets to be available. The same applies to the `entropy`, `entropy3_varent1`, and `entropy_last_token` rewards, while the `train_loss` and `gradnorm` reward types do not require a validation split.
+
 ### Dataset Splitting
 
 In addition to [sampling and mixing](#data-mixing), our library supports **dataset splitting**, which allows users to split a dataset into training and validation splits using the `split` field in the dataset config.
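
For reference, the `dataprocessor` and `datasets` snippets added above live in the same data_config file. A combined sketch is shown below; the dataset names and `FILE_PATH` placeholders are illustrative, carried over from the example in the diff.

```yaml
dataprocessor:
  type: odm
  odm:
    update_interval: 1
    sampling_interval: 1
    reward_type: validation_loss
    gamma: 0.1
    eta: 0.2
datasets:
  - name: dataset_1
    split:
      train: 0.8
      validation: 0.2
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_column_name: input
            output_column_name: output
  - name: dataset_2
    split:
      train: 0.9
      validation: 0.1
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_column_name: input
            output_column_name: output
```

This single file is then passed to the SFT trainer as its data config argument, as noted at the top of the doc.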

docs/tuning-techniques.md

Lines changed: 1 addition & 0 deletions

@@ -470,6 +470,7 @@ The list of configurations for various `fms_acceleration` plugins:
 - `--multipack`: technique for *multi-gpu training* to balance out number of tokens processed in each device, to minimize waiting time.
 - [fast_moe_config](./tuning/config/acceleration_configs/fast_moe.py) (experimental):
   - `--fast_moe`: trains MoE models in parallel with [Scatter MoE kernels](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/accelerated-moe#fms-acceleration-for-mixture-of-experts), increasing throughput and decreasing memory usage.
+- [odm_config](./tuning/config/acceleration_configs/odm.py) (experimental): allows dynamically mixing datasets online during training, adapting to training signals. See [advanced data preprocessing](./advanced-data-preprocessing.md#online-data-mixing) for usage with data_config.
 
 Notes:
 * `quantized_lora_config` requires that it be used along with LoRA tuning technique. See [LoRA tuning section](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#lora-tuning-example) on the LoRA parameters to pass.
