README.md: 14 additions & 1 deletion
@@ -187,6 +187,19 @@ Here are some scenarios addressed in the flow chart:
3. There might be special tokens used in the chat template which the tokenizer is unaware of, for example `<|start_of_role|>`, which can cause issues during tokenization as it might not be treated as a single token.
#### Add Special Tokens
Working with multi-turn chat data might require the tokenizer to use a few new control tokens (e.g. `<|assistant|>`, `[SYS]`) as described in the guidelines above. These special tokens might not be present in the tokenizer's vocabulary if the user is using a base model.
Users can pass the `--add_special_tokens` argument, which adds the required tokens to the tokenizer's vocabulary.
For example, the special tokens used in `--instruction_template`/`--response_template` can be passed as follows:
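The concrete example is truncated in this diff excerpt; a plausible invocation, assuming `<|user|>`/`<|assistant|>` are the template markers in use and that `--add_special_tokens` accepts the tokens as a space-separated list, might look like:

```
--instruction_template "<|user|>" --response_template "<|assistant|>" --add_special_tokens "<|user|>" "<|assistant|>"
```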

@@ -793,3 +806,3 @@

- `--fast_moe` is an integer value that configures the amount of expert-parallel sharding (`ep_degree`); see the example after this list.
- `world_size` must be divisible by the `ep_degree`.
- Running fast moe modifies the state dict of the model, which must be post-processed. This happens automatically, and the converted checkpoint can be found in the `hf_converted_checkpoint` folder within every saved checkpoint directory. Alternatively, the same conversion can be performed manually through the [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) script.
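A quick illustration of the divisibility constraint (the values here are assumed for illustration, not taken from the original README): on a node with 8 GPUs, `world_size` is 8, so valid `ep_degree` values are 1, 2, 4, and 8, and expert parallelism of degree 2 would be requested as:

```
--fast_moe 2
```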
docs/advanced-data-preprocessing.md: 64 additions
@@ -47,6 +47,8 @@ definitions:
```
        type: string
      seed:
        type: integer
      chat_template:
        type: string
    required:
      - type
    title: Dataprocessor
```
@@ -115,8 +117,10 @@ Users can create a data config file in any of YAML or JSON format they choose (w
`datapreprocessor`:
- `type` (optional, str): Type of data preprocessor; `default` is currently the only supported type.
- `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
- `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy in case of choosing to mix multiple datasets by weight, supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy), defaults to `all_exhausted`.
- `sampling_seed` (optional, int): [Sampling seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets; for reproducibility choose the same value, defaults to 42.
- `chat_template` (optional, str): A chat template passed via the data config for multi-turn data; it replaces the tokenizer's existing default chat template (see the sketch after this list).
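A minimal sketch of supplying a custom template through the data config; the template string here is purely illustrative, not a working template:

```
dataprocessor:
  type: default
  chat_template: "{% for message in messages %}...{% endfor %}"
```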
`datasets` (list):
122
126
- `name` (optional, str): A unique identifier for the dataset.
@@ -229,6 +233,8 @@ This library currently supports the following [preexisting data handlers](https:
Uses a tokenizer's chat template to preprocess dataset elements; good for single- and multi-turn chat templates.
- `duplicate_columns`:
Duplicates one column of the dataset to another column.
- `tokenize`:
Tokenizes one column of the dataset, passed as the input `dataset_text_field` (see the sketch below).
These handlers can be requested by name, and users can look up the function arguments [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py).
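For instance, the `tokenize` handler might be requested in a dataset's `data_handlers` list as below. This is a sketch modeled on the handler invocation format shown in the EPT example later; `"text"` is an illustrative column name, and the exact argument names should be checked against the handler source linked above:

```
data_handlers:
  - name: tokenize
    arguments:
      batched: false
      fn_kwargs:
        dataset_text_field: "text"
```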
@@ -251,6 +257,64 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2
`Note: If a user specifies data sampling, they can expect the datasets to be mixed and individual samples in the dataset to not be broken, unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
### Data Streaming
Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use cases like extended pre-training.
Users can use streaming by setting `streaming` to `true` in the `datapreprocessor` config. This top-level variable must be set for all datasets in the config and cannot differ from dataset to dataset. When `streaming` is `true`, the dataset is loaded as an `IterableDataset` ([docs](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset)) instead of a regular `Dataset`; this means the dataset is loaded chunk by chunk rather than all at once and is processed lazily. For more details on the differences, see the [HF Blog](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable).
264
+
265
+
In a data config this looks like (see [ept document](./ept.md#large-non-tokenized-dataset) for a more in-depth example):
```
dataprocessor:
  type: default
  streaming: true
```
When using streaming, `split_batches` in the `TrainingArguments` will automatically be set to `True`; the main process will then fetch a full batch and slice it into `num_processes` smaller batches, one for each process. This means that the batch size must be divisible by `num_processes`, and the configured batch size is treated as the global batch size rather than a per-device one. For example, with a global batch size of 32 and 8 processes, each process receives 4 samples per step.
**When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk by chunk, data cannot run through epochs in a typical fashion, as the **Trainer** cannot know the length of the dataset as it is being passed through. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` will overwrite `num_train_epochs`, since `max_steps` directly specifies the total number of optimization steps, which is needed when the dataset length cannot be known.
275
+
276
+
If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
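For example (an illustrative calculation): a dataset of 1,000,000 samples trained with a global batch size of 64 gives `max_steps = 1000000 / 64 = 15625` for a single pass over the data; multiply by the desired number of passes to emulate multiple epochs.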
### Example data configs.
We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)
## Offline Data preprocessing
[This script](../scripts/offline_data_processing.py) provides the capability for users to perform standalone data preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and saves the train and validation datasets (in shards, if `--num_dataset_shards` is passed) in parquet format inside the specified `output_dir`.
A data config YAML file can be used to pass configuration to this script. Example command to run this script:
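The example command itself is truncated in this diff. A plausible invocation, assuming the script accepts the same `--data_config` flag used elsewhere in this document along with the `--output_dir` and `--num_dataset_shards` options mentioned above, might look like:

```
python scripts/offline_data_processing.py \
    --data_config <path to the data config> \
    --output_dir <output dir> \
    --num_dataset_shards 4
```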
docs/ept.md: 29 additions
@@ -107,6 +107,35 @@ Here also the command line arguments would be
The code again would add `EOS_TOKEN` to the non-tokenized data before using it; note also that the `dataset_text_field` is assumed to be the same across all datasets for now.
### Large Non-Tokenized Dataset
Let's say you have a large JSONL data file that cannot fit into memory at once and you want to perform EPT on it. You can use the streaming feature to efficiently load and process the data in chunks; to enable streaming, you can define a data config as follows.
Sample data config for the above use case:
```
dataprocessor:
  type: default
  streaming: true
datasets:
  - name: non_tokenized_text_dataset
    data_paths:
      - "<path-to-the-jsonl-dataset>"
    data_handlers:
      - name: add_tokenizer_eos_token
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
```
The command-line arguments passed to the library should include the following:
```
--data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
```
Please note that when using streaming, the user must pass `max_steps` instead of `num_train_epochs`. See the advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.
### Additional Information
This feature is supported post [v2.3.1](https://github.com/foundation-model-stack/fms-hf-tuning/releases/tag/v2.3.1) of this library.