architecture_records/004-datapreprocessor.md (+21 / -20 lines changed)
@@ -32,7 +32,7 @@ This interface should cater to various user expertise levels, enabling basic use
### Motivation

The main motivation for this ADR stems from the fact that fms-hf-tuning is being used by many teams for a diverse set of use cases which are not currently supported in the library. To be precise, for data preprocessing the library currently takes two primary arguments, `training_data_path` and `validation_data_path`, each of which accepts a single file location for a dataset.

A user can currently pass in

1. a pretokenized json(l) dataset via
@@ -55,7 +55,7 @@ The first motivation for a change is the requirements from users asking for diff
Also, use cases from teams require multiple datasets and even multiple data files in a dataset.

A further requirement from teams is to have a way to interleave datasets at run time by specifying static weights to mix different datasets, which is also not supported by the code yet.

Finally, other requirements are to have preprocessing support for multiple modalities of data (starting with images first) and to support advanced preprocessing like Jinja-based template rendering of the dataset before consumption.
@@ -133,7 +133,7 @@ Please note that most of the users of product here would fall into the simple us
1. Ensure the single design can handle these and many more use cases without major changes.
1. Design for advanced users while simplifying for simple users.

We propose to allow advanced users to specify a full spec which exposes the data preprocessing API provided by the HF library directly to them, so that they can fully utilize the interface.

The proposed input spec, which the user specifies as `data_config` to describe how such preprocessing should be performed, is
@@ -173,7 +173,7 @@ datasets:
batched: false
```

To iterate again, our goal here is not to re-implement the functionality provided by HuggingFace but rather to have a clean, config-driven interface where advanced users can use things like iterable datasets or interleaved datasets and perform custom preprocessing, such as applying Jinja templates, in an easy way.

In this spec, at the top level we have the `Dataprocessor` config, which contains just one field, `type`, set to `default`. This is done to ensure any future top-level `dataprocessor` configs will go into this block. Users need not touch or provide this, as the `default` is automatically selected.
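For orientation, the parsed form of a config containing only this top-level block might look like the Python dict below; apart from `dataprocessor.type` and the `datasets` key, the field names are assumptions used purely for illustration.

```
# A sketch of the parsed form of a minimal data_config.
# Only `dataprocessor.type` and the top-level `datasets` key come from the ADR text;
# the dataset entry fields below are illustrative assumptions.
minimal_data_config = {
    "dataprocessor": {"type": "default"},
    "datasets": [
        {
            "name": "my_dataset",
            "data_paths": ["/path/to/train.jsonl"],
        }
    ],
}
```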
@@ -208,23 +208,23 @@ By allowing the users to specify data handlers like this we allow them to use fu
Furthermore, this design has the flexibility to be extended to any upcoming use case, because any operation to be executed on the dataset can be broken down into function executions implemented as data handlers.

This makes our spec a complete solution for advanced users of the library who have custom preprocessing needs, allowing them to specify complete preprocessing operations to be applied to the dataset via a config file.
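For instance, a data handler can be an ordinary function applied through the HF `map` API. The sketch below renders a Jinja template over each record; the handler, template, and column names are illustrative assumptions rather than the library's actual handlers.

```
from datasets import Dataset
from jinja2 import Template

def render_jinja_template(example, template_str):
    # Illustrative handler: add a text column rendered from a Jinja template.
    example["formatted_text"] = Template(template_str).render(**example)
    return example

ds = Dataset.from_list([{"question": "What is fms-hf-tuning?", "answer": "A tuning library."}])
ds = ds.map(
    render_jinja_template,
    fn_kwargs={"template_str": "### Question: {{ question }}\n### Answer: {{ answer }}"},
)
```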
Finally, with this spec we do not want to break the functionality for the simple users of the library. A simple user who wants to just use the library with a single dataset, like today, can pass the same dataset via the `--training_data_path <file>` and `--validation_data_path <file>` arguments.

In fact, we do not change the behavior currently supported by any of the `tuning.config.configs.DataArguments` arguments, hence allowing the simple users of the library to continue using it as is.

### Performance Considerations
Since this design allows complex preprocessing of the dataset on the fly, the design should incorporate performance measures to ensure that the system does not become too slow or spend so much time preprocessing the dataset that it affects tuning/training time.

The goal we have here is to not be slower than the HuggingFace library which our whole design is based upon; in this sense we also expect any performance improvements we come across to be contributed back to the HF library, to keep our design simple and avoid re-implementing functionality.

#### Handling Large Dataset

Our main reason for using HF [Map](https://huggingface.co/docs/datasets/en/process#map) heavily for data preprocessing is that for large datasets, which are generally loaded as `IterableDatasets`, the map API automatically performs [`lazy map operations`](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable#eager-data-processing-and-lazy-data-processing) and hence doesn't produce too much overhead while training.
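A small sketch of this behavior, with a placeholder data file and column name: loading with `streaming=True` yields an `IterableDataset`, and the subsequent `map` runs lazily as examples are consumed during training.

```
from datasets import load_dataset

# Streaming load returns an IterableDataset; nothing is read eagerly.
# "train.jsonl" and the "text" column are placeholders for this sketch.
ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)

# The map is lazy: the function runs only as examples are iterated during training.
ds = ds.map(lambda example: {"n_chars": len(example["text"])})
```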
#### Caching intermediate dataset

Hugging Face caches intermediate map operations, which makes replay of our data preprocessor easier if the same map parameters and operations are applied. If the file system is an issue we have two considerations,
@@ -244,20 +244,20 @@ Leaving all users to write their own preprocessing logic can also lead to code d
More importantly, as stated in the motivation, we are getting ever-increasing demand from users who want to use this library directly with their dataset and have a quick round trip for testing. This design allows users to specify simple parameters in the config and test complex use cases easily.

### Passing all datasets we take to the HuggingFace SFTTrainer API and let it handle them without preprocessing at our end.
Another alternative we have is to take the `dataset` input to this library and pass it directly to the trainer, `SFTTrainer` in our case, and let it handle loading and preprocessing the dataset.

[SFTTrainer](https://huggingface.co/docs/trl/v0.12.1/en/sft_trainer#trl.SFTTrainer) supports specifying `train_dataset` and `eval_dataset`, and for both of them it supports iterable datasets along with normal datasets, allowing us to pass a large dataset via streaming.

Please note that even in this case users will need to tell us that the dataset is large and is to be loaded via `streaming=True`, because the argument which tells HF to load the dataset in iterable mode or standard mode is passed to [`load_dataset`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/loading_methods#datasets.load_dataset).
Additionally, `SFTTrainer` has support for a [data formatting function](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). Users can pass a `formatting_function` directly to `SFTTrainer`, which formats the dataset for them:
```
def formatting_prompts_func(example):
@@ -278,14 +278,15 @@ trainer.train()
```
Taken from [HuggingFace docs](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support)
As our library is a wrapper on top of HF, we cannot directly allow users to pass a custom formatting function; however, our `data_handler` design can support formatting the dataset in a similar way to a formatting function, where users specify just the name of the handler and we apply the formatting on our end. The design for `data_handler` that we have is a superset of this feature, which is more flexible and can support many more use cases.
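A rough sketch of how named handlers could be resolved and applied is shown below; the registry, handler name, and config keys are assumptions for illustration, not the library's actual API.

```
from datasets import Dataset

def add_eos(example, eos_token="</s>"):
    # Illustrative handler: append an end-of-sequence marker to the text column.
    example["text"] = example["text"] + eos_token
    return example

# Hypothetical registry mapping handler names, as they would appear in a
# data_config, to plain functions applied via the HF map API.
HANDLER_REGISTRY = {"add_eos": add_eos}

def apply_handlers(dataset, handler_specs):
    for spec in handler_specs:
        handler = HANDLER_REGISTRY[spec["name"]]
        dataset = dataset.map(
            handler,
            fn_kwargs=spec.get("fn_kwargs", {}),
            batched=spec.get("batched", False),
        )
    return dataset

ds = Dataset.from_list([{"text": "hello world"}])
ds = apply_handlers(ds, [{"name": "add_eos", "fn_kwargs": {"eos_token": "<|end|>"}}])
```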
## Consequences

### Arguments Required

In this design, apart from the `data_config` spec, users will also need to pass the `--response_template` argument. This is because the `DataCollator` functionality of this library is not being touched by our design.

Also, users who process a JSON dataset via our interface need to specify `--dataset_text_field`, which is inferred from the `DataArguments` for now and not passed inside the `data_config`, to ensure the simple interface remains the same.

We also plan to add a new argument to `tuning.config.configs.DataArguments` which takes in the `data_config` file as input.
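A rough sketch of what that argument could look like is below; the field name `data_config_path` and its help text are assumptions, not the final interface.

```
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataArguments:
    # Existing arguments such as training_data_path / validation_data_path stay unchanged.
    # The field below is a hypothetical sketch of the planned data_config argument.
    data_config_path: Optional[str] = field(
        default=None,
        metadata={"help": "Path to a data_config file describing datasets, sampling and data handlers."},
    )
```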
@@ -380,7 +381,7 @@ can be implemented as data handlers.
### Interleaving datasets

In case of multiple datasets the user can request how the datasets are to be interleaved.
The probabilities specified by users in the config `sampling.ratio` can be collected from individual datasets and passed to HF's `interleave_datasets` API.
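A minimal sketch of that API, with made-up datasets and ratios:

```
from datasets import Dataset, interleave_datasets

ds_a = Dataset.from_dict({"text": ["a1", "a2", "a3"]})
ds_b = Dataset.from_dict({"text": ["b1", "b2", "b3"]})

# Mix the datasets with static weights, e.g. sampling.ratio values of 0.7 and 0.3.
mixed = interleave_datasets([ds_a, ds_b], probabilities=[0.7, 0.3], seed=42)
```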
@@ -394,23 +395,23 @@ Further important thing to note is in case of HF, if we use hugging face the `ma
Data collators, specifically for TRL use cases like chat-based interactions which apply chat templates and proper attention masking on the tokenized data, as in the case of `DataCollatorForCompletionOnlyLM`, handle a specific piece of functionality on the data.

In this design our approach is to pass data collators from the Hugging Face API directly to `SFTTrainer`.

In the current code path, collators are collected by the `get_data_collator` functionality and passed to `SFTTrainer`. We can retain the same functionality and keep the design simpler.

The job of the data preprocessor is to provide a single interface over multiple datasets in the config; keeping a collator like this means the collator stays the same across all datasets, but it keeps the design simpler.
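As a sketch of that hand-off, using TRL v0.12-style arguments; the model, toy dataset, and response template below are placeholders.

```
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

model_name = "facebook/opt-125m"  # small model purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

train_dataset = Dataset.from_list(
    [{"text": "### Question: What is 2+2?\n### Answer: 4"}]
)

# The completion-only collator masks everything before the response template,
# so the loss is computed only on the answer tokens.
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="/tmp/sft-out", dataset_text_field="text", max_seq_length=128),
    train_dataset=train_dataset,
    data_collator=collator,
)
# trainer.train() would then run tuning with completion-only loss.
```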
## Handling Multi Modal Data.

HF does provide support for handling [image datasets](https://huggingface.co/docs/datasets/en/image_process) and [audio datasets](https://huggingface.co/docs/datasets/en/audio_load), which can be utilized by us in our HF datapreprocessor.

The functionality listed by HF for implementing the use of image and audio datasets is `map`-based functions to perform resize, encoding and other such operations on the dataset (see the links above).
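For illustration, a `map`-based resize using the HF `datasets` image support might look like the following; the public "beans" dataset and the target size are placeholders chosen for the sketch.

```
from datasets import load_dataset

# "beans" is a small public image dataset used purely for illustration.
ds = load_dataset("beans", split="train")

def resize_images(example):
    # The Image() feature decodes the column to a PIL.Image before map sees it.
    example["image"] = example["image"].resize((224, 224))
    return example

ds = ds.map(resize_images)
```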
This means the image and audio multi modal datasets will be compatible with our data handler routines. Once we implement the data handler routine processing, we will allow users to train with multi modal datasets too.

# Implementing stages.
1. Stage 1:
    * Refactoring the code in `fms-hf-tuning` into the abstract data class and adding support for preliminary data handling routines.
    This will automatically enable support for multi modal data, which is our priority.
    Note that at this stage it might be wise to have two side-by-side implementations, i.e. not deleting the existing implementation.