
Commit 01fecad

dushyantbehl and willmj committed
Update architecture_records/004-datapreprocessor.md

Co-authored-by: Will Johnson <[email protected]>
Signed-off-by: Dushyant Behl <[email protected]>
1 parent 21ceb10 commit 01fecad


architecture_records/004-datapreprocessor.md

Lines changed: 21 additions & 20 deletions
@@ -32,7 +32,7 @@ This interface should cater to various user expertise levels, enabling basic use

### Motivation

-The main motivation for this ADR stems from the fact that fms-hf-tuning is being used by many teams for a diverse set of use cases which are not currently supported in the library. To be precise, currently in the library for data preposssing we currently take two primary arguments `training_data_path` and `validataion_data_path` which take in a single file location for a dataset.
+The main motivation for this ADR stems from the fact that fms-hf-tuning is being used by many teams for a diverse set of use cases which are not currently supported in the library. To be precise, for data preprocessing the library currently takes two primary arguments, `training_data_path` and `validataion_data_path`, which take in a single file location for a dataset.

A user can currently pass in
1. a pretokenized json(l) dataset via
```
@@ -55,7 +55,7 @@ The first motivation for a change is the requirements from users asking for diff

Also use cases from teams require multiple datasets and even multiple data files in a dataset.

-Futher requirements from teams is to have a way to interleave datasets at run time by specifying static weights to mix different datasets which is also not supported by the code yet.
+A further requirement from teams is a way to interleave datasets at run time by specifying static weights to mix different datasets, which is also not supported by the code yet.

Finally, other requirements are to have preprocessing support for multiple modalities of data (starting with Image first) and to have support for advanced preprocessing like jinja based template rendering of the dataset before consumption.
@@ -133,7 +133,7 @@ Please note that most of the users of product here would fall into the simple us
1. Ensure the single design can handle these and many more use cases without major changes.
1. Design for Advanced users while simplifying for simple users.

-We propose to allow advanced users to specify a full spec which exposes data preprocessing API provided by the HF library directly to them to be able to fully utilise the interface.
+We propose to allow advanced users to specify a full spec which exposes the data preprocessing API provided by the HF library directly to them so they can fully utilize the interface.

The proposed input spec, which the user specifies as `data_config` to describe how such preprocessing is performed, is:
@@ -173,7 +173,7 @@ datasets:
batched: false
```

-To iterate again, here our goal is not to reimplement the functionality provided by HuggingFace but rather have a clean interface using a config where advanced users can use things like Iterable datasets or Interleaving datasets and perform custom preprocessing like applying jinja templates etc in an easy way.
+To reiterate, our goal here is not to re-implement the functionality provided by HuggingFace but rather to have a clean interface using a config where advanced users can use things like Iterable datasets or interleaved datasets and perform custom preprocessing, like applying jinja templates, in an easy way.

In this spec, at the top level we have the `Dataprocessor` config, which contains just one field, `type`, set to `default`. This is done to ensure any future top-level `dataprocessor` configs will go into this block. Users need not touch or provide this, as the `default` is automatically selected.
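
For orientation, a minimal sketch of what a full `data_config` could look like under this spec is below. Apart from `type: default`, `sampling.ratio`, `batched`, and the notion of data handlers, which this ADR describes, the remaining field names (`name`, `data_paths`, `data_handlers`, `arguments`) are illustrative assumptions, not the final spec:

```
dataprocessor:
  type: default                    # reserved top-level block; users can omit it
datasets:
  - name: dataset_a                # illustrative field names in this sketch
    data_paths:
      - /data/dataset_a/train.jsonl
    sampling:
      ratio: 0.7                   # static weight used when interleaving datasets
    data_handlers:
      - name: apply_custom_template    # a handler executed via HF Map
        arguments:
          batched: false
```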
@@ -208,23 +208,23 @@ By allowing the users to specify data handlers like this we allow them to use fu

Furthermore this design allows flexibility to be extended to any upcoming use case, because any operation to be executed on the dataset can be broken down into function executions implemented as data handlers.

-This makes our spec a complete solution for advanced users of the library allowing them to specify complete preprocessing operations to be applied to the dataset via a config file.
+This makes our spec a complete solution for advanced users of the library who have custom preprocessing needs, allowing them to specify complete preprocessing operations to be applied to the dataset via a config file.
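
As a minimal sketch of that idea (the handler and field names here are hypothetical, not the library's actual handler registry), a data handler is just a function the HF Map API can execute over a dataset:

```
from datasets import Dataset

# Hypothetical data handler: formats each record into a single "text" field.
def render_prompt(example):
    example["text"] = f"Input: {example['input']}\nOutput: {example['output']}"
    return example

ds = Dataset.from_list([{"input": "2+2", "output": "4"}])
ds = ds.map(render_prompt, batched=False)  # executed as a Map operation
```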

Finally, with this spec we do not want to break the functionality for the simple users of the library. A simple user who wants to just use the library with a single dataset like today can pass the same dataset via the `--training_data_path <file> --validataion_data_path <file>` arguments.

-Infact we do not change the behaviour currently supported by any of the `tuning.config.configs.DataArguments` arguments hence allowing the simple users of the library to continue using the library as is.
+In fact we do not change the behavior currently supported by any of the `tuning.config.configs.DataArguments` arguments, hence allowing the simple users of the library to continue using it as is.
### Performance Considerations

-Since this design allows complex preprocessing of the dataset on fly, the design should incorporate perfomrance measures to ensure that the system is not performing too slow or spending too much time while preprocessing the dataset to affect tuning/training time.
+Since this design allows complex preprocessing of the dataset on the fly, it should incorporate performance measures to ensure that the system does not run too slowly or spend so much time preprocessing the dataset that it affects tuning/training time.

The goal we have here is to not be slower than the HuggingFace library, which our whole design is based upon. In this sense, we also expect any performance improvements we come across to be contributed back to the HF library, keeping our design simple and avoiding reimplementation.

--> Handling Large Dataset
+#### Handling Large Dataset

Our main reason for using HF [Map](https://huggingface.co/docs/datasets/en/process#map) heavily for data preprocessing is that for large datasets, which are generally loaded as `IterableDatasets`, the Map API automatically performs [`lazy map operations`](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable#eager-data-processing-and-lazy-data-processing) and hence doesn't add much overhead while training.
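
A small sketch of this behavior, reusing the `imdb` example from later in this ADR: with `streaming=True` the dataset is an `IterableDataset`, and `map` is only applied as examples are consumed:

```
from datasets import load_dataset

ds = load_dataset("imdb", split="train", streaming=True)  # IterableDataset
ds = ds.map(lambda ex: {"text": ex["text"].lower()})      # recorded lazily
print(next(iter(ds))["text"][:40])  # processing happens here, on demand
```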

--> Caching intermediate dataset
+#### Caching intermediate dataset

Hugging Face caches intermediate map operations, which makes replaying our data preprocessor easier if the same map parameters and operations are applied. If the file system is an issue, we have two considerations:
@@ -244,20 +244,20 @@ Leaving all users to write their own preprocessing logic can also lead to code d

More importantly, as stated in the motivation, we are seeing ever increasing demand from users who want to use this library directly with their dataset and have a quick roundtrip for testing. This design allows users to specify simple parameters in the config and test complex use cases easily.

-### Passing all datasets we take to the Huggingface SFTTrainer api and let it handle them without preprocessing at our end.
+### Passing all datasets we take to the HuggingFace SFTTrainer API and let it handle them without preprocessing at our end.

-Another alternative we have is to take the `dataset` input to this library and pass it directly to the trainer `SFTrainer` in our case directly and let it handle loading and preprocessing the dataset.
+Another alternative is to take the `dataset` input to this library and pass it directly to the trainer, `SFTTrainer` in our case, and let it handle loading and preprocessing the dataset.

-[SFTrainer](https://huggingface.co/docs/trl/v0.12.1/en/sft_trainer#trl.SFTTrainer) supports specifying the `train_dataset` and `eval_dataset` for both of which it supports iterable datasets along with normal datasets allowing us to pass a large dataset supported via streaming.
+[SFTTrainer](https://huggingface.co/docs/trl/v0.12.1/en/sft_trainer#trl.SFTTrainer) supports specifying the `train_dataset` and `eval_dataset`, for both of which it supports Iterable datasets along with normal datasets, allowing us to pass a large dataset via streaming.

-Please not that even in this case users will need to tell us that the dataset is large and is to be loaded via `streaming=True` because the argument which tells HF to load the dataset in iterable mode or standard mode is passed to [`load_dataset`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/loading_methods#datasets.load_dataset)
+Please note that even in this case users will need to tell us that the dataset is large and is to be loaded via `streaming=True`, because the argument which tells HF to load the dataset in Iterable mode or standard mode is passed to [`load_dataset`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/loading_methods#datasets.load_dataset).

```
from datasets import load_dataset
train_ds = load_dataset('imdb', split='train', streaming=True)
```

-Additionally, `SFTrainer` has support for [data formatting function](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). Users can pass a `formatting_function` directly to `SFTtrainer` which formats the dataset for them,
+Additionally, `SFTTrainer` has support for a [data formatting function](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). Users can pass a `formatting_func` directly to `SFTTrainer` which formats the dataset for them,

```
def formatting_prompts_func(example):
@@ -278,14 +278,15 @@ trainer.train()
```
Taken from [HuggingFace docs](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support)

-As our library is a wrapper on top of HF we cannot direclty allow users to pass a custom formatting function and
+As our library is a wrapper on top of HF we cannot directly allow users to pass a custom formatting function, and
our `data_handler` design can also support formatting datasets in a similar way to the `formatting function`, where users specify just the name of the handler and we apply the formatting on our end. The design for `data_handler` that we have is a superset of this feature, which is more flexible and can support many more use cases.

## Consequences

### Arguments Required
In this design, apart from the `data_config` spec users will also need to pass the `--response_template` argument. This is because the `DataCollator` functionality of this library is not being touched by our design.
-Also users need to specify `--dataset_text_field` which is infered from the `DataArguments` for now to ensure the simple interface remains same.
+
+Also users who process a JSON dataset via our interface need to specify `--dataset_text_field`, which is inferred from the `DataArguments` for now and not passed inside the data_config, to ensure the simple interface remains the same.

We also plan to add a new argument to `tuning.config.configs.DataArguments` which takes in the `data_config` file as input, like:
```
@@ -380,7 +381,7 @@ can be implemented as data handlers.
### Interleaving datasets

In the case of multiple datasets, the user can request how the datasets are to be interleaved.
-The probabilies specified by users in the config `sampling.ratio` can be collected from individual datasets and passed to
+The probabilities specified by users in the config `sampling.ratio` can be collected from individual datasets and passed to
[`datasets.interleave_datasets`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.interleave_datasets).
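
A sketch of how the collected ratios could be forwarded; the datasets and weights here are placeholders:

```
from datasets import load_dataset, interleave_datasets

ds_a = load_dataset("imdb", split="train", streaming=True)
ds_b = load_dataset("ag_news", split="train", streaming=True)

# sampling.ratio values collected from the config become the probabilities
mixed = interleave_datasets([ds_a, ds_b], probabilities=[0.7, 0.3], seed=42)
```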
### Streaming datasets
@@ -394,23 +395,23 @@ Further important thing to note is in case of HF, if we use hugging face the `ma
394395
395396
Data collators specifically for TRL use cases like chat based interactions which apply chat templates and proper attention masking on the tokenized data like in the case of `DataCollatorForCompletionOnlyLM` handle a specific functionality on the data.
396397
397-
In this design our approach is to pass data collators from hugging face api directly to SFTTrainer.
398+
In this design our approach is to pass data collators from hugging face API directly to SFTTrainer.
398399
399400
In the current code path, collators are collected by `get_data_collator` functionality and passed to `SFTTrainer`. We can retain the same functionality and keep the design simpler.
400401
401402
The job of the data pre processor is to provide a single interface over multiple datasets in the config while keeping a collator like this means we will keep the collator same across all datasets but keeps the design simpler.
402403
403404
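
As a sketch of this unchanged code path (assuming `model`, `tokenizer`, and `train_ds` are already constructed, and with a placeholder response template), the collator built from `--response_template` is handed straight to `SFTTrainer`:

```
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Collator built from the --response_template argument, passed through as-is.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Response:", tokenizer=tokenizer
)
trainer = SFTTrainer(model=model, train_dataset=train_ds, data_collator=collator)
```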
## Handling Multi Modal Data.

-HF does provide support for handling [image datasets](https://huggingface.co/docs/datasets/en/image_process) and [audio datasets](https://huggingface.co/docs/datasets/en/audio_load) which can be utilised by us in our HF datapreprocessor.
+HF does provide support for handling [image datasets](https://huggingface.co/docs/datasets/en/image_process) and [audio datasets](https://huggingface.co/docs/datasets/en/audio_load) which can be utilized by us in our HF datapreprocessor.

The functionality listed by HF for working with image and audio datasets is `map` based functions to perform resize, encoding and other such operations on the dataset (see the links above).

This means image and audio multi modal datasets will be compatible with our data handler routines. Once we implement the data handler routine processing, we will allow users to train with multi modal datasets too.
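
For example, a minimal image-resizing handler in the map-based style HF documents (using the public `beans` image dataset as a stand-in):

```
from datasets import load_dataset

ds = load_dataset("beans", split="train")  # images decoded as PIL objects

def resize_images(batch):
    batch["image"] = [img.resize((224, 224)) for img in batch["image"]]
    return batch

ds = ds.map(resize_images, batched=True)  # map-based image preprocessing
```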
# Implementing stages.
1. Stage 1:
-* Refactoring the code in `fms-hf-tuning` into the abstract data class and adding support for preliminery data handling routines.
+* Refactoring the code in `fms-hf-tuning` into the abstract data class and adding support for preliminary data handling routines.
This will automatically enable support for multi modal data which is our priority.
Note at this stage it might be wise to have two side by side implementations, i.e. not deleting the existing implementation.
1. Stage 2:
