
Commit 8710f68

add first cut dataloader v2 ADR
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

# Data Pre Processor Design For fms-hf-tuning

**Decider(s)**: Sukriti Sharma (sukriti.sharma4@ibm.com), Will Johnson (Will.Johnson@ibm.com), Abhishek Maurya (maurya.abhishek@ibm.com), Yu Chin Fabian Lim (flim@sg.ibm.com), Dushyant Behl (dushyantbehl@in.ibm.com), Ashok Pon Kumar (ashokponkumar@in.ibm.com)

**Date (YYYY-MM-DD)**: 2024-03-06

**Obsoletes ADRs**: NA

**Modified By ADRs**: NA

**Relevant Issues**: [1]

- [Summary and Objective](#summary-and-objective)
  - [Motivation](#motivation)
  - [User Benefit](#user-benefit)
- [Decision](#decision)
  - [Alternatives Considered](#alternatives-considered)
- [Consequences](#consequences)
- [Detailed Design](#detailed-design)

## Summary and Objective

The motivation for this data preprocessor design for fms-hf-tuning is to have a unified interface which supports many kinds of data formats, streaming and non-streaming data, weight-based data mixing, and more:

1. Support for different formats of data → Arrow, Parquet, CSV, etc.
1. Support for multiple files in a dataset.
1. Support for multiple datasets.
1. Support for different modalities of data → images, audio, etc.
1. Support for mixing datasets based on static weights.
1. Support for streaming datasets (Iterable Datasets in HuggingFace).
1. Support for chat template based masking.
1. Tool usage, for example applying Jinja templates to datasets which require preprocessing of the data.

### Motivation

The current data processing design in this library covers predefined use cases through a simple interface.
While the simple interface provides ease of use over the Hugging Face APIs, it doesn't give users any way to design custom data preprocessing, e.g. in the case of multi-modal support.
The motivation for this design is to implement the data preprocessor in a flexible way and expose a powerful API
which allows users custom processing of the data however they want, while retaining the simplicity of the existing interface.

### User Benefit

### Simple User Perspective

For simple users of the library we want to retain the same functionality wherever possible, i.e.
allow users to pass in a single data file and perform simple preprocessing.
This means retaining the arguments the library currently accepts and ensuring that the appropriate
processing required on the data is handled internally.

### Advanced User Perspective

For advanced users we want to open up a new argument to our library, `data_config_file`, which takes as input
a data preprocessing config file specifying what preprocessing to apply to the data and in what order.
Here our goal is not to reimplement the functionality provided by Hugging Face but rather to have a clean,
config-based interface where advanced users can use advanced HF functions like splitting a dataset or perform custom
preprocessing like applying Jinja templates.

The input spec through which the user specifies such preprocessing is as follows:

```yaml
datapreprocessor:
  streaming: true
datasets:
  - name: dataset1
    sampling:
      ratio: 0.3
    data_paths:
      - /data/stackoverflow-kubectl_posts
      - /data/stackoverflow-kubernetes_posts
      - /data/stackoverflow-openshift_posts
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
  - name: dataset1
    sampling:
      ratio: 0.4
    data_paths:
      - /data/stackoverflow-kubectl_posts
      - /data/stackoverflow-kubernetes_posts
      - /data/stackoverflow-openshift_posts
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
  - name: dataset2
    sampling:
      ratio: 0.3
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          remove_columns: all
          batched: false
    data_paths:
      - /data/stackoverflow-kubectl_posts.jsonl
      - /data/stackoverflow-kubernetes_posts.jsonl
```

### Intermediate User Perspective

Our expectation is that advanced users will create config files for data preprocessing, and that intermediate users can take these existing configs and modify them according to their preferences to get the desired result.

## Detailed Design

### The proposed design to implement support for this spec is as follows

Config representation in code:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DataHandlerConfig:
    name: str
    arguments: Optional[Dict] = None

@dataclass
class DatasetConfig:
    name: str
    data_paths: List[str]
    sampling: Optional[Dict] = None
    splitter_arguments: Optional[Dict] = None
    data_handlers: Optional[List[DataHandlerConfig]] = None

@dataclass
class DataPreProcessorConfig:
    streaming: Optional[bool] = None

@dataclass
class DataConfig:
    datapreprocessor: Optional[DataPreProcessorConfig]
    datasets: List[DatasetConfig]
```
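
As an illustration, a minimal sketch of how a YAML config file could be parsed and verified into these dataclasses; `load_data_config` is a hypothetical helper name (not part of the spec) and PyYAML is assumed to be available:

```python
# Minimal illustrative sketch, not a finalized implementation:
# parse a YAML data config into the dataclasses above.
import yaml

def load_data_config(path: str) -> DataConfig:
    with open(path, "r") as f:
        raw = yaml.safe_load(f)

    datasets = []
    for d in raw.get("datasets", []):
        handlers = [
            DataHandlerConfig(name=h["name"], arguments=h.get("arguments"))
            for h in d.get("data_handlers", [])
        ]
        datasets.append(
            DatasetConfig(
                name=d["name"],
                data_paths=d.get("data_paths", []),
                sampling=d.get("sampling"),
                splitter_arguments=d.get("splitter_arguments"),
                data_handlers=handlers,
            )
        )

    pre = raw.get("datapreprocessor") or {}
    return DataConfig(
        datapreprocessor=DataPreProcessorConfig(streaming=pre.get("streaming")),
        datasets=datasets,
    )
```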

The Data Pre Processor abstract class:

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict

class DataPreProcessor(ABC):

    tokenizer = None
    model_name_or_path = None
    block_size = None
    data_config: DataConfig = None
    data_handlers: Dict[str, Callable] = None

    def __init__(self, dataconfig: DataConfig, tokenizer, model_name_or_path, block_size):
        self.data_config = dataconfig
        self.tokenizer = tokenizer
        self.model_name_or_path = model_name_or_path
        self.block_size = block_size
        self.data_handlers = {}

    def register_data_handler(self, name: str, d: Callable):
        # handlers are looked up by this name when a DataHandlerConfig references them
        self.data_handlers[name] = d

    @abstractmethod
    def process_data_config(self, data_config: DataConfig):
        pass
```

At the top level we propose to have this `DataPreProcessor` class, which is abstract
and requires implementations to process the data config proposed above.

We also propose full-length config verification code which precedes the call to
`DataPreProcessor.process_data_config`, as the function expects a valid `DataConfig` object.

The data preprocessor needs to support custom data handlers provided by users of the library,
as well as predefined handlers, both of which are registered with the top-level class via
`DataPreProcessor.register_data_handler`.

## How are handlers provided and registered

Data handlers are Python callables which are applied to single (or a few) samples of data and can perform
operations like applying a chat template, tokenizing the data, applying tools like Jinja templates, or even
encoding or decoding multi-modal formats like images/audio for processing by the model.

The abstract data preprocessor class provides a way to register a data handler against a `name`, which is a string.
The data handler config `DataHandlerConfig` taken by `execute_data_handlers` represents a DAG of data handling
routines which are to be executed on the data.

In terms of the standard HF API, you can think of these as HF processing routines, i.e. Map/Filter/Select operations.
We implement most of the routines as `map` operations; because of this, even the tokenization of data, which is done today
in fms-hf-tuning via `tuning/utils/preprocessing_utils.py::get_preprocessed_dataset`, can be retained as a data
handler which performs tokenization.
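
For illustration, such a tokenization handler could be a plain map-style function; this is a sketch, and the signature, `fn_kwargs`, and the `"text"` column name are assumptions rather than a finalized interface:

```python
# Sketch: tokenization as a map-style data handler. The signature and the
# "text" column name are illustrative assumptions, not a finalized interface.
def tokenize(sample, tokenizer=None, max_length=None, **fn_kwargs):
    # invoked per sample via Dataset.map(..., fn_kwargs={"tokenizer": ..., "max_length": ...})
    return tokenizer(sample["text"], truncation=True, max_length=max_length)
```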

The implementation is flexible enough for very advanced users to specify their own data handling routines by importing fms-hf-tuning and extending the preprocessing by calling `register_data_handler` on the preprocessor. This is left for advanced users of the library and not for simple users.

To this end, one design option is to provide users an API like the one shown in the `DataPreProcessor` class,
which they can use to register custom data handlers. In this case, however, the user needs to use `fms-hf-tuning` as
a module rather than via its `main` entry point.
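
For example, a hypothetical usage sketch; `HFDataPreProcessor` is an illustrative concrete subclass name, and `data_config`, `tokenizer`, and the model name are assumed placeholders:

```python
# Hypothetical usage sketch: a user registers their own handler by name so
# that a data config can reference it. `HFDataPreProcessor` is an
# illustrative concrete subclass; `data_config` and `tokenizer` are assumed.
def strip_whitespace(sample, **fn_kwargs):
    sample["text"] = sample["text"].strip()
    return sample

preprocessor = HFDataPreProcessor(data_config, tokenizer,
                                  model_name_or_path="some/model", block_size=2048)
preprocessor.register_data_handler("strip_whitespace", strip_whitespace)
```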

Please note that our implementation needs to support certain predefined built-in handlers like `apply_chat_template`
or `tokenize` which users can request just by name.

For example, see this implementation - https://github.ibm.com/ai4code-wisdom/platform/blob/main/modelops/modelops/train.py#L251

## Implementation of the default Data Preprocessor

The default data preprocessor, implemented as an instance of the `DataPreProcessor` class, uses HF APIs wherever possible
to minimize custom reimplementation of code.

As the data preprocessor goes through each `DatasetConfig`, it loads the different types of files via the HF `load_dataset` factory.
If a format is not supported automatically by this factory, we can look to extend it to any other type of interest via the
`Dataset.from_generator(<generator>)` functionality.

This also means that any implementation like `get_json_object`, which loads `json(l)` files and then returns a custom JSON dict,
can be implemented as a data handler.
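
A minimal sketch of these two loading paths; the file names and the generator body are placeholders:

```python
# Minimal sketch of the two loading paths; file names are placeholders.
from datasets import Dataset, load_dataset

# Formats the load_dataset factory handles natively
ds = load_dataset("parquet", data_files="train.parquet", split="train")

# Fallback for a format the factory does not support: wrap a generator
def custom_reader():
    # e.g. parse a proprietary format and yield one record at a time
    yield {"input": "hello", "output": "world"}

ds_custom = Dataset.from_generator(custom_reader)
```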

### Interleaving datasets

In the case of multiple datasets, the user can specify how the datasets are to be interleaved.
The probabilities specified by users in the config via `sampling.ratio` can be collected from the individual datasets and passed to
[`datasets.interleave_datasets`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.interleave_datasets).
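
A minimal sketch, assuming three datasets have already been loaded and using the sampling ratios from the config example above (toy data for illustration):

```python
# Minimal sketch: mix three already-loaded datasets using the sampling
# ratios from the config above (0.3 / 0.4 / 0.3). Toy data for illustration.
from datasets import Dataset, interleave_datasets

ds1 = Dataset.from_dict({"text": ["a", "b"]})
ds2 = Dataset.from_dict({"text": ["c", "d"]})
ds3 = Dataset.from_dict({"text": ["e", "f"]})

mixed = interleave_datasets([ds1, ds2, ds3], probabilities=[0.3, 0.4, 0.3], seed=42)
```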

### Streaming datasets

In HuggingFace the `streaming` argument can be handled by using `IterableDataset`s instead of standard `Dataset`s.
HF provides the same APIs, like `datasets.interleave_datasets`, over the iterable datasets as well.

A further important thing to note is that in HF, the `map` functionality which we use to implement data handling is applied in a lazy fashion on iterable datasets, meaning we don't need to handle the data handlers in a different way for streaming data. [More information on the HF page.](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable#eager-data-processing-and-lazy-data-processing)
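
A short sketch of this lazy behaviour; the file name is a placeholder:

```python
# Sketch: with streaming=True, load_dataset returns an IterableDataset and
# map() is applied lazily, sample by sample, as data is consumed.
from datasets import load_dataset

stream = load_dataset("json", data_files="train.jsonl", streaming=True, split="train")
stream = stream.map(lambda sample: {"text": sample["text"].strip()})  # lazy: nothing read yet

for i, example in enumerate(stream):  # data is only read during iteration
    print(example["text"])
    if i == 3:
        break
```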

## Handling data collators

Data collators, specifically for TRL use cases like chat-based interactions which apply chat templates and proper attention masking on the tokenized data (as in the case of `DataCollatorForCompletionOnlyLM`), handle a specific piece of functionality on the data. In this design our approach is to pass data collators from the Hugging Face API directly to `SFTTrainer`.
Retaining the current code path, the data collators are collected by the `get_data_collator` functionality and passed to `SFTTrainer`. We can retain this functionality and keep the design simpler.
The job of the data preprocessor is to provide a single interface over the multiple datasets in the config; keeping the collator like this means the same collator is used across all datasets, but it keeps the design simpler.
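
A rough sketch of this retained path; the model name is a placeholder and the exact `SFTTrainer`/collator signatures vary across TRL versions:

```python
# Rough sketch of the retained code path: one collator is built and handed
# directly to SFTTrainer. Model name is a placeholder; exact signatures
# vary across TRL versions.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("some/model")
collator = DataCollatorForCompletionOnlyLM(response_template="### Response:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model="some/model",
    train_dataset=train_dataset,  # the single dataset produced by the preprocessor
    data_collator=collator,
    tokenizer=tokenizer,
)
```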

## Simplification of code and user configuration

The flexibility provided by this design also simplifies the configuration required for various use cases.
If chat-template, chat-style data is requested, users can specify just the chat-specific data handlers and omit all configuration which is not required.
This can also simplify configuration handling in the code; for example, see the sketch below.
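
For instance, a hypothetical illustration built from the spec above (not a finalized config): a user who only wants chat-style tuning could supply just the chat handler and omit everything else:

```yaml
# Hypothetical minimal config: only the chat handler is specified; all
# unrelated options are simply omitted.
datasets:
  - name: chat_data
    data_paths:
      - /data/chat_posts.jsonl
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          remove_columns: all
          batched: false
```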

## Handling Multi Modal Data

HF provides support for handling [image datasets](https://huggingface.co/docs/datasets/en/image_process) and [audio datasets](https://huggingface.co/docs/datasets/en/audio_load), which can be utilized in our HF data preprocessor.

The functionality HF documents for image and audio datasets consists of `map`-based functions to perform resizing, encoding, and other such operations on the dataset (see the links above).

This means image and audio multi-modal datasets will be compatible with our data handler routines. Once we implement the data handler routine processing, we will allow users to train with multi-modal datasets too.
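
For illustration, a sketch following the HF image-processing docs; the `beans` dataset is just an example public image dataset:

```python
# Sketch, following the HF image-processing docs: image transforms are just
# map-style functions, so they fit the data handler model directly.
from datasets import load_dataset

ds = load_dataset("beans", split="train")  # example public image dataset

def resize_image(sample):
    sample["image"] = sample["image"].resize((224, 224))
    return sample

ds = ds.map(resize_image)
```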

### Alternatives Considered

## Consequences

### Advantages

### Impact on performance

## Implementing stages

1. Stage 1:
    * Refactoring the code in `fms-hf-tuning` into the abstract data class and adding support for preliminary data handling routines.
      This will automatically enable support for multi-modal data, which is our priority.
      Note that at this stage it might be wise to have two side-by-side implementations, i.e. not delete the existing implementation.
1. Stage 2:
    * Implementing `streaming` data, i.e. `iterable` dataset support, for the HF data preprocessor implementation.
    * Data handling support for streaming data.
1. Stage 3:
    * Identify and add any other required predefined data handlers.
    * Phase out the old implementation in favor of the new one.
