Commit d563c20: update adr (1 parent: 326b644)

1 file changed: +156 −60

architecture_records/004-datapreprocessor.md
@@ -20,11 +20,6 @@
## Summary and Objective

The primary objective of the `DataPreProcessor` design for fms-hf-tuning is to provide a unified yet powerful interface for handling diverse data formats and configurations.
This interface should cater to various user expertise levels, enabling basic users to easily load and process data, while allowing advanced users to customize their data handling through pre-defined configuration files.

@@ -37,75 +32,68 @@ This interface should cater to various user expertise levels, enabling basic use
### Motivation

The main motivation for this ADR stems from the fact that fms-hf-tuning is being used by many teams for a diverse set of use cases which are not currently supported in the library.

For data preprocessing, the library currently takes two primary arguments, `training_data_path` and `validation_data_path`, each of which accepts a single file location for a dataset. The current library supports only `JSON` data, but can handle both pre-tokenised and non-tokenised data by performing `input` masking and custom data formatting.

The first motivation for a change is the requirement from users for multiple datasets, and even multiple data files within a dataset. There are also teams training on Parquet- and Arrow-format data, so support for additional data formats is required in the code.
A further requirement from teams is a way to interleave datasets at run time by specifying static weights to mix different datasets, which is also not yet supported by the code.
Finally, other requirements are preprocessing support for multiple modalities of data (starting with images) and support for advanced preprocessing like Jinja-based template rendering of the dataset before consumption.

All these requirements are new and currently unsupported by the library, which motivated us to propose a change in the design of data preprocessing in this library to incorporate these, and potentially any new requirements, in one go.

### User Benefit

Users will benefit from the additional argument which allows them to pass a single `data_config` file specifying how to preprocess their dataset.
Our data config file will give users the ability to:
1. Pass multiple data files and multiple datasets.
1. Specify static weights in the configuration to interleave datasets.
1. Define which preprocessing routines to apply on the data, and in which order.

This will make the process of handling custom datasets, which might require rendering a Jinja template or processing image data, much easier.

We do not require users to learn the specification of the additional `data_config` file, as the existing arguments to process datasets present in `tuning.config.configs.DataArguments` will not be deprecated in this version, and users can keep using the same data arguments for the use cases currently served by the library.

## Decision

Some terminology before we move ahead:

<table>
<tr>
<th style="border: 1px solid black; padding: 5px;">User Persona</th>
<th style="border: 1px solid black; padding: 5px;">Description</th>
</tr>
<tr>
<td style="border: 1px solid black; padding: 5px;">Simple User</td>
<td style="border: 1px solid black; padding: 5px;">A user who uses this library to train models using a single dataset, passing it via a single command line argument.</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 5px;">Advanced User</td>
<td style="border: 1px solid black; padding: 5px;">A user with a deep understanding of datasets, who knows how to apply specific preprocessing and mixing techniques during training.</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 5px;">Intermediate User</td>
<td style="border: 1px solid black; padding: 5px;">A user who works with custom datasets but lacks full knowledge of data processing, relying on advanced users for guidance to fulfill their use case.</td>
</tr>
</table>

Please note that most users of the product here would fall into the simple user category, while advanced and intermediate users are researchers looking to use our library for a diverse set of use cases.

### Our considerations for the design here are

1. Allow advanced users to use the full power of the Hugging Face library as much as possible without recreating it.
1. Allow advanced users to specify a custom data preprocessing pipeline in an easy way.
1. Ensure the single design can handle these and many more use cases without major changes.
1. Design for advanced users while simplifying for simple users.

We propose to allow advanced users to specify a full spec which exposes the data preprocessing API provided by the HF library directly to them, so that they can fully utilise the interface.

The proposed input spec, which the user specifies as `data_config`, for passing information for such preprocessing is:

```
datapreprocessor:
  type: default
datasets:
  - name: dataset1
    sampling:
@@ -121,7 +109,7 @@
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
  - name: dataset2
    sampling:
      ratio: 0.4
    data_paths:
@@ -135,26 +123,135 @@
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
```
To reiterate, our goal here is not to reimplement the functionality provided by Hugging Face, but rather to have a clean interface using a config where advanced users can use advanced HF features like iterable datasets or interleaving datasets, and perform custom preprocessing like applying Jinja templates.

In this spec, at the top level we have the `datapreprocessor` config, which contains just one field, `type`, set to `default`. This is done to ensure any future top-level `datapreprocessor` configs will go into this block. Users need not touch or provide this, as the `default` is automatically selected.

The second block is where users list multiple `datasets`, and each dataset contains information on how to process it. We allow arguments like `sampling` for users to specify sampling ratios to be used when interleaving datasets via Hugging Face APIs like `interleave_datasets`.
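The weighted mixing described above can be sketched in plain Python. This is illustrative only: the real implementation would rely on Hugging Face's `interleave_datasets` with its `probabilities` argument, and the `interleave` function and list-based "datasets" below are hypothetical stand-ins.

```
import itertools
import random

def interleave(sources, ratios, n, seed=0):
    # Draw n examples, picking the source of each draw according to the static
    # weights in `ratios` (a stand-in for interleave_datasets(probabilities=...)).
    rng = random.Random(seed)
    iters = [itertools.cycle(src) for src in sources]
    indices = list(range(len(sources)))
    return [next(iters[rng.choices(indices, weights=ratios, k=1)[0]])
            for _ in range(n)]

mixed = interleave([["a1", "a2"], ["b1", "b2"]], ratios=[0.6, 0.4], n=10)
print(len(mixed))  # 10
```

The key design point is that the ratios are static and declared in the config, so the mixing is reproducible for a fixed seed.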
The most powerful feature of this block is `data_handlers`. Here we allow users to specify a list of routines to apply to the dataset at preprocessing time. A `data_handler` is a `map`{ref HFmaps} operation performed on the dataset, to which a user can further pass informational arguments. We expose the full set of arguments of an HF `map` operation to the user as the `kwargs` of a handler.

As an example, in `dataset2` the data handler requests that a `render_template` function be applied to the dataset, which processes the dataset and renders the `jinja template` specified as `fn_kwargs.jinja_template`; the rest of the arguments, like `remove_columns` and `batched`, are just HF `map` API arguments.

```
- name: dataset1
  sampling:
    ratio: 0.4
  data_paths:
    - /data/stackoverflow-kubectl_posts
    - /data/stackoverflow-kubernetes_posts
    - /data/stackoverflow-openshift_posts
  data_handlers:
    - name: render_template
      arguments:
        remove_columns: all
        batched: false
        fn_kwargs:
          jinja_template: "{<jinja-template>}"
```
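To make the handler mechanics concrete, here is a minimal, dependency-free sketch of how a `data_handlers` entry could be dispatched. The handler registry, the `render_template` stand-in, and the use of `str.format` in place of real Jinja rendering are all illustrative assumptions, not the library's implementation; with a real HF `Dataset`, the loop body would be a `dataset.map(fn, **arguments)` call.

```
def render_template(example, template):
    # Stand-in for Jinja rendering: format the template with the record's fields.
    return {"text": template.format(**example)}

# Handlers are looked up by the `name` field of each data_handlers entry.
HANDLERS = {"render_template": render_template}

def apply_handlers(records, handler_specs):
    # Apply each configured handler, in the listed order, to a list-of-dicts
    # "dataset"; fn_kwargs from the spec are forwarded to the handler function.
    for spec in handler_specs:
        fn = HANDLERS[spec["name"]]
        fn_kwargs = spec["arguments"].get("fn_kwargs", {})
        records = [fn(rec, **fn_kwargs) for rec in records]
    return records

specs = [{"name": "render_template",
          "arguments": {"batched": False,
                        "fn_kwargs": {"template": "Q: {question} A: {answer}"}}}]
out = apply_handlers([{"question": "q1", "answer": "a1"}], specs)
print(out)  # [{'text': 'Q: q1 A: a1'}]
```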

By allowing users to specify data handlers like this, we let them use the full Hugging Face API while also specifying preprocessing routines in a fixed order. The handlers list specifies a `DAG` of operations to apply to the dataset, and the code will execute them in that order.

Furthermore, this design has the flexibility to be extended to any upcoming use case, because any operation to be executed on the dataset can be broken down into function calls implemented as data handlers.

This makes our spec a complete solution for advanced users of the library, allowing them to specify complete preprocessing operations to be applied to the dataset via a config file.

Finally, with this spec we do not want to break functionality for the simple users of the library. A simple user who just wants to use the library with a single dataset, as today, can pass the same dataset via the

```
--training_data_path <file> --validation_data_path <file>
```

arguments. In fact, we do not change the behaviour currently supported by any of the `tuning.config.configs.DataArguments` arguments, allowing simple users of the library to continue using it as is.
169+
### Alternatives Considered

1. Letting users process their own data and pass file(s) directly to this library.

   A simple alternative to avoid all this is to have users process their own data; this is also in line with the fact that most workloads contain preprocessed data which simple users consume as is for their tuning/training.

   However, many users coming to this library have an advanced set of use cases. Researchers looking to use this library want features like `jinja template` rendering, image data processing, and mixing and merging of datasets. While this can be done at the user level, most users do not want to write code for all this preprocessing, but rather use tools which implement these tasks for them.
   At the same time, doing it at the user level leads to code duplication across many teams, which is something we want to avoid.

   More importantly, as stated in the motivation, we see ever-increasing demand from users who want to use this library directly with their dataset and have a quick round trip for testing. This design allows users to specify simple parameters in the config and test complex use cases easily.

1. Passing all datasets to the Hugging Face `SFTTrainer` API and letting it handle them without preprocessing at our end.

   Another alternative is to take the `dataset` input to this library and pass it directly to the trainer (`SFTTrainer` in our case), letting it handle loading and preprocessing the dataset.

   [SFTTrainer](https://huggingface.co/docs/trl/v0.12.1/en/sft_trainer#trl.SFTTrainer) supports iterable datasets for both `train_dataset` and `eval_dataset`, so we could ideally pass in a large dataset to be consumed via streaming.

   Please note that even in this case users will need to tell us that the dataset is large and should be loaded with `streaming=True`, because the argument which tells HF to load the dataset in iterable or standard mode is passed to [`load_dataset`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/loading_methods#datasets.load_dataset):

   ```
   from datasets import load_dataset
   train_ds = load_dataset('imdb', split='train', streaming=True)
   ```

   Additionally, `SFTTrainer` has support for a [data formatting function](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). Users can pass a formatting function directly to `SFTTrainer`, which formats the dataset for them:

   ```
   def formatting_prompts_func(example):
       output_texts = []
       for i in range(len(example['question'])):
           text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
           output_texts.append(text)
       return output_texts

   trainer = SFTTrainer(
       model,
       args=training_args,
       train_dataset=dataset,
       formatting_func=formatting_prompts_func,
   )

   trainer.train()
   ```

   Taken from the [HuggingFace docs](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support).

   As our library is a wrapper on top of HF, we cannot directly allow users to pass a custom formatting function. However, our `data_handler` design can also support formatting the dataset akin to a `formatting function`, where users specify just the name of a handler and we apply the formatting on our end. The `data_handler` design we have is a superset of this feature, more flexible and able to support many more use cases.

## Consequences

### Arguments Required

In this design, apart from the `data_config` spec, users will also need to pass the `--response_template` argument. This is because the `DataCollator` functionality of this library is not touched by our design.

We also plan to add a new argument to `tuning.config.configs.DataArguments` which takes in the `data_config` file as input, like:
```
@dataclass
class DataArguments:
    ...
    data_config_file: str = field(
        default=None,
        metadata={
            "help": "data_config file which specifies the data preprocessing logic to apply. \
                Supports both JSON and YAML based config files."
        },
    )
```
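For illustration, a minimal `data_config` file passed via this argument might look like the following. It uses only field names already shown in the spec above; the dataset name and path are placeholders.

```
datapreprocessor:
  type: default
datasets:
  - name: my_dataset
    data_paths:
      - /data/train.jsonl
```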

### Understanding the spec

With this design we have tried to keep our spec simple and as close to the HF library as possible, e.g. exposing the same `map` `kwargs` that HF has in our `data_handlers`.

Despite this, advanced users will need to understand the spec and be able to write it properly for it to be processed.
Furthermore, advanced users will need to educate themselves on the data handlers already present in the code. Since data handlers are selected by name, we need to ensure the documentation contains complete information on which data handlers are available and how to use them in the `data_config`.

### Sharing config files

We currently do not propose anything on how advanced users share the `data_config` files they create with intermediate and simple users.

### Simple User Perspective

As mentioned above, we are retaining the full functionality supported by `tuning.config.configs.DataArguments`, which means simple users can continue using the library by passing a single dataset via `--training_data_path` and use-case-specific arguments like `--data_formatter_template` as they please; the code will internally handle how to map these to the `data_config` spec.

### Intermediate User Perspective

Our perspective is that advanced users will create config files for data preprocessing, and intermediate users can take these existing configs and modify them according to their preferences to get the desired result.


## Detailed Design

### The proposed design to implement support for this spec is as follows

@@ -186,10 +283,9 @@ class DataPreProcessor(ABC):

At the top level we propose to have this `class DataPreProcessor`, which is an abstract class and requires functions to process the data config proposed above.

The data preprocessor needs to support custom data handlers. For simple use cases, the library will provide predefined data handlers, which need to be registered with the top-level class using the call `DataPreProcessor.register_data_handler`.

The simple use cases will be handled using these data handlers, and which data handler to choose will depend on the use case chosen from the data args (same as the current code).
195291
## How are handlers provided and registered -
