Please note, this document is intended for advanced users who want to customize data handler arguments and use data handler functions to perform complex operations on the data configs.

Data handlers are routines which process a dataset using [HF process frameworks](https://huggingface.co/docs/datasets/en/process), including map, filter, remove, select, and rename.

All data handler routines are registered with our data preprocessor as a `k:func` object, where `k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.

In the data config, users can request which data handler to apply by requesting the corresponding `name` with which the data handler was registered and specifying the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the data preprocessor [schema](./advanced-data-preprocessing.md#what-is-data-config-schema)), as shown below.

```yaml
datapreprocessor:
    ...
    datasets:
      - name: ...
        data_paths:
          - ...
        data_handlers:
          - name: str
            arguments:
              argument: object
              ...
              argument: object
              fn_kwargs:
                fn_kwarg: object
                ...
                fn_kwarg: object
        ...
```

Arguments to the data handlers are of two types:

Each data handler is a routine passed to an underlying HF API, so the `kwargs` supported by the underlying API can be passed via the `arguments` section of the data handler config. In our pre-existing handlers the supported underlying API is either the [HF Map API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) or the [HF Filter API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.filter). For example, users can pass `batched` through `arguments` to ensure batched processing of the data handler.

Users can also pass any number of `kwargs` arguments required for each data handling `routine` function as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside the arguments.

A typical YAML snippet where you'd specify arguments to the handlers:
```yaml
datapreprocessor:
    ...
    datasets:
      - name: my_dataset
        data_paths:
          - /path/to/my_dataset
        data_handlers:
          - name: tokenize
            arguments:
              # Additional kwargs passed directly to the underlying HF API call
              batched: false
              num_proc: 10

              fn_kwargs:
                # Any arguments specific to the tokenize handler itself
                truncation: true
                max_length: 1280
```

For example, `num_proc` and `batched` in the snippet above are passed straight to `datasets.Dataset.map(...)`, while the `truncation` and `max_length` arguments directly control how the handler performs tokenization.

For native handlers like `REMOVE`, `RENAME`, and `SELECT` (see below) you don't need to pass `fn_kwargs`; their args need to be provided directly in `arguments`, as sketched below.
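
For illustration, a minimal handler entry for a native handler (the column names here are placeholders, not part of any real dataset) might look like:

```yaml
data_handlers:
  - name: remove_columns
    arguments:
      # native handlers take their args directly under `arguments`, no fn_kwargs needed
      column_names:
        - metadata
        - id
```
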
53
+
54
+
### Default Arguments
55
+
Each data handler supports many arguments and some of them are automatically provided to the data handler via the data processor framework.
56
+
The data processor framework makes these arguments available to the data handlers via `kwargs`.
57
+
58
+
1.`tokenizer`: The `AutoTokenizer` representation of the `tokenizer_name_or_path` or from `model_name_or_path` arg passed to the library.
59
+
2.`column_names`: The names of the columns of the current dataset being processed.
60
+
61
+
**Also one special handling data preprocessor provides is to pass in `remove_columns` as `all` which will internally be translated to all column names to the `Map` of `Filter` data handler routines.**
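
As a sketch (the handler choice and `text` column name are only placeholders), forwarding `remove_columns: all` to the underlying Map call looks like:

```yaml
data_handlers:
  - name: tokenize
    arguments:
      # 'all' is expanded by the preprocessor to every column of the current dataset
      remove_columns: all
      fn_kwargs:
        text_column_name: text
```
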

## Preexisting data handlers
This library currently supports the following preexisting data handlers. These handlers can be requested by their registered name, and users can look up the function args in the [data handlers source code](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py):

### `tokenize_and_apply_input_masking`:
Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
By default this handler adds `EOS_TOKEN`, which can be disabled via a handler argument; see [this](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml) or the `add_eos_token` argument below.
Users don't need to pass any extra `response` or `instruction` templates here.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `input_column_name`: Name of the input (instruction) column in the dataset.
 - `output_column_name`: Name of the output column in the dataset.
 - `add_eos_token`: whether to add `tokenizer.eos_token` to the text or not, defaults to `True`.

**Returns:**
 - Tokenized Dataset element with `input_ids`, `labels` and `attention_mask` columns, where the labels mask out the `input` section of the dataset.
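
A hypothetical config entry for this handler, assuming a dataset that has `input` and `output` columns, might look like:

```yaml
data_handlers:
  - name: tokenize_and_apply_input_masking
    arguments:
      # optional: drop the original text columns once tokenized columns exist
      remove_columns: all
      fn_kwargs:
        input_column_name: input
        output_column_name: output
        add_eos_token: true
```
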

### `apply_custom_jinja_template`:
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
Returns a dataset which contains the column `formatted_text_column_name` containing the string formatted using the provided template.
Users need to pass in an appropriate `response_template` if they specify this handler as the final handler, so that the
`DataCollatorForCompletionOnlyLM` used underneath can apply proper masking and ensure the model learns only on responses.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `template`: Jinja template to format data with. Features of the Dataset should be referred to by their key.
 - `formatted_text_column_name`: the column in which to store the rendered text.

**Returns:**
 - Formatted HF Dataset element, produced by formatting the dataset with the provided jinja template and saving the result to the `formatted_text_column_name` column.
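
A sketch of how this handler could be configured, assuming the dataset has `input` and `output` columns and that features are referenced by key inside the template (column and template contents are illustrative only):

```yaml
data_handlers:
  - name: apply_custom_jinja_template
    arguments:
      fn_kwargs:
        formatted_text_column_name: formatted_text
        # Alpaca-style template; {{ input }} / {{ output }} refer to dataset features by key
        template: "### Input: {{ input }}\n\n### Response: {{ output }}"
```
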

### `apply_tokenizer_chat_template`:
Uses the tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
Returns a dataset which contains the column `formatted_text_column_name` containing the chat template formatted string.
Since this handler does not tokenize the dataset, users need to provide an appropriate `response_template` and `instruction_template` for the
`DataCollatorForCompletionOnlyLM` used underneath to apply proper masking and ensure the model learns only on assistant responses.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `formatted_text_column_name`: the column in which to store the rendered text.
 - `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
 - Formatted HF Dataset element, produced by formatting the dataset with the tokenizer's chat template and saving the result to the `formatted_text_column_name` column.
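
A sketch of a config entry for this handler, assuming the conversations live in a hypothetical `messages` column:

```yaml
data_handlers:
  - name: apply_tokenizer_chat_template
    arguments:
      fn_kwargs:
        conversation_column: messages
        formatted_text_column_name: formatted_chat
```
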

### `tokenize_and_apply_chat_template_with_masking`:
Uses the tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
It then tokenizes the dataset while masking all user and system conversations, ensuring the model learns only on assistant responses.
Because this handler tokenizes the dataset, you don't need to pass any extra arguments for the data collator.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

*Note: it is always recommended to use this handler with `remove_columns: all` as an argument, since you don't want to retain both text columns and tokenized columns while training, which can cause a potential crash.*

**fn_args:**
 - `element`: the HF Dataset element.
 - `formatted_text_column_name`: the column in which to store the rendered text.
 - `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
 - Tokenized Dataset element containing `input_ids`, `labels` and `attention_mask`.

### `tokenize`:
Tokenizes one column of the dataset passed as input `text_column_name`.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_kwargs:**
 - `element`: the HF Dataset element.
 - `text_column_name`: The dataset column to tokenize.
 - `truncation`: Truncation strategy to use; see the [padding and truncation documentation](https://huggingface.co/docs/transformers/en/pad_truncation).
 - `max_length`: Max length to truncate the samples to.

**Return:**
 - Tokenized dataset element field `text_column_name` containing `input_ids` and `labels`.

### `duplicate_columns`:
Duplicates one column of the dataset to another new column.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `existing_column_name`: Name of the column to be duplicated.
 - `new_column_name`: Name of the new column where the duplicated column is saved.

**Return:**
 - Formatted HF dataset element with `new_column_name`, into which the `existing_column_name` content is copied.
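
For illustration, duplicating a hypothetical `input` column before further processing might look like:

```yaml
data_handlers:
  - name: duplicate_columns
    arguments:
      fn_kwargs:
        existing_column_name: input
        new_column_name: input_copy
```
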

### `skip_samples_with_large_columns`:
Skips elements of the dataset in which the specified column is larger than the passed max length.

**Type: FILTER**

**arguments**
 - Any argument supported by the [HF Filter API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.filter)

**fn_args**:
 - `element`: HF dataset element.
 - `column_name`: Name of the column to be filtered on.
 - `max_allowed_length`: Max allowed length of the column, in either characters or tokens.

**Return:**
 - A filtered dataset containing only elements where the length of column `column_name` is shorter than the max allowed length.
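
A sketch that drops overly long samples, assuming a tokenized `input_ids` column and an illustrative limit:

```yaml
data_handlers:
  - name: skip_samples_with_large_columns
    arguments:
      fn_kwargs:
        column_name: input_ids
        max_allowed_length: 4096
```
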

### `remove_columns`:
Directly calls [remove_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) in the HF API over the dataset.

**Type: REMOVE**

**arguments**:
 - `column_names`: Names of the columns to be removed from the dataset.

**fn_args**:
 - None, as this is a native API.

**Returns:**
 - Dataset with the specified `column_names` removed.

### `select_columns`:
Directly calls [select_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select_columns) in the HF API over the dataset.

**Type: SELECT**

**arguments**:
 - `column_names`: Names of the columns to be retained in the new dataset.

**fn_args**:
 - None, as this is a native API.

**Returns:**
 - Dataset where only the columns specified in `column_names` are retained.

### `rename_columns`:
Directly calls [rename_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) in the HF API over the dataset.

**Type: RENAME**

**arguments**:
 - `column_mapping`: Column names passed as a `str: str` mapping from `old_name: new_name`.

**fn_args**:
 - None, as this is a native API.

**Returns:**
 - Dataset where columns are renamed using the provided column mapping.
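
As a sketch, the native handlers can be chained, e.g. renaming columns and then keeping only the renamed ones (all column names below are placeholders):

```yaml
data_handlers:
  - name: rename_columns
    arguments:
      column_mapping:
        question: input
        answer: output
  - name: select_columns
    arguments:
      column_names:
        - input
        - output
```
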

## Additional arguments
Please note that the choice of extra arguments needed with a handler depends on how the dataset looks after processing, which is the combined result of applying the full DAG of data handlers. Choose them by referring to our other documentation [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/README.md) and [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/docs/advanced-data-preprocessing.md), and to the reference templates provided [here](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main/tests/artifacts/predefined_data_configs).

## Extra data handlers
Users are also allowed to pass custom data handlers using the [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.

The argument expects users to pass a map similar to the existing data handlers, `k(str):func(callable)`, which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) API.