For more details on how to enable and use the trackers, please see [the experiment tracking section below](#experiment-tracking).

## Data Support

Users can pass training data in a single file using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below); the file can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.

Below, we list the data use cases supported via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).

## Supported Data Formats

We support the following data formats via the `--training_data_path` argument:

Data Format | Tested Support
------------|---------------
JSON        | ✅
JSONL       | ✅
PARQUET     | ✅
ARROW       | ✅
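
For example, a file in any of these formats can be passed directly. The invocation below is only a sketch: it assumes the `tuning/sft_trainer.py` entry point and the usual model/output arguments, and the file name is a placeholder.

```
python tuning/sft_trainer.py \
  --model_name_or_path ibm-granite/granite-3.0-8b-instruct \
  --training_data_path my_dataset.parquet \
  --output_dir ./output
```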
## Use cases supported with `training_data_path` argument
### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.

#### 1.1 Pre-process the dataset

Pre-process the dataset so that each data instance contains a single sequence of input + response. The trainer is configured to expect a `response template` as a string. For example, to prepare data in the `alpaca` format for this trainer, the conversion can be done with code along the following lines.

```python
import datasets

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

def format_alpaca_fn(example):
    # Pick the prompt variant based on whether the example carries an `input` field,
    # then append the response so that each record becomes a single training sequence.
    template = PROMPT_DICT["prompt_input"] if example["input"] else PROMPT_DICT["prompt_no_input"]
    return {"output": template.format_map(example) + " " + example["output"]}

ds = datasets.load_dataset("json", data_files="./alpaca_data.json")  # path to an alpaca-format file (placeholder)
alpaca_ds = ds["train"].map(format_alpaca_fn, remove_columns=["instruction", "input"])
alpaca_ds.to_json("sft_alpaca_data.json")
```

The `response template` corresponding to the above dataset and the `Llama` tokenizer is `\n### Response:`.

The same approach can be applied to any dataset; more information can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).

Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
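
As a sketch, tuning on the pre-processed alpaca file produced above might then be launched as follows; the entry point and the model/output arguments are assumptions, while the data-related flags are the ones described in this section.

```
python tuning/sft_trainer.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --training_data_path sft_alpaca_data.json \
  --dataset_text_field "output" \
  --response_template "\n### Response:" \
  --output_dir ./output
```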

#### 1.2 Format the dataset on the fly

Pass a dataset and a `data_formatter_template` to apply the formatting function on the fly while tuning. The template should specify fields of the dataset with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data fields can contain alpha-numeric characters, spaces, and the following special symbols: ".", "_", "-".
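
For illustration, the field names below are hypothetical. Given records such as `{"input": "Colorado is a state in USA", "label": "USA : Location"}`, a formatter template consistent with the response template mentioned next could be:

```
data_formatter_template: "### Input: {{input}} \n\n## Label: {{label}}"
```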

Formatting will happen on the fly while tuning. The keys in the template should match fields in the dataset file. The `response template` corresponding to the above template will need to be supplied; in this case, `response template` = `\n## Label:`.

##### In conclusion, if using the response_template and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.
### 2. Dataset with input and output fields (no response template)

Pass a [supported dataset](#supported-data-formats) containing fields `"input"` with source text and `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.

The `"input"` and `"output"` field names are mandatory and cannot be changed.

Example: for a JSONL dataset like `Train.jsonl`:
```
{"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
{"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
```
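
For this use case, no `response_template` or `dataset_text_field` needs to be supplied; only the training data path is passed among the data options. The entry point and the model/output arguments below are assumptions, shown for illustration:

```
python tuning/sft_trainer.py \
  --model_name_or_path ibm-granite/granite-3.0-8b-instruct \
  --training_data_path Train.jsonl \
  --output_dir ./output
```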

### 3. Chat Style Single/Multi turn datasets

Pass a dataset containing single/multi turn chat data. Your dataset could follow this format:
```
$ head -n 1 train.jsonl
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```
This format supports both single and multi-turn chat scenarios.

The chat template used to render the dataset will default to `tokenizer.chat_template` from the model's tokenizer configuration. This can be overridden using the `--chat_template <chat-template-string>` argument. For example, models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which include a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/e0a466fb25b9e07e9c2dc93380a360189700d1f8/tokenizer_config.json#L188) in their `tokenizer_config.json`, do not require users to provide a chat template to process the data.

Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the `assistant` and `human` responses inside the formatted chat template.

For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values are the role markers defined in its chat template.
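
As an illustration, the marker strings below are an assumption read from the granite-3.0 chat template; verify them against the model's `tokenizer_config.json` before use.

```
--instruction_template "<|start_of_role|>user<|end_of_role|>"
--response_template "<|start_of_role|>assistant<|end_of_role|>"
```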

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of the text, ensuring the model learns only from the `assistant` responses for both single and multi turn chat.
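
A minimal sketch of what this masking does (not something users need to write themselves; the model name and role markers are the same assumptions as above):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator pads batches, so a pad token is needed

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|start_of_role|>assistant<|end_of_role|>",
    instruction_template="<|start_of_role|>user<|end_of_role|>",
    tokenizer=tokenizer,
)

# Render one conversation with the model's chat template, tokenize it, and collate.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Look up a word that rhymes with exist"},
     {"role": "assistant", "content": "Mist"}],
    tokenize=False,
)
batch = collator([tokenizer(text)])
# batch["labels"] is -100 everywhere except the assistant response tokens,
# so loss is computed only on assistant turns.
```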

### 4. Pre tokenized datasets

Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.
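
A hypothetical invocation (the file name and the model/output arguments are placeholders):

```
python tuning/sft_trainer.py \
  --model_name_or_path ibm-granite/granite-3.0-8b-instruct \
  --training_data_path my_pretokenized_dataset.arrow \
  --output_dir ./output
```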

For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).
## Supported Models

- For each tuning technique, we run testing on a single large model of each architecture type and claim support for the smaller models. For example, with the QLoRA technique, we tested on granite-34b GPTBigCode and claim support for granite-20b-multilingual.
## Experiment Tracking
Experiment tracking in fms-hf-tuning allows users to track their experiments with known trackers like [Aimstack](https://aimstack.io/), [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html) or custom trackers built into the code like