
Commit 3ec30a0

Merge pull request #428 from foundation-model-stack/new_release_2.3.0
chore: merge set of changes for v2.3.0
2 parents 054a985 + 594dd37 · commit 3ec30a0

52 files changed: +294,324 -211 lines changed


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ venv/
 # Aim
 .aim
 
+# Mlflow
+mlruns/
+
 # Backup files and folders
 *.bkp
 *.bkp.*

.pylintrc

Lines changed: 2 additions & 2 deletions
@@ -280,8 +280,8 @@ ignored-parents=
 # Maximum number of arguments for function / method.
 max-args=5
 
-# Maximum number of attributes for a class (see R0902).
-max-attributes=7
+# Maximum number of attributes for a class (custom).
+max-attributes=10
 
 # Maximum number of boolean expressions in an if statement (see R0916).
 max-bool-expr=5

README.md

Lines changed: 65 additions & 15 deletions
@@ -61,13 +61,28 @@ pip install fms-hf-tuning[aim]
 ```
 For more details on how to enable and use the trackers, Please see, [the experiment tracking section below](#experiment-tracking).
 
-## Data format
-We support the following data formats:
+## Data Support
+Users can pass training data in a single file using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below), and the file can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.
 
-### 1. JSON formats with a single sequence and a specified response_template to use for masking on completion.
 
-#### 1.1 Pre-process the JSON/JSONL dataset
-Pre-process the JSON/JSONL dataset to contain a single sequence of each data instance containing input + Response. The trainer is configured to expect a response template as a string. For example, if one wants to prepare the `alpaca` format data to feed into this trainer, it is quite easy and can be done with the following code.
+Below we list the supported data use cases via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).
+
+## Supported Data Formats
+We support the following data formats via the `--training_data_path` argument:
+
+Data Format | Tested Support
+------------|---------------
+JSON | ✅
+JSONL | ✅
+PARQUET | ✅
+ARROW | ✅
+
+## Use cases supported with `training_data_path` argument
+
+### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
+
+#### 1.1 Pre-process the dataset
+Pre-process the dataset so that each data instance contains a single sequence of input + response. The trainer is configured to expect a `response template` as a string. For example, if one wants to prepare the `alpaca` format data to feed into this trainer, it is quite easy and can be done with the following code.
 
 ```python
 PROMPT_DICT = {
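The diff view truncates the formatting snippet referenced in the paragraph above. Below is a minimal sketch of the kind of alpaca-style pre-processing described, not the commit's exact code; the field names (`instruction`, `input`, `output`), the file paths, and the prompt wording are assumptions based on the standard alpaca format.

```python
# Minimal sketch: convert alpaca-style records into single sequences
# (prompt + response) that the trainer can consume via --dataset_text_field.
import json

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

def format_alpaca_fn(example):
    # Pick the template depending on whether the optional "input" field is set.
    prompt = (
        PROMPT_DICT["prompt_input"] if example.get("input", "")
        else PROMPT_DICT["prompt_no_input"]
    )
    # Concatenate prompt + output into the single sequence the trainer expects.
    return {"output": prompt.format_map(example) + example["output"]}

# Hypothetical file names; write one formatted record per line (JSONL), then
# pass --dataset_text_field "output" and --response_template "\n### Response:".
with open("alpaca_data.json", "r") as f, open("alpaca_formatted.jsonl", "w") as out:
    for record in json.load(f):
        out.write(json.dumps(format_alpaca_fn(record)) + "\n")
```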
@@ -99,11 +114,10 @@ The `response template` corresponding to the above dataset and the `Llama` token
 
 The same way can be applied to any dataset, with more info can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).
 
-Once the JSON is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
+Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
 
-#### 1.2 Format JSON/JSONL on the fly
-Pass a JSON/JSONL and a `data_formatter_template` to use the formatting function on the fly while tuning. The template should specify fields of JSON with `{{field}}`. While tuning, the data will be converted to a single sequence using the template.
-JSON fields can contain alpha-numeric characters, spaces and the following special symbols - "." , "_", "-".
+#### 1.2 Format the dataset on the fly
+Pass a dataset and a `data_formatter_template` to use the formatting function on the fly while tuning. The template should specify fields of the dataset with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data fields can contain alphanumeric characters, spaces and the special symbols ".", "_", "-".
 
 Example: Train.json
 `[{ "input" : <text>,
@@ -113,23 +127,58 @@ Example: Train.json
 ]`
 data_formatter_template: `### Input: {{input}} \n\n##Label: {{output}}`
 
-Formatting will happen on the fly while tuning. The keys in template should match fields in JSON file. The `response template` corresponding to the above template will need to be supplied. in this case, `response template` = `\n## Label:`.
+Formatting will happen on the fly while tuning. The keys in the template should match fields in the dataset file. The `response template` corresponding to the above template will need to be supplied; in this case, `response template` = `\n## Label:`.
 
 ##### In conclusion, if using the reponse_template and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.
 
-### 2. JSON/JSONL with input and output fields (no response template)
+### 2. Dataset with input and output fields (no response template)
 
-Pass a JSON/JSONL containing fields "input" with source text and "output" with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of input to create single sequence, and input will be masked.
+Pass a [supported dataset](#supported-data-formats) containing fields `"input"` with source text and `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.
 
-The "input" and "output" field names are mandatory and cannot be changed.
+The `"input"` and `"output"` field names are mandatory and cannot be changed.
 
-Example: Train.jsonl
+Example: For a JSON dataset like `Train.jsonl`:
 
 ```
 {"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
 {"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
 ```
 
+### 3. Chat Style Single/Multi turn datasets
+
+Pass a dataset containing single/multi turn chat data. Your dataset could follow this format:
+
+```
+$ head -n 1 train.jsonl
+{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
+```
+
+This format supports both single and multi-turn chat scenarios.
+
+The chat template used to render the dataset will default to `tokenizer.chat_template` from the model's tokenizer configuration. This can be overridden using the `--chat_template <chat-template-string>` argument. For example, models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which include a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/e0a466fb25b9e07e9c2dc93380a360189700d1f8/tokenizer_config.json#L188) in their `tokenizer_config.json`, do not require users to provide a chat template to process the data.
+
+Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the
+`assistant` and `human` responses inside the formatted chat template.
+For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values would be:
+```
+--instruction_template "<|start_of_role|>user<|end_of_role|>"
+--response_template "<|start_of_role|>assistant<|end_of_role|>"
+```
+
+The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of the text, ensuring the model learns only on the `assistant` responses for both single and multi-turn chat.
+
+### 4. Pre-tokenized datasets
+
+Users can also pass a pre-tokenized dataset (containing `input_ids` and `labels` columns) as the `--training_data_path` argument, e.g.
+
+```
+python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
+```
+
+### 5. Advanced data preprocessing
+
+For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).
+
 ## Supported Models
 
 - For each tuning technique, we run testing on a single large model of each architecture type and claim support for the smaller models. For example, with QLoRA technique, we tested on granite-34b GPTBigCode and claim support for granite-20b-multilingual.
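The use cases described in this README hunk all flow through `tuning/sft_trainer.py` with the flags named above. Two illustrative invocations follow as a rough sketch only: the model identifier, data file names, and any flag not mentioned in the hunk (e.g. `--model_name_or_path`) are assumptions, not values taken from this commit.

```
# Use case 1.2 (sketch): format a JSON dataset on the fly and mask on completion
python tuning/sft_trainer.py \
    --model_name_or_path <model-path> \
    --training_data_path train.json \
    --data_formatter_template "### Input: {{input}} \n\n## Label: {{output}}" \
    --response_template "\n## Label:"

# Use case 3 (sketch): chat-style data rendered with the model's own chat template
python tuning/sft_trainer.py \
    --model_name_or_path ibm-granite/granite-3.0-8b-instruct \
    --training_data_path train.jsonl \
    --instruction_template "<|start_of_role|>user<|end_of_role|>" \
    --response_template "<|start_of_role|>assistant<|end_of_role|>"
```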
@@ -823,12 +872,13 @@ For details about how you can use set a custom stopping criteria and perform cus
 
 ## Experiment Tracking
 
-Experiment tracking in fms-hf-tuning allows users to track their experiments with known trackers like [Aimstack](https://aimstack.io/) or custom trackers built into the code like
+Experiment tracking in fms-hf-tuning allows users to track their experiments with known trackers like [Aimstack](https://aimstack.io/), [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html) or custom trackers built into the code like
 [FileLoggingTracker](./tuning/trackers/filelogging_tracker.py)
 
 The code supports currently two trackers out of the box,
 * `FileLoggingTracker` : A built in tracker which supports logging training loss to a file.
 * `Aimstack` : A popular opensource tracker which can be used to track any metrics or metadata from the experiments.
+* `MLflow Tracking` : Another popular opensource tracker which stores metrics, metadata or even artifacts from experiments.
 
 Further details on enabling and using the trackers mentioned above can be found [here](docs/experiment-tracking.md).
 
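As with Aimstack, the new MLflow tracker is shipped as an optional dependency: the Dockerfile change below installs the `[mlflow]` extra when enabled, and a local install would presumably mirror the existing `pip install fms-hf-tuning[aim]` instruction shown earlier in the README.

```
pip install fms-hf-tuning[mlflow]
```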

build/Dockerfile

Lines changed: 6 additions & 1 deletion
@@ -19,8 +19,9 @@ ARG USER=tuning
 ARG USER_UID=1000
 ARG PYTHON_VERSION=3.11
 ARG WHEEL_VERSION=""
-## Enable Aimstack if requested via ENABLE_AIM set to "true"
+## Enable Aimstack or MLflow if requested via ENABLE_AIM/MLFLOW set to "true"
 ARG ENABLE_AIM=false
+ARG ENABLE_MLFLOW=false
 ARG ENABLE_FMS_ACCELERATION=true
 
 ## Base Layer ##################################################################
@@ -151,6 +152,10 @@ RUN if [[ "${ENABLE_AIM}" == "true" ]]; then \
     python -m pip install --user "$(head bdist_name)[aim]"; \
     fi
 
+RUN if [[ "${ENABLE_MLFLOW}" == "true" ]]; then \
+    python -m pip install --user "$(head bdist_name)[mlflow]"; \
+    fi
+
 # Clean up the wheel module. It's only needed by flash-attn install
 RUN python -m pip uninstall wheel build -y && \
 # Cleanup the bdist whl file
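A usage sketch for the new build argument; the image tag and build context below are illustrative assumptions, and `ENABLE_AIM` can be toggled the same way.

```
# Build the tuning image with the MLflow extra baked in (tag is hypothetical)
docker build -t fms-hf-tuning:mlflow \
    --build-arg ENABLE_MLFLOW=true \
    -f build/Dockerfile .
```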
