
Commit d48d483

dushyantbehl and willmj authored
docs: Add documentation on how to do EPT runs with our library. (#461)
* add documentation on ept
* Update docs/ept.md (Co-authored-by: Will Johnson <[email protected]>)
* Apply suggestions from code review (Co-authored-by: Will Johnson <[email protected]>)
* Add additional information
* fix statement

Signed-off-by: Dushyant Behl <[email protected]>
Co-authored-by: Will Johnson <[email protected]>
1 parent f88f031 commit d48d483

File tree

2 files changed: +116, -0 lines changed


README.md

Lines changed: 4 additions & 0 deletions
@@ -13,6 +13,7 @@
- [Prompt Tuning](#prompt-tuning)
- [Fine Tuning](#fine-tuning)
- [FMS Acceleration](#fms-acceleration)
+ - [Extended Pre-Training](#extended-pre-training)
- [Inference](#inference)
- [Running a single example](#running-a-single-example)
- [Running multiple examples](#running-multiple-examples)
@@ -828,6 +829,9 @@ Number of trainable parameters = 13,631,488
The `fms_acceleration.cli` can do more to search for all available configs, plugins and arguments, [see the advanced flow](https://github.com/foundation-model-stack/fms-acceleration#advanced-flow).

+ ## Extended Pre-Training
+
+ We also support extended pre-training, where users may want to pretrain a model on a large number of samples. Please refer to our separate doc on [EPT Use Cases](./docs/ept.md).

## Inference
Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.

docs/ept.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Extended Pre-Training Support

Our library also supports Extended Pre-Training (EPT), which is generally useful when users want to train a pretrained model on a large number of samples. The training behaviour of EPT is similar to that of pretraining: the model is run through the entire available corpus and trained on the whole set of tokens, without any specific masking.

See [below](#additional-information) for information on when this document was last updated and the release which supports this feature.

## Packing Support

We support training via `packing` of dataset samples by specifying `--packing=True` in the command-line parameters. Users can also choose to specify `--max_seq_len=<value, e.g. 4096 or 8192>` to set the maximum sequence length of each chunk after packing.
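
For intuition, packing concatenates tokenized samples into one long token stream and slices it into fixed-length chunks. A minimal sketch of the idea (not the library's actual implementation; it assumes samples are already tokenized to integer ids):

```
# Minimal sketch of packing, for intuition only (not the library's implementation):
# tokenized samples are concatenated and cut into chunks of max_seq_len tokens.
def pack(tokenized_samples, max_seq_len):
    stream = [tok for sample in tokenized_samples for tok in sample]
    return [stream[i:i + max_seq_len] for i in range(0, len(stream), max_seq_len)]

# Three short "samples" packed into chunks of length 8
print(pack([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_seq_len=8))
# -> [[1, 2, 3, 4, 5, 6, 7, 8], [9]]
```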

We provide details below on how to use different styles of datasets with the library.

## Non-Tokenized Dataset

### Single Non-Tokenized Dataset

Users can pass a single dataset to the library by using a [data_config](./advanced-data-preprocessing.md#data-config).
Let's say you have a `JSONL` data file in which each line contains text you want to perform EPT on; you can create a `data_config` for the dataset in this manner.

Example dataset:

```
{"Tweet":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
{"Tweet":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
...
```
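
If you want to sanity-check such a `JSONL` file before kicking off a run, you can load it directly with the Hugging Face `datasets` library (a quick illustrative check, independent of this library's own preprocessing; the file name is a placeholder):

```
# Quick sanity check of a JSONL text dataset (file name is a placeholder)
from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")
print(ds.column_names)   # e.g. ['Tweet', 'ID', 'Label', 'text_label', 'output']
print(ds[0]["Tweet"])    # first raw text record
```
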
Sample data config for the above use case.
```
dataprocessor:
  type: default
datasets:
  - name: non_tokenized_text_dataset
    data_paths:
      - "<path-to-the-jsonl-dataset>"
    data_handlers:
      - name: apply_custom_data_formatting
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
```
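
Before launching a run, it can be handy to confirm the config parses as valid YAML. A minimal check (the file name is just a placeholder):

```
# Minimal YAML syntax check for the data config (file name is a placeholder)
import yaml

with open("ept_data_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["datasets"][0]["name"])                       # non_tokenized_text_dataset
print(cfg["datasets"][0]["data_handlers"][0]["name"])   # apply_custom_data_formatting
```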

And the command line passed to the library should include the following:

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```

Please note that for a non-tokenized dataset our code appends the `EOS_TOKEN` to each line (e.g. to the `Tweet` column) before passing it on as a dataset.
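
As a rough illustration of that step (not the library's internal code), appending an EOS token to a raw text field could look like this, using any Hugging Face tokenizer:

```
# Illustration only: appending the tokenizer's EOS token to each raw text record
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM tokenizer works here

def append_eos(example, text_field="Tweet"):
    example[text_field] = example[text_field] + tokenizer.eos_token
    return example

print(append_eos({"Tweet": "@HMRCcustomers No this is my first job"}))
# {'Tweet': '@HMRCcustomers No this is my first job<|endoftext|>'}
```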

### Multiple Non-Tokenized Datasets

If a user wants to utilize multiple datasets and [`sample`](./advanced-data-preprocessing.md#how-the-user-can-write-data-configs) from them, this can be achieved by specifying multiple datasets in the data config with different sampling ratios.

Sample data config for sampling among multiple datasets:
```
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
datasets:
  - name: non_tokenized_text_dataset_1
    sampling: 0.3
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
  - name: non_tokenized_text_dataset_2
    sampling: 0.4
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
  - name: non_tokenized_text_dataset_3
    sampling: 0.3
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
```

NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to specify data mixing parameters in the `data_config` is covered in the [data mixing](./advanced-data-preprocessing.md#data-mixing) section of the advanced data preprocessing documentation.
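
For intuition, the sampling ratios, seed, and stopping strategy above behave much like interleaving datasets with the Hugging Face `datasets` API. A rough sketch of that behaviour (not this library's exact implementation; file names are placeholders):

```
# Rough illustration of the sampling behaviour configured above (placeholder file names)
from datasets import load_dataset, interleave_datasets

ds1 = load_dataset("json", data_files="dataset_1.jsonl", split="train")
ds2 = load_dataset("json", data_files="dataset_2.jsonl", split="train")
ds3 = load_dataset("json", data_files="dataset_3.jsonl", split="train")

mixed = interleave_datasets(
    [ds1, ds2, ds3],
    probabilities=[0.3, 0.4, 0.3],        # matches the sampling ratios above
    seed=66,                              # matches the dataprocessor seed
    stopping_strategy="first_exhausted",  # stop when the first dataset is exhausted
)
```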

Here too, the command line arguments would be:

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```

Here again, the code appends the `EOS_TOKEN` to the non-tokenized data before using it. Also note that, for now, the `dataset_text_field` is assumed to be the same across all datasets.

### Additional Information

This feature is supported post [v2.3.1](https://github.com/foundation-model-stack/fms-hf-tuning/releases/tag/v2.3.1) of this library.
Last updated on: 10-02-2025
