Commit 8d4ba0b

docs: update the --data_config flag to --data_config_path (foundation-model-stack#522)
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Parent: 30ceecc

2 files changed: +4 −4 lines

docs/advanced-data-preprocessing.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -9,7 +9,7 @@ These things are supported via what we call a [`data_config`](#data-config) whic
 
 ## Data Config
 
-Data config is a configuration file which `sft_trainer.py` supports as an argument via `--data_config` flag. In this
+Data config is a configuration file which `sft_trainer.py` supports as an argument via `--data_config_path` flag. In this
 configuration users can describe multiple datasets, configurations on how to load the datasets and configuration on how to
 process the datasets. Users can currently pass both YAML or JSON based configuration files as data_configs.
````
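For readers unfamiliar with the file the renamed flag points at, a minimal data_config might look like the sketch below. Only the top-level `datasets` key is attested in this commit's diff context (see the hunk headers in docs/ept.md); the per-dataset fields (`name`, `data_paths`) are illustrative assumptions, so consult the library's data preprocessing documentation for the actual schema.

```yaml
# Illustrative data_config (YAML). The `datasets` key appears in the
# repo's docs; the fields below it are assumptions for illustration only.
datasets:
  - name: my_dataset            # hypothetical dataset entry
    data_paths:                 # hypothetical field: files to load
      - /path/to/train.jsonl
```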
docs/ept.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -43,7 +43,7 @@ datasets:
 And the commandline passed to the library should include following.
 
 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192
 ```
 
 Please note that for non tokenized dataset our code adds `EOS_TOKEN` to the lines, for e.g. `Tweet` column before passing that as a dataset.
````
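To show the renamed flag in the context of a full command line, the fragment below assembles an invocation as a string and prints it. This is a sketch only: `python -m tuning.sft_trainer` as the entry point and all paths are assumptions, not taken from this commit.

```shell
# Sketch only: the module path and file paths are placeholders/assumptions.
CMD="python -m tuning.sft_trainer --data_config_path ./data_config.yaml --packing=True --max_seq_len 8192"
echo "$CMD"
```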
````diff
@@ -102,7 +102,7 @@ NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to spe
 Here also the command line arguments would be
 
 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192
 ```
 
 The code again would add `EOS_TOKEN` to the non tokenized data before using it and also note that the `dataset_text_field` is assumed to be same across all datasets for now.
````
````diff
@@ -131,7 +131,7 @@ datasets:
 The command-line arguments passed to the library should include the following:
 
 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
 ```
 
 Please note when using streaming, user must pass `max_steps` instead of `num_train_epochs`. See advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.
````
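Since streaming requires `--max_steps` instead of `--num_train_epochs`, the step count must be computed up front. One rough back-of-envelope conversion (an assumption about how a user might size it, not documented library behavior) is ceil(examples × epochs / effective batch size):

```shell
# Rough sizing sketch; all numbers are placeholders.
EXAMPLES=100000   # approximate dataset size
EPOCHS=1          # desired passes over the data
BATCH=8           # per-device batch size
ACCUM=4           # gradient accumulation steps
EFFECTIVE=$((BATCH * ACCUM))
MAX_STEPS=$(( (EXAMPLES * EPOCHS + EFFECTIVE - 1) / EFFECTIVE ))  # ceiling division
echo "$MAX_STEPS"   # prints 3125
```

The resulting value would then be passed as `--max_steps` in the command line above.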
