
Commit 1619bf9

Update offline-data-preprocessing.md
Signed-off-by: Dushyant Behl <[email protected]>
1 parent 8c16f2d commit 1619bf9

File tree

1 file changed: +7 additions, -4 deletions


docs/offline-data-preprocessing.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -33,7 +33,8 @@ python -m tuning.sft_trainer \
     --output_dir /path/to/output/directory \
     --log_level info \
     --num_train_dataset_shards 10 \
-    --num_eval_dataset_shards 1
+    --num_eval_dataset_shards 1 \
+    --do_dataprocessing_only
 ```
 
 Additionally, once the offline data processing is complete, users can leverage the shards stored in `output_dir` for tuning by passing it through the `--training_data_path` flag or passing it via `data_paths` argument in data config yaml, provided they find the sharded datasets beneficial for training.
````
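The `data_paths` route mentioned above could look roughly like the following. This is a hypothetical sketch only: the surrounding keys (`datasets:`, `name:`) are assumed for illustration and may not match the library's actual data config schema, so check the project's data config documentation for the real field names.

```yaml
# Hypothetical data config sketch (schema assumed, not verified):
# point data_paths at the directory holding the pre-processed shards.
datasets:
  - name: preprocessed_shards
    data_paths:
      - /path/to/output/directory   # same dir passed as --output_dir above
```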
````diff
@@ -113,7 +114,8 @@ python -m tuning.sft_trainer \
     --response_template "<|start_of_role|>assistant<|end_of_role|>" \
     --split_batches "true" \
     --use_flash_attn "true" \
-    --num_train_dataset_shards "10"
+    --num_train_dataset_shards "10" \
+    --do_dataprocessing_only
 ```
 
 The resulting shards are saved in the directory `/test/data/offline_processing_shards`, as specified by the `--output_dir` argument. These shards can then be used for tuning the model by pointing the `training_data_path` argument to the directory where the shards are stored—in this example,
````
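Reusing those shards for an actual tuning run could be sketched as below. This is a minimal, non-runnable illustration: it assumes the `tuning.sft_trainer` environment from the examples above is installed, and the output path is a placeholder, not a prescribed location.

```shell
# Sketch: tune using the shards produced by the offline pass.
# Assumes the tuning package from the examples above is installed;
# /path/to/tuned/model is a placeholder output directory.
python -m tuning.sft_trainer \
    --training_data_path /test/data/offline_processing_shards \
    --output_dir /path/to/tuned/model
```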
````diff
@@ -175,5 +177,6 @@ accelerate launch \
     --use_reentrant="true" \
     --warmup_ratio="0.1" \
     --warmup_steps="200" \
-    --weight_decay="0.1"
-```
+    --weight_decay="0.1" \
+    --do_dataprocessing_only
+```
````

0 commit comments
