- When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PR". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers. It's best to break your contribution into smaller PRs with incremental changes, and include a good description of the changes.
- We require new unit tests to be contributed with any new functionality added.
+ When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PRs". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers.
+ 1. It's best to break your contribution into smaller PRs with incremental changes and to include a good description of the changes in the PR description.
+ 2. We require new unit tests to be contributed with any new functionality added.
+ 3. We require each feature to be documented as part of the PR. If a feature is experimental and not yet documented, it will be announced as a dev preview.
+ 4. We require that any new unit tests gated by conditions such as package availability be executed, and that details of those runs, along with a screenshot of the test results, be included in the PR description.

Before sending pull requests, make sure your changes pass formatting, linting and unit tests. These checks will run with the pull request builds. Alternatively, you can run the checks manually on your local machine [as specified below](#development).
@@ -50,6 +53,8 @@ Once you've [created a pull request](#how-can-i-contribute), maintainers will re

- Follow the project coding conventions
- Write detailed commit messages
- Break large changes into a logical series of smaller patches, which are easy to understand individually and combine to solve a broader issue
+ - Ensure documentation is added on how to use any new capabilities.
+ - Ensure follow-up issues are created for documentation and that the feature is not officially released without full documentation.

Maintainers will perform "squash and merge" actions on PRs in this repo, so it doesn't matter how many commits your PR has, as they will end up being a single commit after merging.
File: docs/advanced-data-preprocessing.md (4 additions, 40 deletions)
@@ -9,7 +9,7 @@ These things are supported via what we call a [`data_config`](#data-config) whic

## Data Config

- Data config is a configuration file which `sft_trainer.py` supports as an argument via `--data_config` flag. In this
+ Data config is a configuration file which `sft_trainer.py` supports as an argument via the `--data_config_path` flag. In this
configuration users can describe multiple datasets, configurations on how to load the datasets and configuration on how to
process the datasets. Users can currently pass both YAML or JSON based configuration files as data_configs.
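For orientation, a minimal data config might look roughly like the sketch below. The field and handler names here are illustrative placeholders rather than the authoritative schema; the example data configs referenced at the end of this document show the exact format.

```yaml
dataprocessor:
  type: default                # top-level options for the data preprocessor
datasets:
  - name: my_dataset           # arbitrary label for this dataset entry
    data_paths:
      - "/path/to/train.jsonl" # one or more dataset files to load
    data_handlers:             # optional processing steps applied to this dataset
      - name: example_handler  # placeholder handler name
        arguments:
          batched: false       # handler arguments are passed through to the processing routine
```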
@@ -255,7 +255,7 @@ Needless to say the sampling ratio of a datasets is a float and all the sampling

We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to randomize the interleaving of datasets and a [`stopping_strategy`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy) to describe when to stop sampling. Both values should remain the same for experiment reproducibility. Both these values are common for all datasets and should be supplied at top level in the `dataprocessor` as shown [above](#how-the-user-can-write-data-configs). For a list of the supported values of these arguments see the corresponding HF API.

- `Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
+ Note: If a user specifies data sampling, they can expect the datasets to be mixed, and individual samples in the dataset will not be broken up unless the `max_seq_len` argument is smaller than the length of individual samples in the dataset.
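As a rough sketch, sampling-related settings could be laid out along these lines, again with indicative rather than authoritative key names (consult the predefined example configs for the exact schema):

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted  # when to stop drawing from the interleaved datasets
  seed: 42                                   # fixed seed for reproducible interleaving
datasets:
  - name: dataset_one
    sampling: 0.7                            # float sampling ratio for this dataset
    data_paths:
      - "/path/to/dataset_one.jsonl"
  - name: dataset_two
    sampling: 0.3
    data_paths:
      - "/path/to/dataset_two.jsonl"
```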
### Data Streaming

Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.

@@ -271,6 +271,8 @@ dataprocessor:

When using streaming, `split_batches` in the `TrainingArguments` will automatically be set to `True`; the main process will then fetch a full batch and slice it into `num_processes` batches, one for each process. This means that `batch_size` must be divisible by `num_processes`. This will replace the global batch size.

+ Note: Streaming datasets or use of `IterableDatasets` is not compatible with the fms-acceleration multipack plugin, because the multipack sampler has to run through the full dataset every epoch. Using multipack and streaming together will raise an error.

**When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk by chunk, data cannot run through epochs in a typical fashion, as the **Trainer** cannot know the length of the dataset as it is being passed through. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` will overwrite `num_train_epochs`, since `max_steps` directly specifies the total number of optimization steps, which is needed when the dataset length cannot be known.

If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
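For example, if a dataset is known to contain 100,000 samples and the effective batch size is 8 (illustrative numbers), a single pass over the data corresponds to `max_steps = 100000 / 8 = 12500`, and roughly two passes would correspond to `max_steps = 25000`.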
@@ -279,42 +281,4 @@ If the dataset size is known to the user, `max_steps` can be calculated as the t

We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)

- ## Offline Data preprocessing
-
- [This script](../scripts/offline_data_processing.py) provides the capability for users to perform standalone data
- preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and
- saves the train and validation datasets (in shards, if `--num_dataset_shards` is passed) in parquet format inside the specified `output_dir`.
- A data config YAML file can be used to pass configuration to this script. Example command to run this script:
File: docs/ept.md (3 additions, 3 deletions)
@@ -43,7 +43,7 @@ datasets:

And the command line passed to the library should include the following.

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192
```

Please note that for non-tokenized datasets our code adds the `EOS_TOKEN` to the lines, e.g. to the `Tweet` column, before passing them in as a dataset.
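As a purely illustrative example, if a row's `Tweet` value were `"Hello world"` and the model's tokenizer used `</s>` as its EOS token, the text handed to the trainer would effectively be `"Hello world</s>"`; the exact EOS string depends on the tokenizer in use.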
@@ -102,7 +102,7 @@ NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to spe

Here also the command line arguments would be

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192
```

The code again would add the `EOS_TOKEN` to the non-tokenized data before using it. Also note that the `dataset_text_field` is assumed to be the same across all datasets for now.
@@ -131,7 +131,7 @@ datasets:

The command-line arguments passed to the library should include the following:

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
```

Please note that when using streaming, the user must pass `max_steps` instead of `num_train_epochs`. See the advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.