Commit d86cb24

Merge tag 'v2.8.0-rc.2' into v2.8.0-rc2
Signed-off-by: Will Johnson <[email protected]>
2 parents: 456fe2a + a84b716

70 files changed (+1552251, -531 lines)

.pylintrc

Lines changed: 3 additions & 3 deletions
@@ -281,7 +281,7 @@ ignored-parents=
 max-args=5

 # Maximum number of attributes for a class (custom).
-max-attributes=10
+max-attributes=15

 # Maximum number of boolean expressions in an if statement (see R0916).
 max-bool-expr=5
@@ -299,7 +299,7 @@ max-parents=7
 max-public-methods=20

 # Maximum number of return / yield for function / method body.
-max-returns=6
+max-returns=10

 # Maximum number of statements in function / method body.
 max-statements=50
@@ -475,7 +475,7 @@ notes-rgx=
 [REFACTORING]

 # Maximum number of nested blocks for function / method body
-max-nested-blocks=5
+max-nested-blocks=6

 # Complete name of functions that never returns. When checking for
 # inconsistent-return-statements if a never returning function is called then

CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@
 # https://help.github.com/en/articles/about-code-owners
 #

-* @anhuong @Ssukriti @aluu317 @fabianlim @kmehant
+* @anhuong @dushyantbehl @aluu317 @fabianlim @kmehant

CONTRIBUTING.md

Lines changed: 7 additions & 2 deletions
@@ -30,8 +30,11 @@ To contribute to this repo, you'll use the Fork and Pull model common in many op
 Guide](https://github.com/kubernetes/community/blob/master/contributors/guide/github-workflow.md)
 from Kubernetes.

-When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PR". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers. It's best to break your contribution into smaller PRs with incremental changes, and include a good description of the changes.
-We require new unit tests to be contributed with any new functionality added.
+When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PR". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers.
+1. It's best to break your contribution into smaller PRs with incremental changes, and include a good description of the changes in the PR description.
+2. We require new unit tests to be contributed with any new functionality added.
+3. We require each feature to be documented as part of the PR. If a feature is experimental and not yet documented, it will be announced as a dev preview.
+4. We require that any new unit tests gated by conditions such as package availability be executed, and that details of those runs, along with a screenshot of the test results, be included in the PR description.

 Before sending pull requests, make sure your changes pass formatting, linting and unit tests. These checks will run with the pull request builds. Alternatively, you can run the checks manually on your local machine [as specified below](#development).

@@ -50,6 +53,8 @@ Once you've [created a pull request](#how-can-i-contribute), maintainers will re
 - Follow the project coding conventions
 - Write detailed commit messages
 - Break large changes into a logical series of smaller patches, which are easy to understand individually and combine to solve a broader issue
+- Ensure documentation is added on how to use any new capabilities.
+- Ensure follow-up issues are created for documentation, and that the feature is not officially released without full documentation.

 Maintainers will perform "squash and merge" actions on PRs in this repo, so it doesn't matter how many commits your PR has, as they will end up being a single commit after merging.

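For the gated-test requirement in item 4 of the new contributor checklist above, such tests are usually written so that they skip cleanly when the optional dependency is absent and run fully when it is installed. A minimal sketch with pytest is shown below; the package name `fms_acceleration` is used purely as an illustrative optional dependency, not as a prescribed one.

```
# Sketch of a unit test gated on optional package availability.
# "fms_acceleration" is only an illustrative optional dependency here.
import pytest

# Skip this module entirely if the optional package is not installed.
fms_acceleration = pytest.importorskip("fms_acceleration")


def test_gated_feature_is_importable():
    # A real gated test would exercise the feature that needs the optional
    # package; this placeholder only confirms the import succeeded.
    assert fms_acceleration is not None
```

When the package is missing, pytest reports the test as skipped rather than failed; when it is installed, the test executes, which is the run whose results item 4 asks contributors to capture in the PR description.
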
README.md

Lines changed: 144 additions & 62 deletions
Large diffs are not rendered by default.

build/Dockerfile

Lines changed: 3 additions & 1 deletion
@@ -149,16 +149,18 @@ RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
     python -m pip install --user wheel && \
     python -m pip install --user "$(head bdist_name)" && \
     python -m pip install --user "$(head bdist_name)[flash-attn]" && \
-    python -m pip install --user "$(head bdist_name)[mamba]"
+    python -m pip install --user --no-build-isolation "$(head bdist_name)[mamba]"

 # fms_acceleration_peft = PEFT-training, e.g., 4bit QLoRA
 # fms_acceleration_foak = Fused LoRA and triton kernels
 # fms_acceleration_aadp = Padding-Free Flash Attention Computation
+# fms_acceleration_moe = Parallelized Mixture of Experts
 RUN if [[ "${ENABLE_FMS_ACCELERATION}" == "true" ]]; then \
     python -m pip install --user "$(head bdist_name)[fms-accel]"; \
     python -m fms_acceleration.cli install fms_acceleration_peft; \
     python -m fms_acceleration.cli install fms_acceleration_foak; \
     python -m fms_acceleration.cli install fms_acceleration_aadp; \
+    python -m fms_acceleration.cli install fms_acceleration_moe; \
 fi

 RUN if [[ "${ENABLE_AIM}" == "true" ]]; then \

build/accelerate_launch.py

Lines changed: 34 additions & 4 deletions
@@ -146,6 +146,17 @@ def main():
                 save_model_dir, save_model_dir, num_added_tokens
             )

+        # In case of ScatterMoE LoRa
+        hf_converted_checkpoint = os.path.join(
+            save_model_dir, "hf_converted_checkpoint"
+        )
+        if os.path.exists(
+            os.path.join(hf_converted_checkpoint, "adapter_model.safetensors")
+        ):
+            post_process_vLLM_adapters_new_tokens(
+                hf_converted_checkpoint, hf_converted_checkpoint, num_added_tokens
+            )
+
         if (
             os.path.exists(os.path.join(output_dir, "added_tokens_info.json"))
             and job_config.get("save_strategy") != "no"
@@ -159,11 +170,30 @@
             for _, dirs, _ in os.walk(output_dir, topdown=False):
                 for name in dirs:
                     if "checkpoint-" in name.lower():
-                        post_process_vLLM_adapters_new_tokens(
-                            os.path.join(output_dir, name),
-                            os.path.join(output_dir, name),
-                            num_added_tokens,
+                        base_checkpoint_dir = os.path.join(output_dir, name)
+                        hf_converted_checkpoint = os.path.join(
+                            base_checkpoint_dir, "hf_converted_checkpoint"
+                        )
+
+                        # Use hf_converted_checkpoint if exists, otherwise use base_checkpoint_dir
+                        checkpoint_dir = (
+                            hf_converted_checkpoint
+                            if os.path.exists(
+                                os.path.join(
+                                    hf_converted_checkpoint, "adapter_model.safetensors"
+                                )
+                            )
+                            else base_checkpoint_dir
                         )
+
+                        if os.path.exists(
+                            os.path.join(checkpoint_dir, "adapter_model.safetensors")
+                        ):
+                            post_process_vLLM_adapters_new_tokens(
+                                checkpoint_dir,
+                                checkpoint_dir,
+                                num_added_tokens,
+                            )
         else:
             logging.warning(
                 "Failed to post-process: file added_tokens_info.json not in path %s",

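Both hunks above implement the same fallback rule: if a checkpoint contains an `hf_converted_checkpoint` subdirectory holding `adapter_model.safetensors` (produced for ScatterMoE LoRA runs), post-process that converted directory; otherwise fall back to the checkpoint directory itself, and only call `post_process_vLLM_adapters_new_tokens` when adapter weights are actually present. A standalone sketch of that resolution logic is below; the helper names are not part of the repository and only restate what the diff does.

```
import os


# Illustrative helpers restating the checkpoint-resolution logic in the diff above.
def resolve_adapter_dir(checkpoint_dir: str) -> str:
    """Prefer the ScatterMoE-converted subdirectory when it holds adapter weights."""
    hf_converted = os.path.join(checkpoint_dir, "hf_converted_checkpoint")
    if os.path.exists(os.path.join(hf_converted, "adapter_model.safetensors")):
        return hf_converted
    return checkpoint_dir


def has_adapter_weights(checkpoint_dir: str) -> bool:
    """Post-processing is only needed when adapter weights exist in the directory."""
    return os.path.exists(os.path.join(checkpoint_dir, "adapter_model.safetensors"))
```

Full-model checkpoints without `adapter_model.safetensors` are left untouched under this rule, which matches the guard added around the post-processing call.
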
docs/advanced-data-preprocessing.md

Lines changed: 4 additions & 40 deletions
@@ -9,7 +9,7 @@ These things are supported via what we call a [`data_config`](#data-config) whic

 ## Data Config

-Data config is a configuration file which `sft_trainer.py` supports as an argument via `--data_config` flag. In this
+Data config is a configuration file which `sft_trainer.py` supports as an argument via the `--data_config_path` flag. In this
 configuration users can describe multiple datasets, configurations on how to load the datasets and configuration on how to
 process the datasets. Users can currently pass both YAML or JSON based configuration files as data_configs.

@@ -255,7 +255,7 @@ Needless to say the sampling ratio of a datasets is a float and all the sampling
 We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to randomize the interleaving of datasets and a [`stopping_strategy`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy) to describe when to stop sampling. Both values should remain the same for experiment reproducibility. Both these values are common for all datasets and should be supplied at top level in the `datapreprocessor` as shown [above](#how-the-user-can-write-data-configs). For a list of the supported values of these arguments see the corresponding HF API.


-`Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
+Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset

 ### Data Streaming
 Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.
@@ -271,6 +271,8 @@ dataprocessor:

 When using streaming, `split_batches` in the `TrainingArguments` will automatically be set to `True`, by doing so, the main process will fetch a full batch and slice it into `num_processes` batches for each process. This means that `num_processes` must be divisible by `batch_size`. This will replace the global batch size.

+Note: Streaming datasets or use of `IterableDatasets` is not compatible with the fms-acceleration multipack plugin because the multipack sampler has to run through the full dataset every epoch. Using multipack and streaming together will raise an error.
+
 **When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk-by-chunk, data cannot run through epochs in a typical fashion as the **Trainer** can not know length of the dataset as it is being passed through. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` will overwrite `num_train_epochs` since `max_steps` directly specifies the total number of optimization steps, which is needed when dataset length cannot be known.

 If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
@@ -279,42 +281,4 @@ If the dataset size is known to the user, `max_steps` can be calculated as the t

 We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)

-## Offline Data preprocessing
-
-[This script](../scripts/offline_data_processing.py) provides the capability for users to perform standalone data
-preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and
-saves the train and validation datasets (in shards if `--num_dataset_shards` is passed) in parquet format inside the specified `output_dir`.
-A data config YAML file can be used to pass configuration to this script. Example command to run this script:
-
-```
-python scripts/offline_data_processing.py \
-    --data_config_path /path/to/data_config.yaml \
-    --model_name_or_path "model_name" \
-    --max_seq_length 4096 \
-    --output_dir /path/to/output/directory \
-    --log_level info \
-    --num_dataset_shards 3
-```
-
-Example data config file:
-
-```
-dataprocessor:
-    type: default
-    sampling_stopping_strategy: first_exhausted
-    seed: 66
-datasets:
-  - name: dataset_1
-    data_paths:
-      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
-    data_handlers:
-      - name: tokenize_and_apply_input_masking
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            input_field_name: input
-            output_field_name: output
-```
-

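On the streaming guidance retained above ("max_steps can be calculated as the total number of samples divided by the batch size"), a small worked example may help. It assumes the batch size in question is the effective global batch size, and every number below is purely illustrative.

```
# Illustrative max_steps calculation for a streamed dataset of known size.
num_samples = 100_000              # total training samples, assumed known
per_device_batch_size = 4          # per-device train batch size
num_processes = 8                  # number of GPUs / processes
gradient_accumulation_steps = 2

effective_batch_size = (
    per_device_batch_size * num_processes * gradient_accumulation_steps
)  # 64
steps_per_epoch = num_samples // effective_batch_size  # 100_000 // 64 = 1562

# Pass steps_per_epoch (times the desired number of passes over the data)
# as --max_steps, since num_train_epochs cannot be used with streaming.
print(steps_per_epoch)
```
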
docs/ept.md

Lines changed: 3 additions & 3 deletions
@@ -43,7 +43,7 @@ datasets:
 And the commandline passed to the library should include following.

 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192
 ```

 Please note that for non tokenized dataset our code adds `EOS_TOKEN` to the lines, for e.g. `Tweet` column before passing that as a dataset.
@@ -102,7 +102,7 @@ NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to spe
 Here also the command line arguments would be

 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192
 ```

 The code again would add `EOS_TOKEN` to the non tokenized data before using it and also note that the `dataset_text_field` is assumed to be same across all datasets for now.
@@ -131,7 +131,7 @@ datasets:
 The command-line arguments passed to the library should include the following:

 ```
---data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
+--data_config_path <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
 ```

 Please note when using streaming, user must pass `max_steps` instead of `num_train_epochs`. See advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.

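The notes above about `EOS_TOKEN` being appended to non-tokenized text (for example the `Tweet` column) can be pictured with a short `datasets.map` sketch. This is only an illustration of the described behaviour, not the library's internal code; the base-model path and data file are placeholders.

```
# Illustrative sketch of appending the tokenizer's EOS token to a raw text column.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/base_model")  # placeholder model path
dataset = load_dataset("json", data_files="path/to/dataset.jsonl", split="train")  # placeholder data

# Append the EOS token to each entry of the text column before training/packing.
dataset = dataset.map(lambda example: {"Tweet": example["Tweet"] + tokenizer.eos_token})

print(dataset[0]["Tweet"])  # ends with the tokenizer's EOS token
```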