- When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PR". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers. It's best to break your contribution into smaller PRs with incremental changes, and include a good description of the changes.
- We require new unit tests to be contributed with any new functionality added.
+ When your contribution is ready, you can create a pull request. Pull requests are often referred to as "PRs". In general, we follow the standard [GitHub pull request](https://help.github.com/en/articles/about-pull-requests) process. Follow the template to provide details about your pull request to the maintainers.
+ 1. It's best to break your contribution into smaller PRs with incremental changes and to include a good description of the changes in the PR description.
+ 2. We require new unit tests to be contributed with any new functionality added.
+ 3. We require each feature to be documented as part of the PR. If a feature is experimental and not yet documented, it will be announced as a dev preview.
+ 4. We require that any new unit tests gated by conditions such as package availability be executed, and that details of those runs, along with a screenshot of the test results, be included in the PR description.

Before sending pull requests, make sure your changes pass formatting, linting and unit tests. These checks will run with the pull request builds. Alternatively, you can run the checks manually on your local machine [as specified below](#development).
@@ -50,6 +53,8 @@ Once you've [created a pull request](#how-can-i-contribute), maintainers will re

- Follow the project coding conventions
- Write detailed commit messages
- Break large changes into a logical series of smaller patches, which are easy to understand individually and combine to solve a broader issue
+ - Ensure documentation is added on how to use any new capabilities.
+ - Ensure follow-up issues are created for documentation and that the feature is not officially released without full documentation.

Maintainers will perform "squash and merge" actions on PRs in this repo, so it doesn't matter how many commits your PR has, as they will end up being a single commit after merging.
File: docs/advanced-data-preprocessing.md (4 additions, 40 deletions)
@@ -9,7 +9,7 @@ These things are supported via what we call a [`data_config`](#data-config) whic

## Data Config

- Data config is a configuration file which `sft_trainer.py` supports as an argument via `--data_config` flag. In this
+ Data config is a configuration file which `sft_trainer.py` supports as an argument via the `--data_config_path` flag. In this
configuration users can describe multiple datasets, configurations on how to load the datasets and configuration on how to
process the datasets. Users can currently pass both YAML or JSON based configuration files as data_configs.
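For orientation, a minimal data config might look roughly like the sketch below. The field and handler names here are illustrative placeholders rather than the authoritative schema; the example data configs referenced at the end of this document show the exact format.

```yaml
dataprocessor:
  type: default                # top-level options for the data preprocessor
datasets:
  - name: my_dataset           # arbitrary label for this dataset entry
    data_paths:
      - "/path/to/train.jsonl" # one or more dataset files to load
    data_handlers:             # optional processing steps applied to this dataset
      - name: example_handler  # placeholder handler name
        arguments:
          batched: false       # handler arguments are passed through to the processing routine
```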
@@ -255,7 +255,7 @@ Needless to say the sampling ratio of a datasets is a float and all the sampling

We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to randomize the interleaving of datasets and a [`stopping_strategy`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy) to describe when to stop sampling. Both values should remain the same for experiment reproducibility. Both these values are common for all datasets and should be supplied at top level in the `dataprocessor` as shown [above](#how-the-user-can-write-data-configs). For a list of the supported values of these arguments see the corresponding HF API.

- `Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
+ Note: If a user specifies data sampling, they can expect the datasets to be mixed, and individual samples in the dataset will not be broken up unless the `max_seq_len` argument is smaller than the length of individual samples in the dataset.
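As a rough sketch, sampling-related settings could be laid out along these lines, again with indicative rather than authoritative key names (consult the predefined example configs for the exact schema):

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted  # when to stop drawing from the interleaved datasets
  seed: 42                                   # fixed seed for reproducible interleaving
datasets:
  - name: dataset_one
    sampling: 0.7                            # float sampling ratio for this dataset
    data_paths:
      - "/path/to/dataset_one.jsonl"
  - name: dataset_two
    sampling: 0.3
    data_paths:
      - "/path/to/dataset_two.jsonl"
```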
### Data Streaming

Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.

@@ -271,6 +271,8 @@ dataprocessor:

When using streaming, `split_batches` in the `TrainingArguments` will automatically be set to `True`; the main process will then fetch a full batch and slice it into `num_processes` batches, one for each process. This means that `batch_size` must be divisible by `num_processes`. This will replace the global batch size.

+ Note: Streaming datasets or use of `IterableDatasets` is not compatible with the fms-acceleration multipack plugin, because the multipack sampler has to run through the full dataset every epoch. Using multipack and streaming together will raise an error.

**When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk by chunk, data cannot run through epochs in a typical fashion, as the **Trainer** cannot know the length of the dataset as it is being passed through. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` will overwrite `num_train_epochs`, since `max_steps` directly specifies the total number of optimization steps, which is needed when the dataset length cannot be known.

If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
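For example, if a dataset is known to contain 100,000 samples and the effective batch size is 8 (illustrative numbers), a single pass over the data corresponds to `max_steps = 100000 / 8 = 12500`, and roughly two passes would correspond to `max_steps = 25000`.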
@@ -279,42 +281,4 @@ If the dataset size is known to the user, `max_steps` can be calculated as the t

We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)

- ## Offline Data preprocessing
-
- [This script](../scripts/offline_data_processing.py) provides the capability for users to perform standalone data
- preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and
- saves the train and validation datasets (in shards, if `--num_dataset_shards` is passed) in parquet format inside the specified `output_dir`.
- A data config YAML file can be used to pass configuration to this script. Example command to run this script:
File: docs/ept.md (3 additions, 3 deletions)
@@ -43,7 +43,7 @@ datasets:

And the command line passed to the library should include the following.

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192
```

Please note that for non-tokenized datasets our code adds the `EOS_TOKEN` to the lines, e.g. to the `Tweet` column, before passing them in as a dataset.
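As a purely illustrative example, if a row's `Tweet` value were `"Hello world"` and the model's tokenizer used `</s>` as its EOS token, the text handed to the trainer would effectively be `"Hello world</s>"`; the exact EOS string depends on the tokenizer in use.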
@@ -102,7 +102,7 @@ NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to spe

Here also the command line arguments would be

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192
```

The code again would add the `EOS_TOKEN` to the non-tokenized data before using it. Also note that the `dataset_text_field` is assumed to be the same across all datasets for now.
@@ -131,7 +131,7 @@ datasets:

The command-line arguments passed to the library should include the following:

```
- --data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
+ --data_config_path <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
```

Please note that when using streaming, the user must pass `max_steps` instead of `num_train_epochs`. See the advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.