
Commit 76bd76d

Merge pull request #442 from foundation-model-stack/v2.4.0-rc2
chore(release): merge set of changes for v2.4.0
2 parents: 3ec30a0 + 75a5ff6 · commit 76bd76d

17 files changed (+459, -72 lines)

.github/workflows/image.yaml

Lines changed: 1 addition & 2 deletions
```diff
@@ -15,9 +15,8 @@ jobs:
           sudo swapoff -a
           sudo rm -f /swapfile
           sudo apt clean
-          docker rmi $(docker image ls -aq)
+          if [ "$(docker image ls -q)" ]; then docker rmi $(docker image ls -aq); fi
           df -h
       - name: Build image
         run: |
           docker build -t fms-hf-tuning:dev . -f build/Dockerfile
-
```
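
For readers following the logic outside the workflow, here is a minimal Python sketch of the same guard; the `docker` CLI on the runner is an assumption and the snippet is not part of the commit itself:

```python
import subprocess

# Hedged sketch of the guarded cleanup above: list image IDs first and only
# call `docker rmi` when something is returned, so the cleanup step no longer
# fails on a runner that has no local images.
image_ids = subprocess.run(
    ["docker", "image", "ls", "-aq"], capture_output=True, text=True, check=True
).stdout.split()
if image_ids:
    subprocess.run(["docker", "rmi", *image_ids], check=True)
else:
    print("no local images to remove")
```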

README.md

Lines changed: 23 additions & 4 deletions
```diff
@@ -1,7 +1,7 @@
 # FMS HF Tuning
 
 - [Installation](#installation)
-- [Data format](#data-format)
+- [Data format support](#data-support)
 - [Supported Models](#supported-models)
 - [Training](#training)
   - [Single GPU](#single-gpu)
```

```diff
@@ -62,13 +62,13 @@ pip install fms-hf-tuning[aim]
 For more details on how to enable and use the trackers, please see [the experiment tracking section below](#experiment-tracking).
 
 ## Data Support
-Users can pass training data in a single file using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below) and the file can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.
+Users can pass training data as either a single file or a Hugging Face dataset ID using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). If users choose to pass a file, it can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.
 
 
 Below, we list the data use cases supported via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).
 
 ## Supported Data Formats
-We support the following data formats via `--training_data_path` argument
+We support the following file formats via the `--training_data_path` argument:
 
 Data Format | Tested Support
 ------------|---------------
@@ -77,6 +77,8 @@ JSONL | ✅
 PARQUET | ✅
 ARROW | ✅
 
+As noted above, we also support passing a Hugging Face dataset ID directly via the `--training_data_path` argument.
+
 ## Use cases supported with `training_data_path` argument
 
 ### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
```
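
As a rough illustration of the two accepted inputs, here is a hedged sketch using `transformers.HfArgumentParser` with a hypothetical, trimmed-down stand-in for the tuning data arguments (the real dataclass in fms-hf-tuning carries more fields, and the dataset ID below is only an example):

```python
from dataclasses import dataclass, field
from typing import Optional

import transformers


# Hypothetical, trimmed-down stand-in for the tuning data arguments, used only
# to show that --training_data_path takes either a local file or a HF dataset ID.
@dataclass
class DataArgsSketch:
    training_data_path: Optional[str] = field(default=None)
    response_template: Optional[str] = field(default=None)
    dataset_text_field: Optional[str] = field(default=None)


parser = transformers.HfArgumentParser(dataclass_types=DataArgsSketch)

# A local file in one of the supported formats (e.g. JSON/JSONL/Parquet/Arrow) ...
(file_args,) = parser.parse_args_into_dataclasses(
    ["--training_data_path", "twitter_complaints_small.json"]
)

# ... or a Hugging Face dataset ID passed through the same argument.
(hub_args,) = parser.parse_args_into_dataclasses(
    ["--training_data_path", "ought/raft"]
)
print(file_args.training_data_path, hub_args.training_data_path)
```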
```diff
@@ -198,6 +200,10 @@ For advanced data preprocessing support including mixing and custom preprocessin
 Model Name & Size | Model Architecture | Full Finetuning | Low Rank Adaptation (i.e. LoRA) | qLoRA(quantized LoRA) |
 -------------------- | ---------------- | --------------- | ------------------------------- | --------------------- |
 Granite PowerLM 3B | GraniteForCausalLM | ✅* | ✅* | ✅* |
+Granite 3.1 1B | GraniteForCausalLM | ✔️* | ✔️* | ✔️* |
+Granite 3.1 2B | GraniteForCausalLM | ✔️* | ✔️* | ✔️* |
+Granite 3.1 3B | GraniteForCausalLM | ✔️* | ✔️* | ✔️* |
+Granite 3.1 8B | GraniteForCausalLM | ✔️* | ✔️* | ✔️* |
 Granite 3.0 2B | GraniteForCausalLM | ✔️* | ✔️* | ✔️* |
 Granite 3.0 8B | GraniteForCausalLM | ✅* | ✅* | ✔️ |
 GraniteMoE 1B | GraniteMoeForCausalLM | ✅ | ✅** | ? |
@@ -217,7 +223,7 @@ Mixtral 8x7B | Mixtral | ✅ | ✅ | ✅ |
 Mistral-7b | Mistral | ✅ | ✅ | ✅ |
 Mistral large | Mistral | 🚫 | 🚫 | 🚫 |
 
-(*) - Supported with `fms-hf-tuning` v2.0.1 or later
+(*) - Supported with `fms-hf-tuning` v2.4.0 or later.
 
 (**) - Supported for q,k,v,o layers. `all-linear` target modules does not infer on vLLM yet.
 
```

````diff
@@ -742,6 +748,8 @@ The list of configurations for various `fms_acceleration` plugins:
 - [attention_and_distributed_packing](./tuning/config/acceleration_configs/attention_and_distributed_packing.py):
   - `--padding_free`: technique to process multiple examples in a single batch without adding padding tokens that waste compute.
   - `--multipack`: technique for *multi-gpu training* to balance out the number of tokens processed on each device, to minimize waiting time.
+- [fast_moe_config](./tuning/config/acceleration_configs/fast_moe.py) (experimental):
+  - `--fast_moe`: trains MoE models in parallel, increasing throughput and decreasing memory usage.
 
 Notes:
 * `quantized_lora_config` requires that it be used along with LoRA tuning technique. See [LoRA tuning section](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#lora-tuning-example) on the LoRA parameters to pass.
@@ -760,6 +768,17 @@ Notes:
 * Notes on Multipack
   - works only for *multi-gpu*.
   - currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn*.
+* Notes on Fast MoE
+  - `--fast_moe` is an integer value that configures the amount of expert parallel sharding (`ep_degree`).
+  - `world_size` must be divisible by the `ep_degree`.
+  - Running fast moe modifies the model's state dict, which must be post-processed using [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) before running inference (HF, vLLM, etc.).
+  - The typical use case for this script is to run:
+    ```
+    python -m fms_acceleration_moe.utils.checkpoint_utils \
+      <checkpoint file> \
+      <output file> \
+      <original model>
+    ```
 
 Note: To pass the above flags via a JSON config, each of the flags expects the value to be a mixed type list, so the values must be a list. For example:
 ```json
````
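
The hunk above stops at the opening of the README's JSON example, so here is a hedged Python sketch of the point being made: when these acceleration flags are supplied through a JSON config rather than the CLI, each value must be wrapped in a list, so `--fast_moe 1` becomes `"fast_moe": [1]`. The key name mirrors the CLI flag; treat the file name and the exact surrounding schema as assumptions.

```python
import json

# Sketch only: write an acceleration config where each flag's value is a list,
# as the README note above requires; e.g. `--fast_moe 1` -> "fast_moe": [1].
acceleration_args = {"fast_moe": [1]}

with open("fast_moe_config.json", "w", encoding="utf-8") as f:
    json.dump(acceleration_args, f, indent=2)
```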

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -45,7 +45,7 @@ dev = ["wheel>=0.42.0,<1.0", "packaging>=23.2,<25", "ninja>=1.11.1.1,<2.0", "sci
 flash-attn = ["flash-attn>=2.5.3,<3.0"]
 aim = ["aim>=3.19.0,<4.0"]
 mlflow = ["mlflow"]
-fms-accel = ["fms-acceleration>=0.1"]
+fms-accel = ["fms-acceleration>=0.6"]
 gptq-dev = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
 
```
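A small, hedged sketch of how one might confirm the bumped dependency floor locally before relying on the new accelerated-MoE path (the package name is taken from the `fms-accel` extra above; nothing here is part of the commit):

```python
from importlib.metadata import PackageNotFoundError, version

# The fms-accel extra now requires fms-acceleration>=0.6; report what is installed.
try:
    print("fms-acceleration:", version("fms-acceleration"))
except PackageNotFoundError:
    print("fms-acceleration is not installed; install the fms-accel extra to get >=0.6")
```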

tests/acceleration/test_acceleration_dataclasses.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -28,6 +28,7 @@
     MultiPack,
     PaddingFree,
 )
+from tuning.config.acceleration_configs.fast_moe import FastMoe, FastMoeConfig
 from tuning.config.acceleration_configs.fused_ops_and_kernels import (
     FastKernelsConfig,
     FusedLoraConfig,
@@ -88,6 +89,13 @@ def test_dataclass_parse_successfully():
     )
     assert isinstance(cfg.multipack, MultiPack)
 
+    # 5. Specifying "--fast_moe" will parse a FastMoe class
+    parser = transformers.HfArgumentParser(dataclass_types=FastMoeConfig)
+    (cfg,) = parser.parse_args_into_dataclasses(
+        ["--fast_moe", "1"],
+    )
+    assert isinstance(cfg.fast_moe, FastMoe)
+
 
 def test_two_dataclasses_parse_successfully_together():
     """Ensure that the two dataclasses can parse arguments successfully
```

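The new test only checks the parsed type. As a sketch of how the value flows through, assuming (per the README note above) that the integer passed to `--fast_moe` maps directly onto `ep_degree`:

```python
import transformers

from tuning.config.acceleration_configs.fast_moe import FastMoe, FastMoeConfig

# Sketch: parse `--fast_moe 2` the same way the test does and read back the
# expert-parallel sharding degree. The direct mapping to ep_degree is an
# assumption based on the README note, not something the test itself asserts.
parser = transformers.HfArgumentParser(dataclass_types=FastMoeConfig)
(cfg,) = parser.parse_args_into_dataclasses(["--fast_moe", "2"])
assert isinstance(cfg.fast_moe, FastMoe)
print(cfg.fast_moe.ep_degree)  # expected 2 under the assumption above
```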
tests/acceleration/test_acceleration_framework.py

Lines changed: 157 additions & 3 deletions
```diff
@@ -43,6 +43,7 @@
     MultiPack,
     PaddingFree,
 )
+from tuning.config.acceleration_configs.fast_moe import FastMoe, FastMoeConfig
 from tuning.config.acceleration_configs.fused_ops_and_kernels import (
     FastKernelsConfig,
     FusedLoraConfig,
@@ -56,7 +57,8 @@
 # for some reason the CI will raise an import error if we try to import
 # these from tests.artifacts.testdata
 TWITTER_COMPLAINTS_JSON_FORMAT = os.path.join(
-    os.path.dirname(__file__), "../artifacts/testdata/twitter_complaints_json.json"
+    os.path.dirname(__file__),
+    "../artifacts/testdata/json/twitter_complaints_small.json",
 )
 TWITTER_COMPLAINTS_TOKENIZED = os.path.join(
     os.path.dirname(__file__),
@@ -87,6 +89,10 @@
     # Third Party
     from fms_acceleration_aadp import PaddingFreeAccelerationPlugin
 
+if is_fms_accelerate_available(plugins="moe"):
+    # Third Party
+    from fms_acceleration_moe import ScatterMoEAccelerationPlugin
+
 
 # There are more extensive unit tests in the
 # https://github.com/foundation-model-stack/fms-acceleration
@@ -360,7 +366,7 @@ def test_framework_raises_due_to_invalid_arguments(
     acceleration_configs_map,
     ids=["bitsandbytes", "auto_gptq"],
 )
-def test_framework_intialized_properly_peft(
+def test_framework_initialized_properly_peft(
     quantized_lora_config, model_name_or_path, mock_and_spy
 ):
     """Ensure that specifying a properly configured acceleration dataclass
@@ -412,7 +418,7 @@
         "and foak plugins"
     ),
 )
-def test_framework_intialized_properly_foak():
+def test_framework_initialized_properly_foak():
     """Ensure that specifying a properly configured acceleration dataclass
     properly activates the framework plugin and runs the train successfully.
     """
```

```diff
@@ -477,6 +483,60 @@ def test_framework_intialized_properly_foak():
         assert spy2["get_ready_for_train_calls"] == 1
 
 
+@pytest.mark.skipif(
+    not is_fms_accelerate_available(plugins="moe"),
+    reason="Only runs if fms-accelerate is installed along with accelerated-moe plugin",
+)
+def test_framework_initialized_properly_moe():
+    """Ensure that specifying a properly configured acceleration dataclass
+    properly activates the framework plugin and runs the train successfully.
+    """
+
+    with tempfile.TemporaryDirectory() as tempdir:
+
+        model_args = copy.deepcopy(MODEL_ARGS)
+        model_args.model_name_or_path = "Isotonic/TinyMixtral-4x248M-MoE"
+        model_args.torch_dtype = torch.bfloat16
+        train_args = copy.deepcopy(TRAIN_ARGS)
+        train_args.output_dir = tempdir
+        train_args.save_strategy = "no"
+        train_args.bf16 = True
+        data_args = copy.deepcopy(DATA_ARGS)
+        data_args.training_data_path = TWITTER_COMPLAINTS_JSON_FORMAT
+        data_args.response_template = "\n\n### Label:"
+        data_args.dataset_text_field = "output"
+
+        # initialize a config
+        moe_config = FastMoeConfig(fast_moe=FastMoe(ep_degree=1))
+
+        # create mocked plugin class for spying
+        MockedPlugin1, spy = create_mock_plugin_class_and_spy(
+            "FastMoeMock", ScatterMoEAccelerationPlugin
+        )
+
+        # 1. mock a plugin class
+        # 2. register the mocked plugins
+        # 3. call sft_trainer.train
+        with build_framework_and_maybe_instantiate(
+            [
+                (["training.moe.scattermoe"], MockedPlugin1),
+            ],
+            instantiate=False,
+        ):
+            with instantiate_model_patcher():
+                sft_trainer.train(
+                    model_args,
+                    data_args,
+                    train_args,
+                    fast_moe_config=moe_config,
+                )
+
+        # spy inside the train to ensure that the moe plugin is called
+        assert spy["model_loader_calls"] == 1
+        assert spy["augmentation_calls"] == 0
+        assert spy["get_ready_for_train_calls"] == 1
+
+
 @pytest.mark.skipif(
     not is_fms_accelerate_available(plugins="aadp"),
     reason="Only runs if fms-accelerate is installed along with \
```

```diff
@@ -661,6 +721,100 @@ def test_error_raised_with_fused_lora_enabled_without_quantized_argument():
         )
 
 
+@pytest.mark.skipif(
+    not is_fms_accelerate_available(plugins="moe"),
+    reason="Only runs if fms-accelerate is installed along with accelerated-moe plugin",
+)
+def test_error_raised_with_undividable_fastmoe_argument():
+    """
+    Ensure error is thrown when `--fast_moe` is passed and world_size
+    is not divisible by ep_degree
+    """
+    with pytest.raises(
+        AssertionError, match="world size \\(1\\) not divisible by ep_size \\(3\\)"
+    ):
+        with tempfile.TemporaryDirectory() as tempdir:
+
+            model_args = copy.deepcopy(MODEL_ARGS)
+            model_args.model_name_or_path = "Isotonic/TinyMixtral-4x248M-MoE"
+            model_args.torch_dtype = torch.bfloat16
+            train_args = copy.deepcopy(TRAIN_ARGS)
+            train_args.output_dir = tempdir
+            train_args.save_strategy = "no"
+            train_args.bf16 = True
+            data_args = copy.deepcopy(DATA_ARGS)
+            data_args.training_data_path = TWITTER_COMPLAINTS_JSON_FORMAT
+            data_args.response_template = "\n\n### Label:"
+            data_args.dataset_text_field = "output"
+
+            # initialize a config
+            moe_config = FastMoeConfig(fast_moe=FastMoe(ep_degree=3))
+
+            # 1. mock a plugin class
+            # 2. register the mocked plugins
+            # 3. call sft_trainer.train
+            with build_framework_and_maybe_instantiate(
+                [
+                    (["training.moe.scattermoe"], ScatterMoEAccelerationPlugin),
+                ],
+                instantiate=False,
+            ):
+                with instantiate_model_patcher():
+                    sft_trainer.train(
+                        model_args,
+                        data_args,
+                        train_args,
+                        fast_moe_config=moe_config,
+                    )
+
+
+@pytest.mark.skipif(
+    not is_fms_accelerate_available(plugins="moe"),
+    reason="Only runs if fms-accelerate is installed along with accelerated-moe plugin",
+)
+def test_error_raised_fast_moe_with_non_moe_model():
+    """
+    Ensure error is thrown when `--fast_moe` is passed and model is not MoE
+    """
+    with pytest.raises(
+        AttributeError,
+        match="'LlamaConfig' object has no attribute 'num_local_experts'",
+    ):
+        with tempfile.TemporaryDirectory() as tempdir:
+
+            model_args = copy.deepcopy(MODEL_ARGS)
+            model_args.model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.3"
+            model_args.torch_dtype = torch.bfloat16
+            train_args = copy.deepcopy(TRAIN_ARGS)
+            train_args.output_dir = tempdir
+            train_args.save_strategy = "no"
+            train_args.bf16 = True
+            data_args = copy.deepcopy(DATA_ARGS)
+            data_args.training_data_path = TWITTER_COMPLAINTS_JSON_FORMAT
+            data_args.response_template = "\n\n### Label:"
+            data_args.dataset_text_field = "output"
+
+            # initialize a config
+            moe_config = FastMoeConfig(fast_moe=FastMoe(ep_degree=1))
+
+            # 1. mock a plugin class
+            # 2. register the mocked plugins
+            # 3. call sft_trainer.train
+            with build_framework_and_maybe_instantiate(
+                [
+                    (["training.moe.scattermoe"], ScatterMoEAccelerationPlugin),
+                ],
+                instantiate=False,
+            ):
+                with instantiate_model_patcher():
+                    sft_trainer.train(
+                        model_args,
+                        data_args,
+                        train_args,
+                        fast_moe_config=moe_config,
+                    )
+
+
 @pytest.mark.skipif(
     not is_fms_accelerate_available(plugins="foak"),
     reason="Only runs if fms-accelerate is installed along with \
```
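
As a compact summary of the two failure modes these tests pin down, here is a hedged, illustrative helper (not part of the plugin); the error messages echo the tests' `match` patterns:

```python
# Illustrative helper only: expert parallelism needs world_size % ep_degree == 0,
# and the ScatterMoE path expects an MoE config exposing num_local_experts.
def check_fast_moe_preconditions(world_size, ep_degree, num_local_experts=None):
    assert world_size % ep_degree == 0, (
        f"world size ({world_size}) not divisible by ep_size ({ep_degree})"
    )
    if num_local_experts is None:
        raise AttributeError("model config has no attribute 'num_local_experts'")


check_fast_moe_preconditions(world_size=8, ep_degree=2, num_local_experts=8)  # passes
```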
