
Commit 1d6a42b

more cleanup
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
1 parent 97a0bb4 commit 1d6a42b

15 files changed: +156 -327 lines

plugins/accelerated-moe/README.md

Lines changed: 19 additions & 2 deletions
@@ -16,7 +16,25 @@ Plugin | Description | Depends | Loading | Augmentation | Callbacks
 Our `ScatterMoe` implementation is a module-swap; to add new models we need to update the specifications in [scattermoe_constants.py](./src/fms_acceleration_moe/utils/scattermoe_constants.py).
 - See the code documentation within to understand how to add new models.
 
-### Code Extracted from Megablocks
+### Conversion of ScatterMoE
+
+`ScatterMoE` checkpoints are saved using `torch.distributed.checkpoint` (DCP), which by default uses `StateDictType.SHARDED_STATE_DICT`:
+- `DTensors` have limited support for full state dicts,
+- sharded state dicts are extremely efficient and require little communication overhead when saving.
+
+We provide a script to recover the original checkpoint:
+- currently the script can only be used if DCP saved a single `pytorch_model_fsdp_0` folder,
+- say the checkpoint is stored in `hf/checkpoint-10`, then call:
+
+```
+python -m fms_acceleration_moe.utils.checkpoint_utils \
+    hf/checkpoint-10/pytorch_model_fsdp_0 \
+    output_dir mistralai/Mixtral-8x7B-Instruct-v0.1
+```
+
+
+
+## Code Extracted from Megablocks
 
 Notes on code extraction:
 - we have only extracted two `autograd` functions [GatherOp](https://github.com/databricks/megablocks/blob/main/megablocks/ops/gather.py) and [ScatterOp](https://github.com/databricks/megablocks/blob/main/megablocks/ops/scatter.py),
@@ -71,6 +89,5 @@ These are currently some known issues not yet resolved:
 - The design currently does a swap for the mixture-of-expert module with [ScatterMoE](./src/fms_acceleration_moe/utils/scattermoe.py). This affects the `state_dict` of the model, so any saved checkpoint may need to be converted back to original.
 - should eventually remove the dependency on an external `kernel-hyperdrive` repository.
 - now supports only loading *sharded* `safetensor` non-GGUF MoE checkpoints. This is a reasonable assumption since MoE checkpoints are typically above the size limit that prevents them being saved into a single checkpoint file.
-- currently only supports `StateDictType.SHARDED_STATE_DICT` because the implementation uses `DTensors` which have limited support for full state dicts. However for efficiency considerations, sharded state dicts are the most efficient.
 
 
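Not part of this commit: a minimal sketch of how the file written by the command above could be loaded back into the original Mixtral architecture. The output path (`output_dir`) is assumed to be the single file the script writes with `torch.save`, and `strict=False` is used defensively; both are assumptions for illustration.

```python
# sketch only: load the converted ScatterMoE checkpoint into the original model
import torch
from transformers import AutoModelForCausalLM

# the original architecture that the checkpoint is being restored for
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# "output_dir" is the second positional argument passed to checkpoint_utils above;
# the script saves the remapped state dict there with torch.save
state_dict = torch.load("output_dir", map_location="cpu")

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```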

plugins/accelerated-moe/configs/scattermoe.yaml

Lines changed: 8 additions & 34 deletions
@@ -6,37 +6,11 @@ training:
   # expert-parallel for MoE
   scattermoe:
 
-    # TODO: should we even get rid of this?
-    # The name of the mixture-of-experts class
-    # moe_component_class: MixtralSparseMoeBlock
-    # moe_component_class: GraniteMoeMoE
-
-    # The module name of the router in moe_component_class above
-    # moe_gate_module_name: gate
-
-    # The module name of the experts in moe_component_class above
-    # moe_experts_module_name: experts
-
-    # the mlp version
-    # - for those with only up and down projs, use "v1"
-    # - for those with only up, down and gate projs, use "v2"
-    # moe_mlp_impl: v2
-
-    # if True, then we shard experts across data parallel dimension
-    # - only feasible if world_size divides the number of experts
-    # shard_along_dp: true
-
-    # to be specified only if shard_along_dp == False. This will influence
-    # the level of sharding, which indicates how many experts per device
-    # - the number of experts per device will be num_experts / ep_size
-    # - we disable the ability to set ep_size=1 since this means no sharding
-    # - NOTE: ep_size=1 does not mean shard_along_dp=True, which would otherwise
-    #   be contradictory since ep_size suggests no expert parallel.
-    ep_degree: 1
-
-    # the MoE dropless implementation. Currently we only support "dropless_sparse", but
-    # in the future we may support others
-    # moe_implementation: dropless_sparse
-
-    # for load_balancing_loss
-    # load_balancing_loss: false
+    # The level of expert parallel sharding.
+    # - 1 means no sharding
+    # - if > 1, please ensure that this divides the world_size. This is because
+    #   the devices will be replicated for every ep_degree devices, and
+    #   the experts will be sharded within each group.
+    # - if > 1, also ensure that it divides the number of experts, as each device
+    #   will then have num_of_experts / ep_degree experts.
+    ep_degree: 1
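To make the new `ep_degree` comments concrete, a small sketch with hypothetical values (not from the repository) of the divisibility rules they describe:

```python
# hypothetical values illustrating the ep_degree constraints documented above
world_size = 8      # total number of devices in the job
num_experts = 8     # e.g. a Mixtral-style MoE layer with 8 experts
ep_degree = 4       # experts are sharded within groups of ep_degree devices

assert world_size % ep_degree == 0, "ep_degree must divide world_size"
assert num_experts % ep_degree == 0, "ep_degree must divide the number of experts"

experts_per_device = num_experts // ep_degree   # -> 2 experts per device
replica_groups = world_size // ep_degree        # -> 2 groups of 4 devices
print(experts_per_device, replica_groups)
```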

plugins/accelerated-moe/src/fms_acceleration_moe/framework_plugin_scattermoe.py

Lines changed: 9 additions & 4 deletions
@@ -31,10 +31,14 @@
 # pylint: disable=too-many-instance-attributes
 class ScatterMoEAccelerationPlugin(AccelerationPlugin):
 
-    # NOTE: its not packaged properly so, "importlib.util.find_spec('khd')"
-    # returns but "importlib.metadata.version('kernel-hyperdrive') is needed"
-    # require_packages = {"khd"}
-    # NOTE: will address this later if we remove the dependency on kernel-hyperdrive
+    # NOTE: we cannot do
+    # - require_packages = {"khd"}
+    # this is because the khd fork is not properly packaged as a PyPI project, and so
+    # - "importlib.util.find_spec('khd')" returns, but
+    # - "importlib.metadata.version('kernel-hyperdrive')" fails
+    # if we decide to extract the kernels, we will not need this anymore; see
+    # https://github.com/foundation-model-stack/fms-acceleration/issues/105
+
     restricted_model_archs = ["GraniteMoeForCausalLM", "MixtralForCausalLM"]
 
     def __init__(self, configurations: Dict[str, Dict]):
@@ -75,6 +79,7 @@ def model_loader(self, model_name: str, **kwargs):
         # NOTE: there is currently no good way to get the mixed precision
         # flag from train_args. It will be better to handle this if
         # when we move the sharding to augmentation.
+        # https://github.com/foundation-model-stack/fms-acceleration/issues/103
 
         return model
 
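To illustrate the packaging note in the new comment (the scenario is an assumption: a source checkout of `khd` is importable, but no `kernel-hyperdrive` distribution metadata is installed):

```python
# sketch: why a metadata-based require_packages check fails for the khd fork
import importlib.metadata
import importlib.util

# find_spec only needs the module to be importable (e.g. a source checkout on
# PYTHONPATH), so it can succeed even without installed distribution metadata
print("importable:", importlib.util.find_spec("khd") is not None)

# metadata.version needs an installed distribution; without one it raises
try:
    print(importlib.metadata.version("kernel-hyperdrive"))
except importlib.metadata.PackageNotFoundError:
    print("no distribution metadata for kernel-hyperdrive")
```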

plugins/accelerated-moe/src/fms_acceleration_moe/utils/__init__.py

Lines changed: 4 additions & 3 deletions
@@ -18,10 +18,11 @@
 
 # this is a special patch function to disable foreach for
 # dtensors, which has been introduced since torch 2.4.
-# The reason is because this will cause problems in the optimizer
-# lerp.
-
+# The reason is that this causes problems in the optimizer, e.g.:
+# RuntimeError: aten._foreach_mul_.Scalar: got mixed torch.Tensor and DTensor,
+# need to convert all torch.Tensor to DTensor before calling distributed operators!
 
+# - this function patches torch
 def patch_torch_optim_foreach_to_not_apply_to_dtensors():
     # guarded.
     # this is an array of supported types, we will remove
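For illustration only (this is not the actual patch): the fused `_foreach_*` optimizer kernels expect a homogeneous list of plain tensors, which is why mixing `torch.Tensor` and `DTensor` parameters triggers the RuntimeError quoted above and why the foreach path must be disabled for DTensors.

```python
import torch

def safe_for_foreach(tensors):
    # hand-written check for illustration: a parameter list that mixes plain
    # torch.Tensor with tensor subclasses (such as DTensor) cannot go down the
    # fused foreach path and must fall back to per-tensor updates
    return all(type(t) is torch.Tensor for t in tensors)

params = [torch.zeros(2), torch.ones(2)]
print(safe_for_foreach(params))  # True: plain tensors only
```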

plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py

Lines changed: 55 additions & 21 deletions
@@ -46,13 +46,19 @@
 # - variable to capture the model variable
 #   in the save/load model calls
 MODEL_INDEX = None
+KEY_MODEL = 'model'
+KEY_OPTIMIZER = 'optimizer'
 
-# Below are rewrite of functions to be able to handle dtensors
-
+# Below are rewrites of the HF FSDP model saving functions, to handle the fact
+# that the parameters are now a mixture of regular tensors and DTensors.
+# - these functions are found in accelerate.utils.fsdp_utils.py
+# - save_fsdp_model, save_fsdp_optimizer, load_fsdp_model, load_fsdp_optimizer
+# NOTE: we will observe warnings such as
+#   /torch/distributed/checkpoint/state_dict.py:520:
+#   FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
 
 # rewrite of func from accelerate.utils.fsdp_utils.py
-# - empty function, as main logic is in the optimizer call
-#   save_fsdp_optimizer (see below).
+# - empty function, the main logic will be in save_fsdp_optimizer (see below).
 def save_fsdp_model(
     fsdp_plugin, accelerator, model, output_dir, model_index=0, adapter_only=False
 ):
@@ -62,7 +68,7 @@ def save_fsdp_model(
 
 
 # rewrite of func from accelerate.utils.fsdp_utils.py
-# - saves both model and optimizer
+# - saves both model and optimizer at the same time
 def save_fsdp_optimizer(
     fsdp_plugin, accelerator, optimizer, model, output_dir, optimizer_index=0
 ):
@@ -80,7 +86,7 @@ def save_fsdp_optimizer(
     os.makedirs(ckpt_model, exist_ok=True)
     logger.info(f"Saving model to {ckpt_model}")
     dcp.save(
-        state_dict={"model": model_state_dict},
+        state_dict={KEY_MODEL: model_state_dict},
         storage_writer=dcp.FileSystemWriter(ckpt_model),
         planner=DefaultSavePlanner(),
     )
@@ -91,16 +97,15 @@
     os.makedirs(ckpt_opt, exist_ok=True)
     logger.info(f"Saving Optimizer state to {ckpt_opt}")
     dcp.save(
-        state_dict={"optimizer": optimizer_state_dict},
+        state_dict={KEY_OPTIMIZER: optimizer_state_dict},
         storage_writer=dcp.FileSystemWriter(ckpt_opt),
         planner=DefaultSavePlanner(),
     )
     logger.info(f"Optimizer state saved in {ckpt_opt}")
 
 
 # rewrite of func from accelerate.utils.fsdp_utils.py
-# - empty function, as main logic is in the optimizer call
-#   load_fsdp_optimizer (see below).
+# - empty function, main logic in load_fsdp_optimizer (see below).
 def load_fsdp_model(
     fsdp_plugin, accelerator, model, input_dir, model_index=0, adapter_only=False
 ):
@@ -133,15 +138,15 @@ def load_fsdp_optimizer(
     # - load the model state dict
     ckpt_model = os.path.join(input_dir, f"{FSDP_MODEL_NAME}_{MODEL_INDEX}")
     dcp.load(
-        state_dict={"model": model_state_dict},
+        state_dict={KEY_MODEL: model_state_dict},
         storage_reader=dcp.FileSystemReader(ckpt_model),
         planner=DefaultLoadPlanner(),
     )
 
     # - load the optimizer state dict
     ckpt_opt = os.path.join(input_dir, f"{OPTIMIZER_NAME}_{optimizer_index}")
     dcp.load(
-        state_dict={"optimizer": optimizer_state_dict},
+        state_dict={KEY_OPTIMIZER: optimizer_state_dict},
         storage_reader=dcp.FileSystemReader(ckpt_opt),
         planner=DefaultLoadPlanner(),
     )
@@ -154,10 +159,15 @@
         optim_state_dict=optimizer_state_dict,
     )
 
-    # HACK for now
-    # - if seems that if params is empty, then the loading has someo
-    #   problems
-    # - so for now, we just dump some random defaults
+    # FIXME:
+    # - We see errors that occur in optimizer.step()
+    #   - torch/optim/optimizer.py", line 89, in _use_grad
+    #   - torch/optim/adamw.py", line 214, in step beta1, beta2 = cast(Tuple[float, float], group["betas"])
+    #   - KeyError: 'betas'
+    # - Fortunately, this seems to be limited to the empty-group case, where
+    #   the params are simply not initialized. Since we assume these groups are
+    #   never used, we initialize the empty groups with arbitrary defaults so
+    #   the error is not thrown.
     for group in optimizer.param_groups:
         if len(group["params"]) == 0:
             group["betas"] = (0.9, 0.999)
@@ -182,8 +192,8 @@ def patch_huggingface_save_and_load_for_dtensors():
     patch_target_module("transformers.trainer.load_fsdp_optimizer", load_fsdp_optimizer)
 
 
-# trick to get the resolved cache file to acccess the safetensor
-# NOTE: this does not work if _dict_from_json_file, like GGUF files
+# this function implements a trick to get the resolved cache file to access the safetensor
+# - NOTE: does not work if _dict_from_json_file is not called, such as in the case of GGUF files.
 def get_resolved_checkpoint_location(model_name_or_path: str):
 
     result = None
@@ -201,14 +211,17 @@ def _dict_from_json_file(resolved_config_file):
     return os.path.dirname(result)
 
 
-def restore_scattermoe_checkpoint_to_orig(
+# function to get the ScatterMoE state dict from its DCP checkpoint
+# - if the original pretrained_model_name_or_path is specified, we will use its checkpoint
+#   as hints to map the ScatterMoE checkpoint back to that of the original model. This is
+#   useful so that the restored checkpoint can be loaded by the original architecture.
+def get_scattermoe_state_dict(
     dcp_checkpoint_dir: str,
     pretrained_model_name_or_path: str = None,
-    dcp_outer_key: str = "model",
 ):
     """
     Parameters:
-        dcp_checkpoint_dir (str): the dcp to be converted.
+        dcp_checkpoint_dir (str): the DCP to be converted.
         pretrained_model_name_or_path (str): Optional, if provided we will
             use the hints to remap the
     """
@@ -230,7 +243,7 @@ def restore_scattermoe_checkpoint_to_orig(
         planner=_EmptyStateDictLoadPlanner(),
         no_dist=True,
     )
-    sd = sd[dcp_outer_key]
+    sd = sd[KEY_MODEL]
 
     # if not provided
     if pretrained_model_name_or_path is None:
@@ -401,6 +414,16 @@ def _infer_prefixes_and_module_names(
         )
     )
 
+    parser.add_argument(
+        "dcp_checkpoint_dir",
+        help="Path to the distributed checkpoint.",
+    )
+
+    parser.add_argument(
+        "output_dir",
+        help="Path to the location to write the converted checkpoint."
+    )
+
     parser.add_argument(
         "pretrained_model_name_or_path",
         help=(
@@ -409,3 +432,14 @@
             "checkpoint is obtained)."
         ),
     )
+
+    args = parser.parse_args()
+
+    # get the converted statedict
+    sd = get_scattermoe_state_dict(
+        args.dcp_checkpoint_dir,
+        args.pretrained_model_name_or_path
+    )
+
+    # save it
+    torch.save(sd, args.output_dir)
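Presumably the same conversion can also be driven programmatically; a minimal sketch mirroring the `__main__` block above (the paths reuse the README example, and the output filename is an assumption):

```python
# sketch: programmatic equivalent of the CLI entrypoint added above
import torch
from fms_acceleration_moe.utils.checkpoint_utils import get_scattermoe_state_dict

# dcp_checkpoint_dir: the DCP folder written during training (README example path)
# pretrained_model_name_or_path: provides hints to remap keys to the original arch
sd = get_scattermoe_state_dict(
    "hf/checkpoint-10/pytorch_model_fsdp_0",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# the CLI simply torch.save's the remapped state dict to the given output path
torch.save(sd, "mixtral_restored.bin")  # hypothetical output filename
```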

plugins/accelerated-moe/src/fms_acceleration_moe/utils/scattermoe.py

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@
     ) from e
 
 # Local
-from .scattermoe_constants import SCATTERMOE_SPEC_HAS_GATE_WEIGHT
+from .scattermoe_constants import SCATTERMOE_SPEC_HAS_GATE
 from .scattermoe_utils import all_to_all_gather_inputs, scatter_with_routing_weights
 
 
@@ -306,7 +306,7 @@ def __init__(
             device=device,
             lora_config=lora_config,
         )
-        if mlp_arch == SCATTERMOE_SPEC_HAS_GATE_WEIGHT:
+        if mlp_arch == SCATTERMOE_SPEC_HAS_GATE:
             self.w3 = ScatteredExperts(
                 in_features=self.hidden_size,
                 out_features=self.intermediate_size,

plugins/accelerated-moe/src/fms_acceleration_moe/utils/scattermoe_constants.py

Lines changed: 3 additions & 5 deletions
@@ -29,7 +29,7 @@
 # Currently our ScatterMoE drop supports an up/down proj, and
 # an optional gate_proj.
 # - When new architectures are supported this list will update
-SCATTERMOE_SPEC_HAS_GATE_WEIGHT = "has_gate_proj"
+SCATTERMOE_SPEC_HAS_GATE = "Gated"
 
 # - moe_cls
 # - router_name
@@ -66,21 +66,19 @@
         "MixtralSparseMoeBlock",
         "gate",
         "experts",
-        SCATTERMOE_SPEC_HAS_GATE_WEIGHT,
+        SCATTERMOE_SPEC_HAS_GATE,
         True,
     ),
     "GraniteMoeForCausalLM": (
         "GraniteMoeMoE",
         "router",
         "input_linear|output_linear|input_linear",
-        SCATTERMOE_SPEC_HAS_GATE_WEIGHT,
+        SCATTERMOE_SPEC_HAS_GATE,
         False,
     ),
 }
 
 # helper function to get the spec based on architectures
-
-
 def get_scattermoe_conv_spec_from_archs(architectures: List[str]):
     # infer the spec
     for archs, spec in SCATTERMOE_CONVERSION_SPEC.items():
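A brief sketch, based only on what is visible in this hunk, of how the conversion spec might be looked up by architecture; the exact contents of the returned tuple beyond the fields shown above are an assumption.

```python
from fms_acceleration_moe.utils.scattermoe_constants import (
    SCATTERMOE_SPEC_HAS_GATE,
    get_scattermoe_conv_spec_from_archs,
)

# look up the spec for a Mixtral checkpoint; per the table above this should
# carry the MoE block class name, the router/experts module names, and the
# "Gated" marker that selects the gated (w3) ScatteredExperts layout
spec = get_scattermoe_conv_spec_from_archs(["MixtralForCausalLM"])
print(spec)
print(SCATTERMOE_SPEC_HAS_GATE in spec)  # expected True for the Mixtral entry
```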

sample-configurations/moe-scattermoe-granite-ep1-padding-free-sample-configuration.yaml

Lines changed: 7 additions & 32 deletions
@@ -18,36 +18,11 @@ plugins:
   # expert-parallel for MoE
   scattermoe:
 
-    # TODO: should we even get rid of this?
-    # The name of the mixture-of-experts class
-    # moe_component_class: MixtralSparseMoeBlock
-    # moe_component_class: GraniteMoeMoE
-
-    # The module name of the router in moe_component_class above
-    # moe_gate_module_name: gate
-
-    # The module name of the experts in moe_component_class above
-    # moe_experts_module_name: experts
-
-    # the mlp version
-    # - for those with only up and down projs, use "v1"
-    # - for those with only up, down and gate projs, use "v2"
-    # moe_mlp_impl: v2
-    # if True, then we shard experts across data parallel dimension
-    # - only feasible if world_size divides the number of experts
-    # shard_along_dp: true
-
-    # to be specified only if shard_along_dp == False. This will influence
-    # the level of sharding, which indicates how many experts per device
-    # - the number of experts per device will be num_experts / ep_size
-    # - we disable the ability to set ep_size=1 since this means no sharding
-    # - NOTE: ep_size=1 does not mean shard_along_dp=True, which would otherwise
-    #   be contradictory since ep_size suggests no expert parallel.
+    # The level of expert parallel sharding.
+    # - 1 means no sharding
+    # - if > 1, please ensure that this divides the world_size. This is because
+    #   the devices will be replicated for every ep_degree devices, and
+    #   the experts will be sharded within each group.
+    # - if > 1, also ensure that it divides the number of experts, as each device
+    #   will then have num_of_experts / ep_degree experts.
     ep_degree: 1
-
-    # the MoE dropless implementation. Currently we only support "dropless_sparse", but
-    # in the future we may support others
-    # moe_implementation: dropless_sparse
-
-    # for load_balancing_loss
-    # load_balancing_loss: false
