The ``dist_checkpointing.load_common_state_dict`` function is an entry point that allows loading only the “common” part of the checkpoints.
Most of the checkpoint config and metadata can be loaded with this method, which allows skipping data loading in order to make decisions regarding checkpoint config, version, etc.
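To illustrate the idea (this is a conceptual sketch, not the real implementation), reading only the common part amounts to deserializing one small file while ignoring all tensor data. Here ``pickle`` and a hand-rolled ``load_common_state_dict`` stand in for the actual torch-based serialization:

```python
import os
import pickle
import tempfile

# Hypothetical sketch: the "common" part of a checkpoint holds small,
# non-sharded objects (config, iteration count, etc.), so it can be read
# without touching any tensor data. Real checkpoints use torch serialization;
# pickle is only a stand-in here.
ckpt_dir = tempfile.mkdtemp()
with open(os.path.join(ckpt_dir, 'common.pt'), 'wb') as f:
    pickle.dump({'checkpoint_version': 3.0, 'iteration': 1000}, f)

def load_common_state_dict(checkpoint_dir):
    # stand-in for the dist_checkpointing.load_common_state_dict entry point
    with open(os.path.join(checkpoint_dir, 'common.pt'), 'rb') as f:
        return pickle.load(f)

# The config/metadata is available without loading any sharded tensors.
common = load_common_state_dict(ckpt_dir)
```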
dist_checkpointing.load_sharded_metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``dist_checkpointing.load_sharded_metadata`` function is an entry point that allows reading all ShardedTensors metadata from the checkpoint without loading any data.
The result is a sharded state dict with trivial sharding (every tensor is sharded into one big shard).

dist_checkpointing.load_plain_tensors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``dist_checkpointing.load_plain_tensors`` function is an entry point that allows reading sharded tensors stored in the checkpoint without any sharding (as plain tensors).
This function is simply a composition of ``load_sharded_metadata`` and ``load``.

dist_checkpointing.load_content_metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``dist_checkpointing.load_content_metadata`` function is an entry point that allows reading the content versioning metadata saved during ``save``.
See `Checkpoint versioning`_ for more details.

Save and Load Strategies
------------------------
There are multiple ways to save a sharded state dict into a serialized checkpoint. They can be provided by the user as saving and loading strategies (e.g. ``TorchDistLoadShardedStrategy`` and ``TorchDistSaveShardedStrategy`` as shown below).
In Megatron Core, the sharded state dictionary preparation is already implemented in a ``sharded_state_dict`` method which creates the sharded state dicts in a composable way.
For other applications (e.g. with simpler types of supported parallelisms) it might be possible to apply a straightforward conversion from a regular model state dict into a sharded state dict.

Checkpoint versioning
^^^^^^^^^^^^^^^^^^^^^
Megatron-Core v0.14 exposes a ``content_metadata`` flag for the ``save`` routine which allows storing metadata describing the checkpoint content (and a corresponding ``load_content_metadata`` function for loading it).
In particular, this is the intended place to store application-specific versioning information - ``dist_checkpointing`` doesn't interpret the metadata at any point.
The idea behind this feature is to provide a way to access content-identifying metadata without reading the whole checkpoint.
Since loading a distributed checkpoint requires providing valid ShardedTensors to the ``load`` routine, in some cases it can be impossible
to load the tensors from the checkpoint without using the content version to prepare the correct sharded state dict in advance.

In Megatron-LM and NeMo frameworks, the whole content metadata is passed to ``sharded_state_dict`` model and optimizer methods
and therefore affects only the logic behind ``sharded_state_dict`` creation.
The recommended versioning practice for those frameworks is to use content metadata only for ``sharded_state_dict`` behavior control,
i.e. avoid storing metadata which affects framework logic in any other way.
The content metadata should be minimalistic (to avoid bloated metadata with multiple possible configurations),
ideally flat (or with a single nesting level) and with semantically meaningful flag names (e.g. ``distrib_optim_sharding_type`` or ``non_homogeneous_layers``).
In particular, a simple integer (or SemVer) versioning flag (e.g. ``metadata['version'] = 3.4``) is discouraged,
because the metadata serves all models and optimizers and it's practically impossible to enforce linearly increasing versioning for this whole space.
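Following these recommendations, a content metadata dict could look like the sketch below. The flag names echo the examples above, but the values are made up for illustration; ``dist_checkpointing`` imposes no schema:

```python
# Hypothetical content metadata: minimalistic, flat, with semantically
# meaningful flag names (the values here are illustrative, not a fixed schema).
content_metadata = {
    'distrib_optim_sharding_type': 'fully_sharded_model_space',
    'non_homogeneous_layers': True,
}

# Discouraged: a single opaque version number covering every model/optimizer.
discouraged_metadata = {'version': 3.4}
```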

In NeMo or Megatron-LM the versioning logic (calling the ``sharded_state_dict`` method with appropriate metadata) is already implemented.
In order to introduce a new checkpoint version, two steps are required:

1. Add a new flag to the metadata which is passed to ``sharded_state_dict`` methods by the framework (e.g. ``metadata['model_X_layout_Y'] = True``).
   E.g. in NeMo the metadata is determined in the ``MegatronStrategy.sharded_state_dict_metadata`` property.

2. Handle the new flag in the appropriate ``sharded_state_dict`` method (in Megatron-Core, framework, or user code).
   **Make sure to keep the old logic in case the new flag is absent. This will ensure both the new and old checkpoints can be loaded correctly**.
   This logic must be kept until the old checkpoint version is deprecated. Similarly with metadata flag removal. For example::

      if (metadata or {}).get('model_X_layout_Y', False):
          # new behavior
      else:
          # old behavior

      if (metadata or {}).get('already_removed_flag', False):
          # old behavior (!)
      else:
          # new behavior
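This pattern can be exercised end to end. The sketch below (with the same hypothetical flag name and a made-up ``select_layout`` helper) shows why the absent-flag path must keep the old behavior: pre-versioning checkpoints carry no metadata at all.

```python
# Illustrative helper (not Megatron-Core API): selects sharded_state_dict
# behavior based on content metadata, defaulting to the old path.
def select_layout(metadata):
    # Checkpoints saved before versioning was introduced carry no metadata,
    # hence the `(metadata or {})` guard against None.
    if (metadata or {}).get('model_X_layout_Y', False):
        return 'new_layout'  # new behavior
    return 'old_layout'      # old behavior

assert select_layout(None) == 'old_layout'    # pre-versioning checkpoint
assert select_layout({}) == 'old_layout'      # flag absent
assert select_layout({'model_X_layout_Y': True}) == 'new_layout'
```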

Note: Currently the content metadata is part of the "common" checkpoint state (and in consequence resides in the ``common.pt`` file) but this is an implementation
detail and could be changed in the future. Therefore it's recommended to save/load the content metadata with the API described at the beginning of this section.

Note: currently in NeMo and Megatron-LM versioning content is stored only in global checkpoints. For local checkpoints,
it is assumed that the save and load content versions are the same and thus ``sharded_state_dict`` uses runtime metadata in both cases.
FAQs
-----------------------
To accelerate checkpoint saving, it is recommended to set ``dist_ckpt_assume_constant_structure=True``.

**9. Q: I get an error about an "invalid access pattern". What does it mean?**

A: The logs print the access pattern tensor. Its shape corresponds to the ShardedTensor sharding grid
(e.g. 3-dimensional parameter sharded by TP along the 1st axis would have the access pattern tensor of shape ``(1, TP size, 1)``).
The tensor values correspond to the number of ShardedTensors with main ``replica_id`` corresponding to that shard.
A correct ``sharded_state_dict`` definition results in an access pattern with 1s in each cell. An invalid access pattern usually
means an incorrect ShardedTensor sharding defined in the ``sharded_state_dict`` model method.
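As an illustration of how such a pattern arises (this mimics the diagnostic, it is not the actual Megatron-Core code): for a parameter sharded only along TP, each of the ``TP size`` grid cells should be claimed by exactly one main-replica ShardedTensor.

```python
# Illustrative sketch (not the Megatron-Core implementation): count how many
# main-replica ShardedTensors claim each cell of a (1, TP, 1) sharding grid;
# only the TP axis varies, so the grid collapses to a list of length tp_size.
def access_pattern(shards, tp_size):
    counts = [0] * tp_size
    for tp_rank, replica_id in shards:
        if replica_id == 0:  # only the main replica is expected to own the data
            counts[tp_rank] += 1
    return counts

# Valid sharded_state_dict: every cell claimed exactly once -> all ones.
valid = [(tp, 0) for tp in range(4)]
assert access_pattern(valid, 4) == [1, 1, 1, 1]

# Invalid: two ranks declare the same shard as main replica -> a cell holds 2.
invalid = valid + [(1, 0)]
assert access_pattern(invalid, 4) == [1, 2, 1, 1]
```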