Commit 80f41ed

Merge branch 'main' into llane/site-config-and-skeleton

2 parents 7072c2a + 1061749

40 files changed (+3982, -268 lines)

3rdparty/Megatron-Bridge

Submodule Megatron-Bridge updated 298 files

README.md

Lines changed: 174 additions & 29 deletions
@@ -1,30 +1,175 @@
**Removed (old README):**

# NeMo DFM: Diffusion Foundation Models collection

NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment.

## Projects

This collection consists of 4 projects:

1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst)
2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md)
3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md)
4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md)

## Citations

If you find our code useful, please consider citing the following papers:

```bibtex
@article{patel2025training,
  title={Training Video Foundation Models with NVIDIA NeMo},
  author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others},
  journal={arXiv preprint arXiv:2503.12964},
  year={2025}
}

@article{agarwal2025cosmos,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}
```
**Added (new README):**

<div align="center">

# NeMo DFM: Diffusion Foundation Models

<!-- We are still using Mbridge CICD NeMo. @pablo can we get our own? and the same for the stargazer badge -->

<!-- Not including codecov for now since we have not worked on it extensively -->

[![CICD NeMo](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
[![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/)

[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md)

</div>

## Overview

NeMo DFM (Diffusion Foundation Models) is a library under the [NeMo Framework](https://github.com/NVIDIA-NeMo), focused on diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.

**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:

- **🌉 Megatron Bridge Path**: Built on [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP)
- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training, enabling easy experimentation and Day-0 support for 🤗 Hugging Face models (see the sketch below)

Choose the path that best fits your workflow, or use both for different stages of development!
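To make the AutoModel bullet concrete, here is a minimal sketch of the DTensor-style SPMD sharding that path builds on. This is plain PyTorch (>= 2.4), not DFM API:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# Launch with: torchrun --nproc-per-node=8 dtensor_sketch.py
mesh = init_device_mesh("cuda", (8,))                  # 1-D mesh over 8 GPUs
torch.manual_seed(0)                                   # same global tensor on every rank
weight = torch.randn(4096, 4096)
dweight = distribute_tensor(weight, mesh, [Shard(0)])  # shard rows across the mesh
print(dweight.to_local().shape)                        # torch.Size([512, 4096]) per rank
```

Under `torchrun`, every rank runs the same script and holds one shard of the tensor; the AutoModel path applies this DTensor-native style of parallelism to whole models.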
<!-- Once we have updated images of how DFM fits into the NeMo journey, put them here. @Eliiot can help. -->

## 🔧 Installation

### 🐳 Build your own Container

#### 1. Build the container

```bash
# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
git submodule update --init --recursive

# Build the container
docker build -f docker/Dockerfile.ci -t dfm:dev .
```

#### 2. Start the container

```bash
docker run --rm -it --gpus all \
  --entrypoint bash \
  -v $(pwd):/opt/DFM dfm:dev
```
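Once inside the container, a quick GPU sanity check (a hypothetical smoke test, assuming the image ships with PyTorch; not part of the repo's documented steps):

```bash
python -c "import torch; print(torch.cuda.device_count(), 'GPUs visible')"
```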
### 📦 Using DFM Docker (Coming Soon)
## ⚡ Quickstart

### Megatron Bridge Path

#### Run a Recipe

You can find all predefined recipes in the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.

> **Note:** Run the recipes with [uv](https://docs.astral.sh/uv/), passing `--group megatron-bridge`.

```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
    examples/megatron/recipes/wan/pretrain_wan.py \
    --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
    --training-mode pretrain \
    --mock
```
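In the command above, `$num_gpus` is the number of GPUs to launch per node (for example, `num_gpus=8` on a single 8-GPU machine), and `--mock` appears to select synthetic data, so this smoke test needs no real dataset.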
### AutoModel Path

Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:

#### Run a Recipe

You can find pre-configured recipes in the [automodel/finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune) and [automodel/pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain) directories.

> Note: AutoModel examples live under `dfm/examples/automodel`. Use [uv](https://docs.astral.sh/uv/) with `--group automodel`. Configs are YAML-driven; pass `-c <path>` to override the default.

The fine-tune recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 hybrid sharding. It parallelizes the heavy transformer blocks while keeping lightweight modules (e.g., the VAE) unsharded for efficiency. Adjust batch sizes, learning rate, and parallel sizes in `dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml`. The generation script demonstrates distributed inference with AutoModel DTensor managers, producing an MP4 on rank 0; frame size, frame count, sampling steps, and CFG scale are adjustable via flags.

```bash
# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs)
uv run --group automodel torchrun --nproc-per-node=8 \
    dfm/examples/automodel/finetune/finetune.py \
    -c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml

# Generate videos with FSDP2 (distributed inference)
uv run --group automodel torchrun --nproc-per-node=8 \
    dfm/examples/automodel/generate/wan_generate.py
```
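As a rough sketch of the kinds of knobs that paragraph refers to, an override might look like the following. The key names here are hypothetical; treat the YAML file in the repo as the authoritative schema:

```yaml
# Hypothetical override sketch only; see
# dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml for the real keys.
micro_batch_size: 1      # per-GPU batch size
learning_rate: 1.0e-5    # optimizer LR for fine-tuning
dp_shard_size: 8         # FSDP2 shard degree within a replica group
dp_replicate_size: 1     # replica groups for hybrid sharding
```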
## 🚀 Key Features

### Dual Training Paths

**Megatron Bridge** delivers maximum throughput and scalability, with near-linear scaling to thousands of nodes. **AutoModel** provides an easy on-ramp for experimentation and research with PyTorch-native SPMD training.

### Shared Capabilities

- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules (a flow-matching sketch follows this list)
- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
- **📊 Efficient Data Loading**: Data pipelines with sequence packing
- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
- **🤗 HuggingFace Integration**: Seamless integration with the HF ecosystem
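Since several recipes above train with flow matching, here is a minimal sketch of what a flow-matching (rectified flow) training step computes. This is generic PyTorch for illustration, not DFM's implementation; the `model(x_t, t, cond)` signature is a hypothetical stand-in:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """x0: clean latents [B, ...]; model predicts velocity at (x_t, t)."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)          # t ~ U(0, 1)
    noise = torch.randn_like(x0)
    t_b = t.view(b, *([1] * (x0.dim() - 1)))     # broadcast t over latent dims
    x_t = (1.0 - t_b) * noise + t_b * x0         # straight-line noise-to-data path
    v_target = x0 - noise                        # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return F.mse_loss(v_pred, v_target)
```

The network regresses the constant velocity `x0 - noise` along the straight path between noise and data; sampling then integrates this velocity field from t=0 to t=1.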
## Supported Models

DFM provides out-of-the-box support for state-of-the-art diffusion architectures:

| Model | Type | Megatron Bridge | AutoModel | Description |
|-------|------|-----------------|-----------|-------------|
| **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | 🔜 | Diffusion Transformers with scalable architecture |
| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation |
## Performance Benchmarking

For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.
## Project Structure

```
DFM/
├── dfm/
│   └── src/
│       ├── megatron/              # Megatron Bridge path
│       │   ├── base/              # Base utilities for Megatron
│       │   ├── data/              # Data loaders and task encoders
│       │   │   ├── common/        # Shared data utilities
│       │   │   └── <model_name>/  # Model-specific data handling
│       │   ├── model/             # Model implementations
│       │   │   ├── common/        # Shared model components
│       │   │   └── <model_name>/  # Model-specific implementations
│       │   └── recipes/           # Training recipes
│       │       └── <model_name>/  # Model-specific training configs
│       ├── automodel/             # AutoModel path (DTensor-native)
│       │   ├── _diffusers/        # Diffusion pipeline integrations
│       │   ├── datasets/          # Dataset implementations
│       │   ├── distributed/       # Parallelization strategies
│       │   ├── flow_matching/     # Flow matching implementations
│       │   ├── recipes/           # Training scripts
│       │   └── utils/             # Utilities and validation
│       └── common/                # Shared across both paths
│           ├── data/              # Common data utilities
│           └── utils/             # Batch ops, video utils, etc.
├── examples/                      # Example scripts and configs
```
## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md) for details on:

- Setting up your development environment
- Code style and testing guidelines
- Submitting pull requests
- Reporting issues

For questions or discussions, please open an issue on GitHub.
## Acknowledgements

NeMo DFM builds upon the excellent work of:

- [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) - Advanced model parallelism
- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations

dfm/src/common/utils/save_video.py

Lines changed: 0 additions & 1 deletion

```diff
@@ -44,5 +44,4 @@ def save_video(
         "output_params": ["-f", "mp4"],
     }
 
-    print("video_save_path", video_save_path)
    imageio.mimsave(video_save_path, grid, "mp4", **kwargs)
```

dfm/src/megatron/data/common/diffusion_energon_datamodule.py

Lines changed: 5 additions & 1 deletion

```diff
@@ -55,7 +55,11 @@ def __post_init__(self):
         self.sequence_length = self.dataset.seq_length
 
     def build_datasets(self, context: DatasetBuildContext):
-        return self.dataset.train_dataloader(), self.dataset.val_dataloader(), self.dataset.test_dataloader()
+        return (
+            iter(self.dataset.train_dataloader()),
+            iter(self.dataset.val_dataloader()),
+            iter(self.dataset.val_dataloader()),
+        )
 
 
 class DiffusionDataModule(EnergonMultiModalDataModule):
```
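After this change, `build_datasets` hands back plain iterators, so a consumer can call `next()` directly. A hypothetical usage (names illustrative, not repo code):

```python
# Hypothetical consumer of the new return type.
train_iter, val_iter, test_iter = provider.build_datasets(context)
batch = next(train_iter)  # iterators, not DataLoader objects, after this commit
```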

dfm/src/megatron/data/common/diffusion_sample.py

Lines changed: 17 additions & 7 deletions

```diff
@@ -80,25 +80,35 @@ def to_dict(self) -> dict:
     def __add__(self, other: Any) -> int:
         """Adds the sequence length of this sample with another sample or integer."""
         if isinstance(other, DiffusionSample):
-            # Combine the values of the two instances
-            return self.seq_len_q.item() + other.seq_len_q.item()
+            # Use padded length if available (for CP), otherwise use unpadded
+            self_len = self.seq_len_q_padded.item() if self.seq_len_q_padded is not None else self.seq_len_q.item()
+            other_len = other.seq_len_q_padded.item() if other.seq_len_q_padded is not None else other.seq_len_q.item()
+            return self_len + other_len
         elif isinstance(other, int):
-            # Add an integer to the value
-            return self.seq_len_q.item() + other
+            # Use padded length if available (for CP), otherwise use unpadded
+            self_len = self.seq_len_q_padded.item() if self.seq_len_q_padded is not None else self.seq_len_q.item()
+            return self_len + other
         raise NotImplementedError
 
     def __radd__(self, other: Any) -> int:
         """Handles reverse addition for summing with integers."""
         # This is called if sum or other operations start with a non-DiffusionSample object.
         # e.g., sum([DiffusionSample(1), DiffusionSample(2)]) -> the 0 + DiffusionSample(1) calls __radd__.
         if isinstance(other, int):
-            return self.seq_len_q.item() + other
+            # Use padded length if available (for CP), otherwise use unpadded
+            self_len = self.seq_len_q_padded.item() if self.seq_len_q_padded is not None else self.seq_len_q.item()
+            return self_len + other
         raise NotImplementedError
 
     def __lt__(self, other: Any) -> bool:
         """Compares this sample's sequence length with another sample or integer."""
         if isinstance(other, DiffusionSample):
-            return self.seq_len_q.item() < other.seq_len_q.item()
+            # Use padded length if available (for CP), otherwise use unpadded
+            self_len = self.seq_len_q_padded.item() if self.seq_len_q_padded is not None else self.seq_len_q.item()
+            other_len = other.seq_len_q_padded.item() if other.seq_len_q_padded is not None else other.seq_len_q.item()
+            return self_len < other_len
         elif isinstance(other, int):
-            return self.seq_len_q.item() < other
+            # Use padded length if available (for CP), otherwise use unpadded
+            self_len = self.seq_len_q_padded.item() if self.seq_len_q_padded is not None else self.seq_len_q.item()
+            return self_len < other
         raise NotImplementedError
```
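These dunder methods let the sequence packer sum and sort samples as if they were integers. With context parallelism (CP), each sample is padded so its length splits evenly across CP ranks, and that padded length is what actually occupies a pack, which is why the methods now prefer `seq_len_q_padded`. A toy illustration (the padding rule below is hypothetical, not DFM's exact one):

```python
# Toy example: budgeting a pack by padded length, not raw length.
def pad_for_cp(seq_len: int, cp_size: int = 2) -> int:
    multiple = 2 * cp_size                      # assumed CP divisibility rule
    return ((seq_len + multiple - 1) // multiple) * multiple

lengths = [1000, 900, 150]
padded = [pad_for_cp(n) for n in lengths]       # [1000, 900, 152]
print(sum(lengths), sum(padded))                # 2050 vs 2052 tokens actually consumed
```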

dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -56,7 +56,7 @@ def __init__(
         self,
         *args,
         max_frames: int = None,
-        text_embedding_padding_size: int = 512,
+        text_embedding_max_length: int = 512,
         seq_length: int = None,
         patch_spatial: int = 2,
         patch_temporal: int = 1,
@@ -65,7 +65,7 @@ def __init__(
     ):
         super().__init__(*args, **kwargs)
         self.max_frames = max_frames
-        self.text_embedding_padding_size = text_embedding_padding_size
+        self.text_embedding_max_length = text_embedding_max_length
         self.seq_length = seq_length
         self.patch_spatial = patch_spatial
         self.patch_temporal = patch_temporal
```
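This is a pure rename, so call sites passing the old keyword will fail with a `TypeError`. A hypothetical migration (other constructor arguments omitted):

```python
# Before: DiffusionTaskEncoderWithSequencePacking(text_embedding_padding_size=512, ...)
# After:
encoder = DiffusionTaskEncoderWithSequencePacking(text_embedding_max_length=512)
```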

dfm/src/megatron/data/common/sequence_packing_utils.py

Lines changed: 0 additions & 32 deletions

```diff
@@ -71,35 +71,3 @@ def first_fit_decreasing(seqlens: List[int], pack_size: int) -> List[List[int]]:
     """
     sorted_seqlens = sorted(seqlens, reverse=True)
     return first_fit(sorted_seqlens, pack_size)
-
-
-def concat_pad(tensor_list, max_seq_length):
-    """
-    Efficiently concatenates a list of tensors along the first dimension and pads with zeros
-    to reach max_seq_length.
-
-    Args:
-        tensor_list (list of torch.Tensor): List of tensors to concatenate and pad.
-        max_seq_length (int): The desired size of the first dimension of the output tensor.
-
-    Returns:
-        torch.Tensor: A tensor of shape [max_seq_length, ...], where ... represents the remaining dimensions.
-    """
-    import torch
-
-    # Get common properties from the first tensor
-    other_shape = tensor_list[0].shape[1:]
-    dtype = tensor_list[0].dtype
-    device = tensor_list[0].device
-
-    # Initialize the result tensor with zeros
-    result = torch.zeros((max_seq_length, *other_shape), dtype=dtype, device=device)
-
-    current_index = 0
-    for tensor in tensor_list:
-        length = tensor.shape[0]
-        # Directly assign the tensor to the result tensor without checks
-        result[current_index : current_index + length] = tensor
-        current_index += length
-
-    return result
```
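`concat_pad` is gone, but the bin-packing helpers above it remain. A quick usage sketch of `first_fit_decreasing` (the function name and semantics come from the hunk above; the import path is illustrative):

```python
# first_fit_decreasing sorts lengths in descending order, then first-fit
# places each one into the first bin with room, opening new bins as needed.
from sequence_packing_utils import first_fit_decreasing  # illustrative import

seqlens = [900, 700, 512, 300, 256]
bins = first_fit_decreasing(seqlens, pack_size=1024)
print(bins)  # [[900], [700, 300], [512, 256]]; each bin sums to <= 1024
```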

dfm/src/megatron/data/dit/dit_mock_datamodule.py

Lines changed: 9 additions & 4 deletions

```diff
@@ -113,7 +113,7 @@ def mock_batch(
         seq_len_kv=seq_len_kv_packed,
         seq_len_kv_padded=seq_len_kv_padded_packed,
         latent_shape=torch.tensor([[C, T, H, W] for _ in range(number_packed_samples)], dtype=torch.int32),
-        pos_ids=pos_ids_packed,
+        pos_ids=pos_ids_packed.unsqueeze(0),
         video_metadata=[{"caption": f"Mock video sample {i}"} for i in range(number_packed_samples)],
     )
 
@@ -131,16 +131,19 @@ class DiTMockDataModuleConfig(DatasetProvider):
     dataloader_type: str = "external"
     task_encoder_seq_length: int = None
     F_latents: int = 1
-    H_latents: int = 64
-    W_latents: int = 96
+    H_latents: int = 256
+    W_latents: int = 512
     patch_spatial: int = 2
     patch_temporal: int = 1
-    number_packed_samples: int = 3
+    number_packed_samples: int = 1
     context_seq_len: int = 512
     context_embeddings_dim: int = 1024
 
     def __post_init__(self):
         mock_ds = _MockDataset(length=1024)
+        kwargs = {}
+        if self.num_workers > 0:
+            kwargs["prefetch_factor"] = 8
         self._train_dl = DataLoader(
             mock_ds,
             batch_size=self.micro_batch_size,
@@ -157,6 +160,8 @@ def __post_init__(self):
             ),
             shuffle=False,
             drop_last=False,
+            pin_memory=True,
+            **kwargs,
         )
         self._train_dl = iter(self._train_dl)
         self.sequence_length = self.seq_length
```
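The `kwargs` guard around `prefetch_factor` is needed because `torch.utils.data.DataLoader` only accepts `prefetch_factor` when worker processes exist:

```python
from torch.utils.data import DataLoader

try:
    DataLoader([], num_workers=0, prefetch_factor=8)
except ValueError as e:
    print("rejected:", e)  # prefetch_factor requires num_workers > 0

DataLoader([], num_workers=2, prefetch_factor=8)  # OK: 8 batches prefetched per worker
```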

dfm/src/megatron/data/dit/dit_taskencoder.py

Lines changed: 4 additions & 5 deletions

```diff
@@ -31,9 +31,9 @@ class DiTTaskEncoder(DiffusionTaskEncoderWithSequencePacking):
     Attributes:
         cookers (list): A list of Cooker objects used for processing.
         max_frames (int, optional): The maximum number of frames to consider from the video. Defaults to None.
-        text_embedding_padding_size (int): The padding size for text embeddings. Defaults to 512.
+        text_embedding_max_length (int): The maximum length for text embeddings. Defaults to 512.
     Methods:
-        __init__(*args, max_frames=None, text_embedding_padding_size=512, **kwargs):
+        __init__(*args, max_frames=None, text_embedding_max_size=512, **kwargs):
             Initializes the BasicDiffusionTaskEncoder with optional maximum frames and text embedding padding size.
         encode_sample(sample: dict) -> dict:
             Encodes a given sample dictionary containing video and text data.
@@ -71,7 +71,6 @@ def encode_sample(self, sample: dict) -> DiffusionSample:
             // self.patch_spatial**2
             // self.patch_temporal
         )
-        is_image = T == 1
 
         if seq_len > self.seq_length:
             print(f"Skipping sample {sample['__key__']} because seq_len {seq_len} > self.seq_length {self.seq_length}")
@@ -100,8 +99,8 @@ def encode_sample(self, sample: dict) -> DiffusionSample:
         t5_text_embeddings = torch.from_numpy(sample["pickle"]).to(torch.bfloat16)
         t5_text_embeddings_seq_length = t5_text_embeddings.shape[0]
 
-        if t5_text_embeddings_seq_length > self.text_embedding_padding_size:
-            t5_text_embeddings = t5_text_embeddings[: self.text_embedding_padding_size]
+        if t5_text_embeddings_seq_length > self.text_embedding_max_length:
+            t5_text_embeddings = t5_text_embeddings[: self.text_embedding_max_length]
         t5_text_mask = torch.ones(t5_text_embeddings_seq_length, dtype=torch.bfloat16)
 
         pos_ids = rearrange(
```
