NVIDIA-NeMo · pablo-garay · Dec 3, 2025 · Nov 16, 2025 · Nov 16, 2025 · Nov 18, 2025
diff --git a/README.md b/README.md
@@ -1,30 +1,193 @@
-# NeMo DFM: Diffusion Foundation Models collection
-
-NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment.
-
-## Projects
-
-This collection consists of 4 projects:
-1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst)
-2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md)
-3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md)
-4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md)
-
-## Citations
-
-If you find our code useful, please consider citing the following papers:
-```bibtex
-@article{patel2025training,
-  title={Training Video Foundation Models with NVIDIA NeMo},
-  author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others},
-  journal={arXiv preprint arXiv:2503.12964},
-  year={2025}
-}
-
-@article{agarwal2025cosmos,
-  title={Cosmos world foundation model platform for physical ai},
-  author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others},
-  journal={arXiv preprint arXiv:2501.03575},
-  year={2025}
-}
+<div align="center">
+
+# NeMo DFM: Diffusion Foundation Models
+
+
+<!-- We are still using Mbridge CICD NeMo. @pablo can we get our own? and the same for star gazer-->
+
+<!-- Not including codecov for now since we have not worked on it extensively-->
+
+[![CICD NeMo](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml)
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
+[![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/)
+
+[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md)
+
+</div>
+
+## Overview
+
+NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.
+
+**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:
+
+- **🌉 Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with 6D parallelism (TP, PP, CP, EP, VPP, DP)
+- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration
+
+Choose the path that best fits your workflow—or use both for different stages of development!
+
+<!-- Once we have updated images of how DFM fits into NeMo journey. Put them here. @Eliiot can help.-->
+## 🔧 Installation
+
+### 🐳 Build your own Container
+
+#### 1. Build the container
+```bash
+# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
+git submodule update --init --recursive
+
+# Build the container
+docker build -f docker/Dockerfile.ci -t dfm:dev .
 ```
+
+#### 2. Start the container
+
+```bash
+docker run --rm -it --gpus all \
+  --entrypoint bash \
+  -v "$(pwd)":/opt/DFM dfm:dev
+```
+
+
+
+### 📦 Using DFM Docker (Coming Soon)
+
+## ⚡ Quickstart
+
+### Megatron Bridge Path
+
+#### Run a Recipe
+You can find all predefined recipes under the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.
+
+> **Note:** You will have to use [uv](https://docs.astral.sh/uv/) to run the recipes. Pass `--group megatron-bridge` to `uv run`.
+
+
+<!-- @Huy please update the below command after you change defaults-->
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run --nproc_per_node=2 examples/megatron/recipes/wan/pretrain_wan.py model.qkv_format=thd --mock
+```
+
+### AutoModel Path
+
+Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:
+
+<!-- @Linnan, @Alex please add this thanks a ton-->
+```bash
+# TODO
+# Fine-tune a video diffusion model with FSDP2
+uv run torchrun --nproc-per-node=8 \
+  dfm/src/automodel/recipes/finetune.py \
+  -c examples/automodel/wan21_finetune.yaml
+
+# Pre-train a video diffusion model with FSDP2
+uv run torchrun --nproc-per-node=8 \
+  examples/automodel/pretrain/pretrain.py \
+  -c examples/automodel/pretrain/wan2_1_t2v_flow.yaml
+```
+
+## 🚀 Key Features
+
+### Dual Training Paths
+
+- **Megatron Bridge Path**
+  - State-of-the-art performance optimizations (TFLOPs)
+  - 🎯 Advanced parallelism: Tensor (TP), Context (CP), Data (DP), etc.
+  - 📈 Near-linear scalability to thousands of nodes
+  - 🔧 Production-ready recipes with optimized hyperparameters
+
+- **AutoModel Path**
+  - 🌐 PyTorch DTensor-native SPMD training
+  - 🔀 FSDP2-based Hybrid Sharding Data Parallelism (HSDP)
+  - 📦 Sequence packing for efficient training
+  - 🎨 Minimal ceremony with YAML-driven configs
+
+### Shared Capabilities
+
+- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
+- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules
+- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
+- **📊 Efficient Data Loading**: Data pipelines with sequence packing
+- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
+- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
+- **🤗 HuggingFace Integration**: Seamless integration with the HF ecosystem
+
+## Supported Models
+
+DFM provides out-of-the-box support for state-of-the-art diffusion architectures:
+
+| Model | Type | Megatron Bridge | AutoModel | Description |
+|-------|------|-----------------|-----------|-------------|
+| **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py)  | 🔜 | Diffusion Transformers with scalable architecture |
+| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation |
+
+## Performance Benchmarking
+
+For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.
+
+## Project Structure
+
+```
+DFM/
+├── dfm/
+│   └── src/
+│       ├── megatron/              # Megatron Bridge path
+│       │   ├── base/              # Base utilities for Megatron
+│       │   ├── data/              # Data loaders and task encoders
+│       │   │   ├── common/        # Shared data utilities
+│       │   │   ├── <model_name>/  # model-specific data handling
+│       │   ├── model/             # Model implementations
+│       │   │   ├── common/        # Shared model components
+│       │   │   ├── <model_name>/  # model-specific implementations
+│       │   └── recipes/           # Training recipes
+│       │       ├── <model_name>/  # model-specific training configs
+│       ├── automodel/             # AutoModel path (DTensor-native)
+│       │   ├── _diffusers/        # Diffusion pipeline integrations
+│       │   ├── datasets/          # Dataset implementations
+│       │   ├── distributed/       # Parallelization strategies
+│       │   ├── flow_matching/     # Flow matching implementations
+│       │   ├── recipes/           # Training scripts
+│       │   └── utils/             # Utilities and validation
+│       └── common/                # Shared across both paths
+│           ├── data/              # Common data utilities
+│           └── utils/             # Batch ops, video utils, etc.
+├── examples/                      # Example scripts and configs
+```
+
+## 🎯 Choosing Your Path
+
+| Feature | Megatron Bridge | AutoModel |
+|---------|-----------------|-----------|
+| **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration |
+| **Parallelism** | 6D (TP, CP, DP, etc.) | FSDP2, TP, SP, CP |
+| **HF Integration** | Via bridge/conversion | PyTorch-native DTensor |
+| **Checkpoint Format** | Megatron + HF export | SafeTensors DCP |
+| **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) |
+| **Performance** | Highest at scale | Excellent, PyTorch-native |
+
+**Recommendation**:
+- Start with **AutoModel** for quick prototyping and HF model compatibility
+- Move to **Megatron Bridge** when scaling to 100+ GPUs or when you need advanced parallelism
+- Use **both**: prototype with AutoModel, scale with Megatron Bridge!
+
+
+## 🤝 Contributing
+
+We welcome contributions! Please see our [Contributing Guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md) for details on:
+
+- Setting up your development environment
+- Code style and testing guidelines
+- Submitting pull requests
+- Reporting issues
+
+For questions or discussions, please open an issue on GitHub.
+
+## Acknowledgements
+
+NeMo DFM builds upon the excellent work of:
+
+- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - Advanced model parallelism
+- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
+- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
+- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
+- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations
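The README's Quickstart passes `model.qkv_format=thd`, and its feature list mentions sequence packing. As a rough illustration of what a packed layout means, the sketch below concatenates variable-length sequences along one token axis and records cumulative sequence lengths so the boundaries can be recovered. The function names `pack_sequences` and `unpack_sequences` are hypothetical and do not come from DFM; how DFM actually packs samples is not shown in this diff.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate variable-length sequences into one flat token list.

    Returns the flat tokens plus cumulative sequence lengths, where
    cu_seqlens[i] is the start offset of sequence i and cu_seqlens[-1]
    is the total token count. (Illustrative helper, not a DFM API.)
    """
    tokens = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0, *accumulate(len(seq) for seq in seqs)]
    return tokens, cu_seqlens


def unpack_sequences(tokens, cu_seqlens):
    """Recover the original sequences from a packed layout."""
    return [tokens[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


samples = [[11, 12, 13], [21], [31, 32]]
packed, cu_seqlens = pack_sequences(samples)
restored = unpack_sequences(packed, cu_seqlens)
print(packed, cu_seqlens)
```

Because no padding tokens are stored, the packed length equals the sum of the sample lengths, which is where the efficiency gain over a padded batch comes from.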
diff --git a/docs/performance-summary.md b/docs/performance-summary.md
@@ -0,0 +1,65 @@
+# Performance
+
+As part of the NVIDIA NeMo Framework, DFM provides the most recent techniques for training advanced generative AI models, such as model parallelization and optimized attention mechanisms, to achieve high training throughput.
+
+This page provides current performance benchmarks for diffusion models trained with DFM across different GPU systems and configurations; the numbers are updated as optimization work continues. Please refer to `examples/megatron/recipes/wan/conf` for updated YAML configurations.
+
+## Nomenclature
+
+- **GBS**: Global Batch Size
+- **MBS**: Micro Batch Size
+- **FSDP**: Fully Sharded Data Parallel
+  - FSDP = 1: use FSDP
+  - FSDP = 0: use DDP (Distributed Data Parallel)
+- **TP**: Tensor Parallel Size
+- **SP**: Sequence Parallel
+- **PP**: Pipeline Parallel Size
+- **CP**: Context Parallel Size
+- **VP**: Virtual Pipeline Parallel Size
+- **EP**: Expert Parallel Size
+
+## Performance Metrics
+
+Performance is measured using:
+- **Tokens/sec/GPU**: Throughput per GPU
+- **Model TFLOP/sec/GPU**: Model floating-point operations per second per GPU
+
+```{contents}
+:local:
+:depth: 2
+```
+
+## Performance Summary for Diffusion Models
+
+Below are performance benchmarks for diffusion models, organized by training path and GPU system.
+
+The performance data includes:
+
+- **Pre-training Performance**: Throughput metrics for various model sizes and architectures
+- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-GB300, DGX-H100)
+
+---
+
+## Megatron-Core Pre-Training Performance
+
+#### System: DGX-GB200
+
+| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
+|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
+|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|787.59|
+
+
+#### System: DGX-GB300
+
+| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
+|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
+|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|1,022.26|
+
+#### System: DGX-H100
+
+| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
+|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
+|Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77|
+
+## Automodel Pre-Training Performance
+
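The parallelism columns in the Megatron-Core tables of `docs/performance-summary.md` are related to each other. Assuming the usual Megatron Core convention that the GPU count factors as TP × PP × CP × DP, and that the number of micro-batches per step is GBS / (MBS × DP), the data-parallel size and gradient-accumulation steps can be derived from each row. The `BenchmarkRow` class below is a hypothetical helper written for this illustration; the numbers are copied from the three Wan 2.1 14B rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """One row of the pre-training table (illustrative, not a DFM type)."""

    system: str
    num_gpus: int
    gbs: int  # global batch size
    mbs: int  # micro batch size
    tp: int   # tensor parallel size
    pp: int   # pipeline parallel size
    cp: int   # context parallel size

    @property
    def dp(self) -> int:
        """Data-parallel size, assuming num_gpus = TP * PP * CP * DP."""
        model_parallel = self.tp * self.pp * self.cp
        if self.num_gpus % model_parallel:
            raise ValueError("GPU count is not divisible by TP * PP * CP")
        return self.num_gpus // model_parallel

    @property
    def grad_accum_steps(self) -> int:
        """Micro-batches per optimizer step, assuming GBS = MBS * DP * steps."""
        samples_per_pass = self.mbs * self.dp
        if self.gbs % samples_per_pass:
            raise ValueError("GBS is not divisible by MBS * DP")
        return self.gbs // samples_per_pass


ROWS = [
    BenchmarkRow("DGX-GB200", num_gpus=32, gbs=64, mbs=1, tp=1, pp=1, cp=4),
    BenchmarkRow("DGX-GB300", num_gpus=32, gbs=64, mbs=1, tp=1, pp=1, cp=2),
    BenchmarkRow("DGX-H100", num_gpus=128, gbs=128, mbs=1, tp=2, pp=1, cp=4),
]

for row in ROWS:
    print(f"{row.system}: DP={row.dp}, grad-accum steps={row.grad_accum_steps}")
```

Under these assumptions, the GB200 and H100 runs each accumulate 8 micro-batches per step, while the GB300 run uses a smaller CP and therefore a larger DP with 4 micro-batches per step.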