Merged

31 commits:
- 3266077: Initial README commit (abhinavg4, Nov 16, 2025)
- 9add867: Update README and add performance summary documentation (abhinavg4, Nov 16, 2025)
- 79f9d26: add DiT megatron links. (sajadn, Nov 18, 2025)
- b96cf8f: Performance Docs update (parthmannan, Nov 19, 2025)
- 2b00158: Performance Docs update fix (parthmannan, Nov 19, 2025)
- 8e471a0: Update README to enhance clarity and accuracy (abhinavg4, Nov 20, 2025)
- 6f92b01: Enhance README with detailed performance optimizations and parallelis… (abhinavg4, Nov 20, 2025)
- 2233811: Update perf doc (parthmannan, Nov 20, 2025)
- 60fae1d: Merge branch 'readme_init' of github.com:NVIDIA-NeMo/DFM into readme_… (parthmannan, Nov 20, 2025)
- 88ddbf1: update (linnanwang, Nov 21, 2025)
- 2aaae5e: Update README with fine-tuning command (linnanwang, Nov 21, 2025)
- 9abba18: Apply suggestion from @akoumpa (akoumpa, Nov 21, 2025)
- 22c6790: Apply suggestion from @akoumpa (akoumpa, Nov 21, 2025)
- 49c8a24: Apply suggestion from @akoumpa (akoumpa, Nov 21, 2025)
- 10433b3: Update README, Wan-related. (huvunvidia, Nov 21, 2025)
- 901174e: Apply suggestion from @akoumpa (akoumpa, Nov 21, 2025)
- 03560c7: Fixing typo @akoumpa (akoumpa, Nov 21, 2025)
- ca6d9cf: fix automodel section (akoumpa, Nov 21, 2025)
- 4b38e3d: fix (akoumpa, Nov 21, 2025)
- b628d48: update DFM-specific readme (pablo-garay, Nov 24, 2025)
- 48d65a6: Update performance-summary.md (akoumpa, Nov 26, 2025)
- fec3b40: Update performance-summary.md (akoumpa, Nov 26, 2025)
- df982be: Update performance-summary.md (akoumpa, Nov 26, 2025)
- 796103e: Update README.md (abhinavg4, Dec 1, 2025)
- 9ea6116: Update README.md (abhinavg4, Dec 1, 2025)
- ebf00bf: Update README.md (abhinavg4, Dec 1, 2025)
- 7083f86: Update README.md (abhinavg4, Dec 1, 2025)
- f6f3a30: Refactor README.md and performance-summary.md for clarity and concise… (abhinavg4, Dec 1, 2025)
- 31e7def: Merge branch 'main' into readme_init (abhinavg4, Dec 1, 2025)
- f86c51e: Fix typo in README.md: changed "Built" to "Build" in the container se… (abhinavg4, Dec 1, 2025)
- 8640f3f: Merge branch 'main' into readme_init (abhinavg4, Dec 3, 2025)
203 changes: 174 additions & 29 deletions README.md
@@ -1,30 +1,175 @@
# NeMo DFM: Diffusion Foundation Models collection

NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment.

## Projects

This collection consists of 4 projects:
1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst)
2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md)
3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md)
4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md)

## Citations

If you find our code useful, please consider citing the following papers:
```bibtex
@article{patel2025training,
title={Training Video Foundation Models with NVIDIA NeMo},
author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others},
journal={arXiv preprint arXiv:2503.12964},
year={2025}
}

@article{agarwal2025cosmos,
title={Cosmos World Foundation Model Platform for Physical AI},
author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others},
journal={arXiv preprint arXiv:2501.03575},
year={2025}
}
```
<div align="center">

# NeMo DFM: Diffusion Foundation Models


<!-- We are still using Mbridge CICD NeMo. @pablo can we get our own? And the same for the stargazer badge. -->

<!-- Not including codecov for now since we have not worked on it extensively. -->

[![CICD NeMo](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
[![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/)

[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md)

</div>

## Overview

NeMo DFM (Diffusion Foundation Models) is a library under [NeMo Framework](https://github.com/NVIDIA-NeMo), focusing on diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.

**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:

- **🌉 Megatron Bridge Path**: Built on [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP).
- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training, enabling easy experimentation and Day-0 support for 🤗 Hugging Face models.

Choose the path that best fits your workflow—or use both for different stages of development!

<!-- Once we have updated images of how DFM fits into the NeMo journey, put them here. @Eliiot can help. -->
## 🔧 Installation

### 🐳 Build your own Container

#### 1. Build the container
```bash
# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
git submodule update --init --recursive

# Build the container
docker build -f docker/Dockerfile.ci -t dfm:dev .
```
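
If the build fails on missing sources, it is worth verifying that the submodules actually checked out; this is standard git, nothing DFM-specific:

```bash
# Each line should show a commit hash; a leading '-' marks an uninitialized submodule
git submodule status --recursive
```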

#### 2. Start the container

```bash
docker run --rm -it --gpus all \
  --entrypoint bash \
  -v $(pwd):/opt/DFM dfm:dev
```
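
Once inside the container, a quick sanity check that PyTorch sees your GPUs (this assumes the image ships with PyTorch, which the CI Dockerfile is expected to provide):

```bash
# Inside the container: confirm CUDA is available and devices are visible
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```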



### 📦 Using DFM Docker (Coming Soon)

## ⚡ Quickstart

### Megatron Bridge Path

#### Run a Recipe
You can find all predefined recipes under the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.

> **Note:** Use [uv](https://docs.astral.sh/uv/) to run the recipes, passing `--group megatron-bridge`.

```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
examples/megatron/recipes/wan/pretrain_wan.py \
--config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
--training-mode pretrain \
--mock
```
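
Scaling the same recipe beyond one node uses standard `torchrun` rendezvous flags; below is a minimal two-node sketch under that assumption (`MASTER_ADDR` and the port are placeholders for your cluster, and the command must be launched once per node with the matching `--node-rank`):

```bash
# Node 0 (repeat on node 1 with --node-rank 1); MASTER_ADDR is a placeholder
uv run --group megatron-bridge python -m torch.distributed.run \
  --nnodes 2 --node-rank 0 \
  --master-addr $MASTER_ADDR --master-port 29500 \
  --nproc-per-node 8 \
  examples/megatron/recipes/wan/pretrain_wan.py \
  --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
  --training-mode pretrain \
  --mock
```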

### AutoModel Path

Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:

#### Run a Recipe

You can find pre-configured recipes under the [automodel/finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune) and [automodel/pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain) directories.

> Note: AutoModel examples live under `dfm/examples/automodel`. Use [uv](https://docs.astral.sh/uv/) with `--group automodel`. Configs are YAML-driven; pass `-c <path>` to override the default.

The fine-tuning recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 hybrid sharding.
It parallelizes the heavy transformer blocks while keeping lightweight modules (e.g., the VAE) unsharded for efficiency.
Adjust batch sizes, learning rate, and parallel sizes in `dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml`.
The generation script demonstrates distributed inference with AutoModel DTensor managers, producing an MP4 on rank 0; frame size, frame count, sampling steps, and CFG scale are adjustable via flags.

```bash
# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs)
uv run --group automodel torchrun --nproc-per-node=8 \
dfm/examples/automodel/finetune/finetune.py \
-c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml

# Generate videos with FSDP2 (distributed inference)
uv run --group automodel torchrun --nproc-per-node=8 \
dfm/examples/automodel/generate/wan_generate.py
```
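
To change hyperparameters without editing the tracked recipe config, one convenient pattern is to copy the YAML and point `-c` at your copy (`my_wan_ft.yaml` below is a hypothetical file name):

```bash
# Copy the default config, then edit batch sizes, LR, or parallel sizes in the copy
cp dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml my_wan_ft.yaml

# Launch fine-tuning against the modified config (my_wan_ft.yaml is hypothetical)
uv run --group automodel torchrun --nproc-per-node=8 \
  dfm/examples/automodel/finetune/finetune.py \
  -c my_wan_ft.yaml
```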

## 🚀 Key Features

### Dual Training Paths

**Megatron Bridge** delivers maximum throughput and scalability with near-linear performance to thousands of nodes. **AutoModel** provides an easy on-ramp for experimentation and research with PyTorch-native SPMD training.

### Shared Capabilities

- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules
- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
- **📊 Efficient Data Loading**: Data pipelines with sequence packing
- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
- **🤗 HuggingFace Integration**: Seamless integration with the HF ecosystem
> **Review comment (Contributor):** Why is HF integration a shared capability?
>
> **Reply (Contributor Author):** Both can integrate with HF: we have the bridge in Megatron-Bridge, and AutoModel uses PyTorch anyway. That was the intention behind this. Would you like us to remove/rework it?


## Supported Models

DFM provides out-of-the-box support for state-of-the-art diffusion architectures:

| Model | Type | Megatron Bridge | AutoModel | Description |
|-------|------|-----------------|-----------|-------------|
| **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | 🔜 | Diffusion Transformers with scalable architecture |
| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation |

## Performance Benchmarking

For detailed performance benchmarks including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.

## Project Structure

```
DFM/
├── dfm/
│   └── src/
│       ├── megatron/             # Megatron Bridge path
│       │   ├── base/             # Base utilities for Megatron
│       │   ├── data/             # Data loaders and task encoders
│       │   │   ├── common/       # Shared data utilities
│       │   │   └── <model_name>/ # Model-specific data handling
│       │   ├── model/            # Model implementations
│       │   │   ├── common/       # Shared model components
│       │   │   └── <model_name>/ # Model-specific implementations
│       │   └── recipes/          # Training recipes
│       │       └── <model_name>/ # Model-specific training configs
│       ├── automodel/            # AutoModel path (DTensor-native)
│       │   ├── _diffusers/       # Diffusion pipeline integrations
│       │   ├── datasets/         # Dataset implementations
│       │   ├── distributed/      # Parallelization strategies
│       │   ├── flow_matching/    # Flow matching implementations
│       │   ├── recipes/          # Training scripts
│       │   └── utils/            # Utilities and validation
│       └── common/               # Shared across both paths
│           ├── data/             # Common data utilities
│           └── utils/            # Batch ops, video utils, etc.
└── examples/                     # Example scripts and configs
```

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md) for details on:

- Setting up your development environment
- Code style and testing guidelines
- Submitting pull requests
- Reporting issues

For questions or discussions, please open an issue on GitHub.

## Acknowledgements

NeMo DFM builds upon the excellent work of:

- [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) - Advanced model parallelism
- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations
75 changes: 75 additions & 0 deletions docs/performance-summary.md
@@ -0,0 +1,75 @@
# Performance

As part of the NVIDIA NeMo Framework, DFM provides the latest techniques for training advanced generative AI models, such as model parallelization and optimized attention mechanisms, to achieve high training throughput.

This page provides current performance benchmarks for models trained with DFM across different GPU systems and configurations, and will be updated as we continue to optimize performance. Refer to `examples/megatron/recipes/wan/conf` for up-to-date YAML configurations.

## Nomenclature

- **GBS**: Global Batch Size
- **MBS**: Micro Batch Size
- **FSDP**: Fully Sharded Data Parallel
- FSDP = 1: use FSDP
- FSDP = 0: use DDP (Distributed Data Parallel)
- **TP**: Tensor Parallel Size
- **SP**: Sequence Parallel
- **PP**: Pipeline Parallel Size
- **CP**: Context Parallel Size
- **VP**: Virtual Pipeline Parallel Size
- **EP**: Expert Parallel Size
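
These sizes must multiply out, together with the data-parallel size, to the total GPU count. For example, the DGX-H100 Megatron-Core row below uses TP=2, PP=1, CP=4 on 128 GPUs, which implies DP=16; a quick shell-arithmetic sanity check:

```bash
# World size must equal TP * PP * CP * DP (EP/VP omitted for non-MoE, non-VPP runs)
TP=2; PP=1; CP=4; DP=16
echo $(( TP * PP * CP * DP ))   # prints 128, matching the 128-GPU DGX-H100 run
```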

## Performance Metrics

Performance is measured using:
- **Tokens/sec/GPU**: Throughput per GPU
- **Model TFLOP/sec/GPU**: Model teraFLOPs executed per second per GPU

```{contents}
:local:
:depth: 2
```

## Performance Summary for Models

Below are performance benchmarks for various models using the DFM framework.

The performance data includes:

- **Pre-training Performance**: Throughput metrics for various model sizes and architectures
- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-GB300, DGX-H100)

---

## Megatron-Core Pre-Training Performance

#### System: DGX-GB200

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|787.59|


#### System: DGX-GB300

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|1,022.26|

#### System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
|Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77|


## NeMo Automodel Pre-Training Performance

The following table summarizes performance with the NeMo Automodel backend.

#### System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | DP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|----|-------------------------|
|Wan 2.1 14B|8|8|1|37440|1|8|1|1|1|1|0|0|175.88|
|Wan 2.1 14B|64|64|1|37440|1|64|1|1|1|1|0|0|228.85|