Commit ecc5c73

abhinavg4, sajadn, parthmannan, linnanwang, akoumpa
authored and committed
Initial README commit (#53)
* Initial README commit
* Update README and add performance summary documentation
  - Corrected the link in the README for the performance summary to point to the correct file.
  - Introduced a new `performance-summary.md` document detailing performance benchmarks for large language models using DFM, including nomenclature, performance metrics, and system configurations.
* add DiT megatron links.
  Signed-off-by: sajadn
* Performance Docs update
  Signed-off-by: Parth Mannan
* Performance Docs update fix
  Signed-off-by: Parth Mannan
* Update README to enhance clarity and accuracy
  - Removed redundant description of the framework.
  - Clarified the relationship between Megatron Bridge and Megatron Core in the Dual-Path Architecture section.
* Enhance README with detailed performance optimizations and parallelism descriptions
  - Updated the Megatron Bridge Path section to include 6D parallelism details.
  - Added state-of-the-art performance optimizations to the Dual Training Paths section.
  - Clarified parallelism terminology in the comparison table for better understanding.
* Update perf doc
  Signed-off-by: Parth Mannan
* update
  Signed-off-by: linnan wang
* Update README with fine-tuning command
  Removed TODO comment and added a command for fine-tuning a video diffusion model.
* Apply suggestion from @akoumpa
* Apply suggestion from @akoumpa
* Apply suggestion from @akoumpa
* Update README, Wan-related.
  Updated command syntax and improved clarity in README.
* Apply suggestion from @akoumpa
* Fixing typo @akoumpa
* fix automodel section
  Signed-off-by: Alexandros Koumparoulis
* fix
  Signed-off-by: Alexandros Koumparoulis
* update DFM-specific readme
  Signed-off-by: Pablo Garay
* Update performance-summary.md
  Thanks a lot @linnanwang for the bench numbers.
* Update performance-summary.md
* Update performance-summary.md
* Update README.md
  Co-authored-by: Wenwen Gao
* Update README.md
  Co-authored-by: Wenwen Gao
* Update README.md
  Co-authored-by: Wenwen Gao
* Update README.md
  Co-authored-by: Wenwen Gao
* Refactor README.md and performance-summary.md for clarity and conciseness
  - Simplified descriptions of Megatron Bridge and AutoModel paths in README.md.
  - Removed outdated comparison table to streamline content.
  - Updated performance-summary.md to generalize model references and improve clarity.
  Co-authored-by: Wenwen Gao
* Fix typo in README.md: changed "Built" to "Build" in the container section header for consistency.

---------

Signed-off-by: sajadn
Signed-off-by: Parth Mannan
Signed-off-by: linnan wang
Signed-off-by: Alexandros Koumparoulis
Signed-off-by: Pablo Garay
Co-authored-by: sajadn
Co-authored-by: Parth Mannan
Co-authored-by: linnan wang
Co-authored-by: Alexandros Koumparoulis
Co-authored-by: Huy Vu
Co-authored-by: Alexandros Koumparoulis
Co-authored-by: Pablo Garay
Co-authored-by: Wenwen Gao
Signed-off-by: Lawrence Lane
1 parent a469d3f commit ecc5c73

2 files changed: +249 -29 lines changed


README.md

Lines changed: 174 additions & 29 deletions
@@ -1,30 +1,175 @@
1-
# NeMo DFM: Diffusion Foundation Models collection
2-
3-
NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment.
4-
5-
## Projects
6-
7-
This collection consists of 4 projects:
8-
1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst)
9-
2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md)
10-
3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md)
11-
4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md)
12-
13-
## Citations
14-
15-
If you find our code useful, please consider citing the following papers:
16-
```bibtex
17-
@article{patel2025training,
18-
title={Training Video Foundation Models with NVIDIA NeMo},
19-
author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others},
20-
journal={arXiv preprint arXiv:2503.12964},
21-
year={2025}
22-
}
23-
24-
@article{agarwal2025cosmos,
25-
title={Cosmos world foundation model platform for physical ai},
26-
author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others},
27-
journal={arXiv preprint arXiv:2501.03575},
28-
year={2025}
29-
}
1+
<div align="center">
2+
3+
# NeMo DFM: Diffusion Foundation Models
4+
5+
6+
<!-- We are still using Mbridge CICD NeMo. @pablo can we get our own? and the same for star gazer-->
7+
8+
<!-- Not including codecov for now since we have not worked on it extensively-->
9+
10+
[![CICD NeMo](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml)
11+
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
12+
[![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/)
13+
14+
[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md)
15+
16+
</div>
17+
18+
## Overview
19+
20+
NeMo DFM (Diffusion Foundation Models) is a library under [NeMo Framework](https://github.com/NVIDIA-NeMo), focusing on diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.
21+
22+
**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:
23+
24+
- **🌉 Megatron Bridge Path**: Built on [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP)
25+
- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training, offering easy experimentation and Day-0 support for 🤗 Hugging Face models.
26+
27+
Choose the path that best fits your workflow—or use both for different stages of development!
28+
29+
<!-- Once we have updated images of how DFM fits into NeMo journey. Put them here. @Eliiot can help.-->
30+
## 🔧 Installation
31+
32+
### 🐳 Build your own Container
33+
34+
#### 1. Build the container
35+
```bash
36+
# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
37+
git submodule update --init --recursive
38+
39+
# Build the container
40+
docker build -f docker/Dockerfile.ci -t dfm:dev .
41+
```
42+
43+
#### 2. Start the container
44+
45+
```bash
46+
docker run --rm -it --gpus all \
47+
--entrypoint bash \
48+
-v "$(pwd)":/opt/DFM dfm:dev
49+
```
50+
51+
52+
53+
### 📦 Using DFM Docker (Coming Soon)
54+
55+
## ⚡ Quickstart
56+
57+
### Megatron Bridge Path
58+
59+
#### Run a Recipe
60+
You can find all predefined recipes under the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.
61+
62+
> **Note:** Recipes must be run with [uv](https://docs.astral.sh/uv/). Pass `--group megatron-bridge` to select the Megatron Bridge dependency group.
63+
64+
```bash
65+
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
66+
examples/megatron/recipes/wan/pretrain_wan.py \
67+
--config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
68+
--training-mode pretrain \
69+
--mock
70+
```
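
The `$num_gpus` variable in the command above is not set anywhere in this README, so the caller has to supply it. Below is a minimal Python sketch of a wrapper that fills it in. The helper name `build_pretrain_command` is hypothetical and not part of DFM, and it uses only the flags that appear in the command above.

```python
import shlex


def build_pretrain_command(num_gpus: int, mock: bool = True) -> list[str]:
    """Assemble the Wan pretraining launch command shown above as an argv list.

    Hypothetical helper for illustration; it is not part of DFM.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be a positive integer")
    cmd = [
        "uv", "run", "--group", "megatron-bridge",
        "python", "-m", "torch.distributed.run",
        "--nproc-per-node", str(num_gpus),
        "examples/megatron/recipes/wan/pretrain_wan.py",
        "--config-file", "examples/megatron/recipes/wan/conf/wan_1_3B.yaml",
        "--training-mode", "pretrain",
    ]
    if mock:
        cmd.append("--mock")  # same flag as in the README command
    return cmd


if __name__ == "__main__":
    # Print the command for an 8-GPU node so it can be inspected before running.
    print(shlex.join(build_pretrain_command(num_gpus=8)))
```

To launch the job, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root. Building an argv list rather than a shell string avoids quoting problems.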
71+
72+
### AutoModel Path
73+
74+
Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:
75+
76+
#### Run a Recipe
77+
78+
You can find pre-configured recipes under the [automodel/finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune) and [automodel/pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain) directories.
79+
80+
> Note: AutoModel examples live under `dfm/examples/automodel`. Use [uv](https://docs.astral.sh/uv/) with `--group automodel`. Configs are YAML-driven; pass `-c <path>` to override the default.
81+
82+
The fine-tune recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 Hybrid Sharding.
83+
It parallelizes heavy transformer blocks while keeping lightweight modules (e.g., VAE) unsharded for efficiency.
84+
Adjust batch sizes, LR, and parallel sizes in `dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml`.
85+
The generation script demonstrates distributed inference with AutoModel DTensor managers, producing an MP4 on rank 0. You can adjust the frame size, number of frames, sampling steps, and CFG through command-line flags.
86+
87+
```bash
88+
# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs)
89+
uv run --group automodel torchrun --nproc-per-node=8 \
90+
dfm/examples/automodel/finetune/finetune.py \
91+
-c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml
92+
93+
# Generate videos with FSDP2 (distributed inference)
94+
uv run --group automodel torchrun --nproc-per-node=8 \
95+
dfm/examples/automodel/generate/wan_generate.py
96+
```
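
FSDP2 Hybrid Sharding, mentioned above, shards parameters within a group of ranks and replicates that sharded copy across groups. The sketch below illustrates only that rank layout, in plain Python. The function names are hypothetical and the row-major layout is an assumption; the mesh DFM actually builds is determined by the recipe and its YAML config.

```python
def hybrid_shard_layout(world_size: int, shard_size: int) -> dict[int, tuple[int, int]]:
    """Map each global rank to (replica_index, shard_index) on a 2D mesh.

    Assumes a row-major mesh of shape (world_size // shard_size, shard_size).
    """
    if shard_size < 1 or world_size % shard_size != 0:
        raise ValueError("world_size must be a multiple of shard_size")
    return {rank: divmod(rank, shard_size) for rank in range(world_size)}


def shard_group(rank: int, world_size: int, shard_size: int) -> list[int]:
    """Ranks that together hold one sharded copy of the parameters."""
    layout = hybrid_shard_layout(world_size, shard_size)
    replica = layout[rank][0]
    return [r for r, (rep, _) in layout.items() if rep == replica]


def replicate_group(rank: int, world_size: int, shard_size: int) -> list[int]:
    """Ranks that hold the same shard in different replicas."""
    layout = hybrid_shard_layout(world_size, shard_size)
    shard = layout[rank][1]
    return [r for r, (_, s) in layout.items() if s == shard]


if __name__ == "__main__":
    # Two 8-GPU nodes: shard within a node, replicate across nodes.
    print(shard_group(10, world_size=16, shard_size=8))
    print(replicate_group(10, world_size=16, shard_size=8))
```

With sharding kept inside a node, the heavier parameter gather and gradient scatter traffic stays on the intra-node interconnect, and only the gradient all-reduce between replicas crosses nodes.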
97+
98+
## 🚀 Key Features
99+
100+
### Dual Training Paths
101+
102+
**Megatron Bridge** delivers maximum throughput and scalability, with near-linear scaling to thousands of nodes. **AutoModel** provides an easy on-ramp for experimentation and research with PyTorch-native SPMD training.
103+
104+
### Shared Capabilities
105+
106+
- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
107+
- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules
108+
- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
109+
- **📊 Efficient Data Loading**: Data pipelines with sequence packing
110+
- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
111+
- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
112+
- **🤗 HuggingFace Integration**: Seamless integration with the HF ecosystem
113+
114+
## Supported Models
115+
116+
DFM provides out-of-the-box support for state-of-the-art diffusion architectures:
117+
118+
| Model | Type | Megatron Bridge | AutoModel | Description |
119+
|-------|------|-----------------|-----------|-------------|
120+
| **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | 🔜 | Diffusion Transformers with scalable architecture |
121+
| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation |
122+
123+
## Performance Benchmarking
124+
125+
For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.
126+
127+
## Project Structure
128+
129+
```
130+
DFM/
131+
├── dfm/
132+
│   └── src/
133+
│       ├── megatron/              # Megatron Bridge path
134+
│       │   ├── base/              # Base utilities for Megatron
135+
│       │   ├── data/              # Data loaders and task encoders
136+
│       │   │   ├── common/        # Shared data utilities
137+
│       │   │   ├── <model_name>/  # model-specific data handling
138+
│       │   ├── model/             # Model implementations
139+
│       │   │   ├── common/        # Shared model components
140+
│       │   │   ├── <model_name>/  # model-specific implementations
141+
│       │   └── recipes/           # Training recipes
142+
│       │       ├── <model_name>/  # model-specific training configs
143+
│       ├── automodel              # AutoModel path (DTensor-native)
144+
│       │   ├── _diffusers/        # Diffusion pipeline integrations
145+
│       │   ├── datasets/          # Dataset implementations
146+
│       │   ├── distributed/       # Parallelization strategies
147+
│       │   ├── flow_matching/     # Flow matching implementations
148+
│       │   ├── recipes/           # Training scripts
149+
│       │   └── utils/             # Utilities and validation
150+
│       └── common/                # Shared across both paths
151+
│           ├── data/              # Common data utilities
152+
│           └── utils/             # Batch ops, video utils, etc.
153+
├── examples/ # Example scripts and configs
154+
```
155+
156+
## 🤝 Contributing
157+
158+
We welcome contributions! Please see our [Contributing Guide](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md) for details on:
159+
160+
- Setting up your development environment
161+
- Code style and testing guidelines
162+
- Submitting pull requests
163+
- Reporting issues
164+
165+
For questions or discussions, please open an issue on GitHub.
166+
167+
## Acknowledgements
168+
169+
NeMo DFM builds upon the excellent work of:
170+
171+
- [Megatron-core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) - Advanced model parallelism
172+
- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
173+
- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
174+
- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
175+
- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations

docs/performance-summary.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
1+
# Performance
2+
3+
As part of the NVIDIA NeMo Framework, DFM provides recent techniques for training advanced generative AI models, such as model parallelism and optimized attention mechanisms, to achieve high training throughput.
4+
5+
This page reports current performance benchmarks for models trained with DFM across different GPU systems and configurations. The numbers will change as we continue to optimize. Please refer to `examples/megatron/recipes/wan/conf` for up-to-date YAML configurations.
6+
7+
## Nomenclature
8+
9+
- **GBS**: Global Batch Size
10+
- **MBS**: Micro Batch Size
11+
- **FSDP**: Fully Sharded Data Parallel
12+
  - FSDP = 1: use FSDP
13+
  - FSDP = 0: use DDP (Distributed Data Parallel)
14+
- **TP**: Tensor Parallel Size
15+
- **SP**: Sequence Parallel
16+
- **PP**: Pipeline Parallel Size
17+
- **CP**: Context Parallel Size
18+
- **VP**: Virtual Pipeline Parallel Size
19+
- **EP**: Expert Parallel Size
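
These sizes are related to one another. Under the usual Megatron-Core convention (stated here as an assumption, not taken from the DFM source), the data-parallel size is the GPU count divided by TP × PP × CP, and GBS = MBS × DP × gradient-accumulation steps. The sketch below applies that convention to the Megatron-Core rows reported on this page; the function names are illustrative only.

```python
def data_parallel_size(num_gpus: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """DP size, assuming DP = num_gpus / (TP * PP * CP)."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    return num_gpus // model_parallel


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches per optimizer step, assuming GBS = MBS * DP * steps."""
    if gbs % (mbs * dp) != 0:
        raise ValueError("GBS must be divisible by MBS * DP")
    return gbs // (mbs * dp)


if __name__ == "__main__":
    # (system, num_gpus, GBS, MBS, TP, PP, CP) copied from the tables on this page.
    rows = [
        ("DGX-GB200", 32, 64, 1, 1, 1, 4),
        ("DGX-GB300", 32, 64, 1, 1, 1, 2),
        ("DGX-H100", 128, 128, 1, 2, 1, 4),
    ]
    for system, gpus, gbs, mbs, tp, pp, cp in rows:
        dp = data_parallel_size(gpus, tp=tp, pp=pp, cp=cp)
        print(system, "DP =", dp, "grad accum =", grad_accum_steps(gbs, mbs, dp))
```

If the convention holds, every row should divide evenly. A `ValueError` here would suggest that the configuration uses a different decomposition.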
20+
21+
## Performance Metrics
22+
23+
Performance is measured using:
24+
- **Tokens/sec/GPU**: Throughput per GPU
25+
- **Model TFLOP/sec/GPU**: Model floating-point operations per second per GPU
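
Both metrics can be derived from the measured time per training step. The sketch below assumes that "tokens" means sequence positions, so one step processes GBS × sequence length tokens, and that the model FLOPs per step are known from a separate FLOP count. The step time and FLOP count in the example are hypothetical placeholders, not measurements.

```python
def tokens_per_sec_per_gpu(gbs: int, seq_len: int, step_time_s: float, num_gpus: int) -> float:
    """Throughput per GPU, assuming one step processes gbs * seq_len tokens."""
    if step_time_s <= 0 or num_gpus < 1:
        raise ValueError("step_time_s must be > 0 and num_gpus >= 1")
    return gbs * seq_len / (step_time_s * num_gpus)


def model_tflops_per_sec_per_gpu(model_flops_per_step: float, step_time_s: float, num_gpus: int) -> float:
    """Model TFLOP/sec/GPU from a per-step model FLOP count."""
    if step_time_s <= 0 or num_gpus < 1:
        raise ValueError("step_time_s must be > 0 and num_gpus >= 1")
    return model_flops_per_step / (step_time_s * num_gpus) / 1e12


if __name__ == "__main__":
    # GBS, sequence length, and GPU count match a table row on this page;
    # the 10-second step time and the FLOP count are hypothetical.
    print(tokens_per_sec_per_gpu(gbs=64, seq_len=37440, step_time_s=10.0, num_gpus=32))
    print(model_tflops_per_sec_per_gpu(3.2e15, step_time_s=10.0, num_gpus=32))
```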
26+
27+
```{contents}
28+
:local:
29+
:depth: 2
30+
```
31+
32+
## Performance Summary for Models
33+
34+
Below are performance benchmarks for various models using the DFM framework.
35+
36+
The performance data includes:
37+
38+
- **Pre-training Performance**: Throughput metrics for various model sizes and architectures
39+
- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-GB300, DGX-H100)
40+
41+
---
42+
43+
## Megatron-Core Pre-Training Performance
44+
45+
#### System: DGX-GB200
46+
47+
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
48+
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
49+
|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|787.59|
50+
51+
52+
#### System: DGX-GB300
53+
54+
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
55+
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
56+
|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|1022.26|
57+
58+
#### System: DGX-H100
59+
60+
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
61+
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------|
62+
|Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77|
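
For a quick side-by-side reading of the three Megatron-Core tables above, the per-GPU figures can be expressed relative to one system. The sketch below does that arithmetic on the reported numbers. The rows differ in GPU count, GBS, TP, and CP, so the ratios give a rough indication, not a controlled hardware comparison.

```python
# Model TFLOP/sec/GPU for Wan 2.1 14B, copied from the tables above.
MEGATRON_CORE_TFLOPS = {
    "DGX-GB200": 787.59,
    "DGX-GB300": 1022.26,
    "DGX-H100": 325.77,
}


def relative_throughput(results: dict[str, float], baseline: str) -> dict[str, float]:
    """Per-GPU throughput of each system divided by the baseline system's."""
    base = results[baseline]
    return {system: tflops / base for system, tflops in results.items()}


if __name__ == "__main__":
    for system, ratio in relative_throughput(MEGATRON_CORE_TFLOPS, "DGX-H100").items():
        print(f"{system}: {ratio:.2f}x relative to DGX-H100")
```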
63+
64+
65+
## NeMo Automodel Pre-Training Performance
66+
The following table summarizes the performance leveraging the NeMo Automodel backend.
67+
68+
#### System: DGX-H100
69+
70+
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | DP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU |
71+
|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|----|-------------------------|
72+
|Wan 2.1 14B|8|8|1|37440|1|8|1|1|1|1|0|0|175.88|
73+
|Wan 2.1 14B|64|64|1|37440|1|64|1|1|1|1|0|0|228.85|
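
The two AutoModel rows above can be read as a weak-scaling pair, since GBS grows with the GPU count (8 to 64). The sketch below computes the aggregate throughput and the per-GPU ratio from the reported numbers only; the helper names are illustrative.

```python
# Model TFLOP/sec/GPU on DGX-H100, keyed by GPU count, copied from the table above.
AUTOMODEL_H100_TFLOPS = {8: 175.88, 64: 228.85}


def aggregate_tflops(num_gpus: int, per_gpu_tflops: float) -> float:
    """Total model TFLOP/sec across all GPUs."""
    return num_gpus * per_gpu_tflops


def per_gpu_ratio(results: dict[int, float], small: int, large: int) -> float:
    """Per-GPU throughput at the larger scale divided by the smaller scale."""
    return results[large] / results[small]


if __name__ == "__main__":
    for gpus, per_gpu in AUTOMODEL_H100_TFLOPS.items():
        print(f"{gpus} GPUs: {aggregate_tflops(gpus, per_gpu):.2f} TFLOP/sec in total")
    print(f"per-GPU ratio: {per_gpu_ratio(AUTOMODEL_H100_TFLOPS, 8, 64):.2f}")
```

A ratio above 1.0 means the 64-GPU run achieved higher per-GPU throughput than the 8-GPU run in these measurements.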
74+
75+
