NeMo DFM: Diffusion Foundation Models

Overview

NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusion models for Video, Image, and Text generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.

Dual-Path Architecture: DFM provides two complementary training paths to maximize flexibility:

🌉 Megatron Bridge Path: Built on Megatron Bridge, which leverages Megatron Core for maximum scalability with 6D parallelism (TP, PP, CP, EP, VPP, DP)
🚀 AutoModel Path: Built on NeMo AutoModel for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration

Choose the path that best fits your workflow, or use both for different stages of development.
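To make the parallelism sizes above more concrete, here is a minimal sketch of how they usually relate to the GPU count in Megatron Core style setups: the tensor (TP), pipeline (PP), and context (CP) parallel sizes multiply together, and the data-parallel (DP) size is whatever remains of the world size. EP and VPP are left out on the assumption that they reuse existing ranks rather than adding a separate factor. The function name and the example numbers are illustrative assumptions, not part of the DFM API.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size implied by the other parallel sizes.

    Hypothetical helper for illustration only; it is not a DFM function.
    """
    model_parallel = tp * pp * cp
    if min(world_size, tp, pp, cp) < 1:
        raise ValueError("all sizes must be positive integers")
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # Example: 64 GPUs split as TP=2, PP=2, CP=2 leaves 8 data-parallel replicas.
    print(data_parallel_size(64, tp=2, pp=2, cp=2))
```

A check like this can catch an inconsistent configuration before a job is launched, since a world size that is not divisible by the product of the model-parallel sizes cannot be laid out evenly.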

🔧 Installation

🐳 Build Your Own Container

1. Build the container

# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
git submodule update --init --recursive

# Build the container
docker build -f docker/Dockerfile.ci -t dfm:dev .

2. Start the container

docker run --rm -it --gpus all \
  --entrypoint bash \
  -v "$(pwd)":/opt/DFM dfm:dev

📦 Using DFM Docker (Coming Soon)

⚡ Quickstart

Megatron Bridge Path

Run a Recipe

You can find all predefined recipes under the recipes directory.

Note: Run the recipes with uv and pass --group megatron-bridge.

# Set num_gpus to the number of GPUs on this node before running
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
  examples/megatron/recipes/wan/pretrain_wan.py \
  --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
  --training-mode pretrain \
  --mock

AutoModel Path

Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:

Run a Recipe

You can find pre-configured recipes under the automodel/finetune and automodel/pretrain directories.

Note: AutoModel examples live under dfm/examples/automodel. Use uv with --group automodel. Configs are YAML-driven; pass -c <path> to override the default.

The fine-tune recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 Hybrid Sharding. It parallelizes heavy transformer blocks while keeping lightweight modules (e.g., the VAE) unsharded for efficiency. Adjust batch sizes, learning rate, and parallel sizes in dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml.

The generation script demonstrates distributed inference with AutoModel DTensor managers and writes an MP4 on rank 0. Frame size, frame count, sampling steps, and CFG (classifier-free guidance) can be adjusted through command-line flags.

# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs)
uv run --group automodel torchrun --nproc-per-node=8 \
  dfm/examples/automodel/finetune/finetune.py \
  -c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml

# Generate videos with FSDP2 (distributed inference)
uv run --group automodel torchrun --nproc-per-node=8 \
  dfm/examples/automodel/generate/wan_generate.py
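For readers new to hybrid sharding, the sketch below illustrates the general idea behind HSDP on the 8-GPU example above: ranks are arranged in a 2D grid, parameters are sharded within each row (a shard group) and replicated across rows (replicate groups). This README does not show how NeMo AutoModel actually builds its device mesh, so the function name, the row-major layout, and the shard size of 4 are assumptions chosen for illustration.

```python
def hsdp_groups(world_size: int, shard_size: int):
    """Split ranks into shard groups (rows) and replicate groups (columns).

    Hypothetical helper: assumes a row-major (replicate, shard) rank layout.
    """
    if shard_size <= 0 or world_size <= 0 or world_size % shard_size != 0:
        raise ValueError("world_size must be a positive multiple of shard_size")
    replicate_size = world_size // shard_size
    # Each row holds the ranks that jointly store one sharded copy of the model.
    shard_groups = [
        [r * shard_size + s for s in range(shard_size)]
        for r in range(replicate_size)
    ]
    # Each column holds the ranks that store the same shard in different copies.
    replicate_groups = [list(col) for col in zip(*shard_groups)]
    return shard_groups, replicate_groups


if __name__ == "__main__":
    shard, replicate = hsdp_groups(world_size=8, shard_size=4)
    print("shard groups:    ", shard)
    print("replicate groups:", replicate)
```

In this picture, gradients are reduce-scattered inside each shard group and then all-reduced across the matching replicate group, which keeps the heaviest communication within the smaller group.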

🚀 Key Features

Dual Training Paths

Megatron Bridge Path
- State-of-the-art performance optimizations (TFLOPs)
- 🎯 Advanced parallelism: Tensor (TP), Context (CP), Data (DP), etc.
- 📈 Near-linear scalability to thousands of nodes
- 🔧 Production-ready recipes with optimized hyperparameters
AutoModel Path
- 🌐 PyTorch DTensor-native SPMD training
- 🚀 Advanced parallelism (TP, PP, etc.) coming soon!
- 🔀 FSDP2-based Hybrid Sharding Data Parallelism (HSDP)
- 📦 Sequence packing for efficient training
- 🎨 Minimal ceremony with YAML-driven configs
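As a rough illustration of the sequence packing mentioned in the list above, the following sketch groups variable-length samples into fixed-capacity packs so that less of each batch is spent on padding. It uses a generic first-fit-decreasing heuristic; the README does not describe the packing strategy DFM uses, so the function name and algorithm here are assumptions.

```python
def pack_sequences(lengths, max_len):
    """Greedily pack sequence indices into bins of capacity max_len.

    Generic first-fit-decreasing sketch, not DFM's packing implementation.
    Returns a list of packs, each a list of indices into `lengths`.
    """
    bins = []  # each entry is [remaining_capacity, [sequence indices]]
    # Visit the longest sequences first so they claim space early.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has length {n} > max_len {max_len}")
        for b in bins:
            if b[0] >= n:  # first pack with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([max_len - n, [i]])  # open a new pack
    return [b[1] for b in bins]


if __name__ == "__main__":
    # Five sequences fit into three packs of capacity 8.
    print(pack_sequences([5, 3, 7, 2, 4], max_len=8))
```

A real pipeline would also need attention masks or cumulative sequence lengths so that tokens from different samples in one pack do not attend to each other.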

Shared Capabilities

🎥 Multi-Modal Diffusion: Support for video, image, and text generation
🔬 Advanced Samplers: EDM, Flow Matching, and custom diffusion schedules
🎭 Flexible Architectures: DiT (Diffusion Transformers), WAN (World Action Networks)
📊 Efficient Data Loading: Data pipelines with sequence packing
💾 Distributed Checkpointing: SafeTensors-based sharded checkpoints
🌟 Memory Optimization: Gradient checkpointing, mixed precision, efficient attention
🤗 Hugging Face Integration: Seamless integration with the Hugging Face ecosystem
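To give a feel for the Flow Matching objective listed above, here is a minimal sketch of how one training pair can be built under a common linear-interpolation formulation: a noisy sample x_t = (1 - t) * x0 + t * noise and a velocity target noise - x0 that the model is trained to predict. Sign and time conventions differ between implementations, and the README does not specify which one DFM uses, so treat the function names and the convention as assumptions.

```python
import random


def flow_matching_pair(x0, noise, t):
    """Build (x_t, velocity target) for a linear path from data to noise.

    Illustrative convention only: t=0 gives the clean sample, t=1 gives noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    random.seed(0)
    x0 = [1.0, 2.0]                              # stand-in for a clean latent
    noise = [random.gauss(0.0, 1.0) for _ in x0]
    t = random.random()                          # timestep sampled uniformly
    x_t, target = flow_matching_pair(x0, noise, t)
    # A real model would predict the velocity from (x_t, t); zeros stand in here.
    print(mse([0.0, 0.0], target))
```

During training, the loss would be the error between the model's predicted velocity at (x_t, t) and this target, averaged over samples and timesteps.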

Supported Models

DFM provides out-of-the-box support for state-of-the-art diffusion architectures:

Model	Type	Megatron Bridge	AutoModel	Description
DiT	Image/Video	pretrain, inference	🔜	Diffusion Transformers with scalable architecture
WAN 2.1	Video	pretrain, finetune, inference	pretrain, finetune, inference	World Action Networks for video generation

Performance Benchmarking

For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.

Project Structure

DFM/
├── dfm/
│   └── src/
│       ├── megatron/              # Megatron Bridge path
│       │   ├── base/              # Base utilities for Megatron
│       │   ├── data/              # Data loaders and task encoders
│       │   │   ├── common/        # Shared data utilities
│       │   │   ├── <model_name>/  # model-specific data handling
│       │   ├── model/             # Model implementations
│       │   │   ├── common/        # Shared model components
│       │   │   ├── <model_name>/  # model-specific implementations
│       │   └── recipes/           # Training recipes
│       │       ├── <model_name>/  # model-specific training configs
│       ├── automodel/             # AutoModel path (DTensor-native)
│       │   ├── _diffusers/        # Diffusion pipeline integrations
│       │   ├── datasets/          # Dataset implementations
│       │   ├── distributed/       # Parallelization strategies
│       │   ├── flow_matching/     # Flow matching implementations
│       │   ├── recipes/           # Training scripts
│       │   └── utils/             # Utilities and validation
│       └── common/                # Shared across both paths
│           ├── data/              # Common data utilities
│           └── utils/             # Batch ops, video utils, etc.
├── examples/                      # Example scripts and configs

🎯 Choosing Your Path

Feature	Megatron Bridge	AutoModel
Best For	Maximum scale (1000+ GPUs)	Flexibility & fast iteration
Parallelism	6D (TP, CP, DP, etc.)	FSDP2 (TP, SP, CP coming soon)
HF Integration	Via bridge/conversion	HF-native (via DTensor)
Checkpoint Format	Megatron + HF export	HF-native (SafeTensors with DCP)
Learning Curve	Steeper (more knobs)	Gentler (YAML-driven)
Performance	Highest at scale	Excellent, PyTorch-native

Recommendation:

Start with AutoModel for quick prototyping and HF model compatibility
Move to Megatron Bridge when scaling to 100+ GPUs or when you need advanced parallelism
Use both: prototype with AutoModel, scale with Megatron Bridge!

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Setting up your development environment
Code style and testing guidelines
Submitting pull requests
Reporting issues

For questions or discussions, please open an issue on GitHub.

Acknowledgements

NeMo DFM builds upon the excellent work of:

Megatron-LM - Advanced model parallelism
Megatron Bridge - Hugging Face ↔ Megatron bridge
NeMo AutoModel - PyTorch-native SPMD training
PyTorch Distributed - Foundation for distributed training
Diffusers - Diffusion model implementations
