Initial README commit #53
Merged

Commits (31):
- 3266077 Initial README commit (abhinavg4)
- 9add867 Update README and add performance summary documentation (abhinavg4)
- 79f9d26 add DiT megatron links. (sajadn)
- b96cf8f Performance Docs update (parthmannan)
- 2b00158 Performance Docs update fix (parthmannan)
- 8e471a0 Update README to enhance clarity and accuracy (abhinavg4)
- 6f92b01 Enhance README with detailed performance optimizations and parallelis… (abhinavg4)
- 2233811 Update perf doc (parthmannan)
- 60fae1d Merge branch 'readme_init' of github.com:NVIDIA-NeMo/DFM into readme_… (parthmannan)
- 88ddbf1 update (linnanwang)
- 2aaae5e Update README with fine-tuning command (linnanwang)
- 9abba18 Apply suggestion from @akoumpa (akoumpa)
- 22c6790 Apply suggestion from @akoumpa (akoumpa)
- 49c8a24 Apply suggestion from @akoumpa (akoumpa)
- 10433b3 Update README, Wan-related. (huvunvidia)
- 901174e Apply suggestion from @akoumpa (akoumpa)
- 03560c7 Fixing typo @akoumpa (akoumpa)
- ca6d9cf fix automodel section (akoumpa)
- 4b38e3d fix (akoumpa)
- b628d48 update DFM-specific readme (pablo-garay)
- 48d65a6 Update performance-summary.md (akoumpa)
- fec3b40 Update performance-summary.md (akoumpa)
- df982be Update performance-summary.md (akoumpa)
- 796103e Update README.md (abhinavg4)
- 9ea6116 Update README.md (abhinavg4)
- ebf00bf Update README.md (abhinavg4)
- 7083f86 Update README.md (abhinavg4)
- f6f3a30 Refactor README.md and performance-summary.md for clarity and concise… (abhinavg4)
- 31e7def Merge branch 'main' into readme_init (abhinavg4)
- f86c51e Fix typo in README.md: changed "Built" to "Build" in the container se… (abhinavg4)
- 8640f3f Merge branch 'main' into readme_init (abhinavg4)
@@ -1,30 +1,198 @@
# NeMo DFM: Diffusion Foundation Models collection

NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment.

## Projects

This collection consists of four projects:
1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst)
2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md)
3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md)
4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md)

## Citations

If you find our code useful, please consider citing the following papers:
```bibtex
@article{patel2025training,
  title={Training Video Foundation Models with NVIDIA NeMo},
  author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others},
  journal={arXiv preprint arXiv:2503.12964},
  year={2025}
}
@article{agarwal2025cosmos,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}
```
<div align="center">

# NeMo DFM: Diffusion Foundation Models

<!-- We are still using Mbridge CICD NeMo. @pablo can we get our own? and the same for star gazer-->

<!-- Not includeing codecov for now since we have not worked on it extensively-->

[](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml)
[](https://www.python.org/downloads/release/python-3100/)
[](https://github.com/NVIDIA-NeMo/DFM/stargazers/)

**State-of-the-art framework for fast, large-scale training and inference of diffusion models**

[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md)

</div>
## Overview

NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.

**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:

- **🌉 Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) and [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with tensor, pipeline, and context parallelism
- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration

Choose the path that best fits your workflow, or use both for different stages of development!

<!-- Once we have updated images of how DFM fits into NeMo journey. Put them here. @Eliiot can help.-->
## 🔧 Installation

### 🐳 Build Your Own Container

#### 1. Build the container

```bash
# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
git submodule update --init --recursive

# Build the container
docker build -f docker/Dockerfile.ci -t dfm:dev .
```

#### 2. Start the container

```bash
docker run --rm -it --gpus all \
  --entrypoint bash \
  -v $(pwd):/opt/DFM dfm:dev
```
### 📦 Using DFM Docker (Coming Soon)

## ⚡ Quickstart

### Megatron Bridge Path

#### Run a Recipe

You can find all predefined recipes under the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.

> **Note:** You will need [uv](https://docs.astral.sh/uv/) to run the recipes; pass `--group megatron-bridge` so the correct dependency group is installed.

<!-- @Huy please update the below command after you change defaults-->

```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc_per_node=2 examples/megatron/recipes/wan/pretrain_wan.py model.qkv_format=thd --mock
```
### AutoModel Path

Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:

<!-- @Linnan, @Alex please add this thanks a ton-->

```bash
# TODO
# Fine-tune a video diffusion model with FSDP2
uv run torchrun --nproc-per-node=8 \
  dfm/src/automodel/recipes/finetune.py \
  --config examples/automodel/wan21_finetune.yaml

# Override parameters via CLI
# TODO
uv run torchrun --nproc-per-node=8 \
  dfm/src/automodel/recipes/finetune.py \
  --config examples/automodel/wan21_finetune.yaml \
  --step_scheduler.local_batch_size 4 \
  --model.pretrained_model_name_or_path "your-model-id"
```
## 🚀 Key Features

### Dual Training Paths

- **Megatron Bridge Path**
  - 🔄 Bidirectional HuggingFace ↔ Megatron checkpoint conversion
  - 🎯 Advanced parallelism: Tensor (TP), Pipeline (PP), Context (CP), Expert (EP)
  - 📈 Near-linear scalability to thousands of nodes
  - 🔧 Production-ready recipes with optimized hyperparameters

- **AutoModel Path**
  - 🌐 PyTorch DTensor-native SPMD training
  - 🔀 FSDP2-based Hybrid Sharding Data Parallelism (HSDP)
  - 📦 Sequence packing for efficient training
  - 🎨 Minimal ceremony with YAML-driven configs
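The sequence-packing idea mentioned above can be sketched in plain Python. This is an illustrative greedy first-fit packer, not DFM's actual implementation; the function name and parameters are invented for the example.

```python
# Illustrative sketch only -- not the DFM API. Sequence packing places several
# variable-length samples into one fixed-capacity slot so less compute is
# wasted on padding. Greedy first-fit assignment by sample index:

def pack_sequences(lengths, capacity):
    """Assign each sequence index to the first bucket with room left.

    Returns a list of buckets, each a list of sequence indices whose
    total length fits within `capacity`.
    """
    buckets, loads = [], []
    for idx, length in enumerate(lengths):
        if length > capacity:
            raise ValueError(f"sequence {idx} exceeds capacity {capacity}")
        for b, load in enumerate(loads):
            if load + length <= capacity:
                buckets[b].append(idx)
                loads[b] += length
                break
        else:  # no existing bucket fits -> open a new one
            buckets.append([idx])
            loads.append(length)
    return buckets

# Five sequences of lengths 7, 3, 5, 2, 6 packed into slots of 10 tokens:
print(pack_sequences([7, 3, 5, 2, 6], capacity=10))  # [[0, 1], [2, 3], [4]]
```

Real training-side packers additionally record per-segment boundaries (for example, to build block-diagonal attention masks) so packed samples do not attend to one another.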
### Shared Capabilities

- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules
- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
- **📊 Efficient Data Loading**: Data pipelines with sequence packing
- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
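To make the samplers bullet concrete, here is a hedged sketch of the two schedules it names: the EDM noise-level schedule (Karras et al.) and the linear interpolant used in flow matching. The formulas are the published ones; the function names and default values are illustrative, not DFM's API.

```python
# Hedged sketch -- not DFM's code. Two scheduling ideas behind the
# "Advanced Samplers" bullet above.

def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM noise levels spaced uniformly in sigma**(1/rho), high to low.

    sigmas[0] == sigma_max and sigmas[-1] == sigma_min; sampling walks the
    list from the noisiest level down to the cleanest.
    """
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

def flow_match_interpolant(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1; in flow matching the network learns to
    predict the constant velocity (x1 - x0) along this straight path."""
    return (1 - t) * x0 + t * x1

# Ten noise levels, decreasing monotonically from sigma_max to sigma_min:
sigmas = edm_sigmas(10)
```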
## Supported Models

DFM provides out-of-the-box support for state-of-the-art diffusion architectures:

| Model | Type | Megatron Bridge | AutoModel | Description |
|-------|------|-----------------|-----------|-------------|
| **DiT** | Image/Video | [pretrain, finetune](@Sajad) | 🔜 | Diffusion Transformers with scalable architecture |
| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | @Linnan, @Alex | World Action Networks for video generation |

## Performance Benchmarking

For detailed performance benchmarks, including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance.md) in our documentation.
## Project Structure

```
DFM/
├── dfm/
│   └── src/
│       ├── megatron/                    # Megatron Bridge path
│       │   ├── base/                    # Base utilities for Megatron
│       │   ├── data/                    # Data loaders and task encoders
│       │   │   ├── common/              # Shared data utilities
│       │   │   └── <model_name>/        # Model-specific data handling
│       │   ├── model/                   # Model implementations
│       │   │   ├── common/              # Shared model components
│       │   │   └── <model_name>/        # Model-specific implementations
│       │   └── recipes/                 # Training recipes
│       │       └── <model_name>/        # Model-specific training configs
│       ├── automodel (@linnan, @alex)/  # AutoModel path (DTensor-native)
│       │   ├── _diffusers/              # Diffusion pipeline integrations
│       │   ├── datasets/                # Dataset implementations
│       │   ├── distributed/             # Parallelization strategies
│       │   ├── flow_matching/           # Flow matching implementations
│       │   ├── recipes/                 # Training scripts
│       │   └── utils/                   # Utilities and validation
│       └── common/                      # Shared across both paths
│           ├── data/                    # Common data utilities
│           └── utils/                   # Batch ops, video utils, etc.
├── examples/                            # Example scripts and configs
```
## 🎯 Choosing Your Path

| Feature | Megatron Bridge | AutoModel |
|---------|-----------------|-----------|
| **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration |
| **Parallelism** | TP, PP, CP, EP, VPP | FSDP2, TP, SP, CP |
| **HF Integration** | Via bridge/conversion | PyTorch-native DTensor |
| **Checkpoint Format** | Megatron + HF export | SafeTensors DCP |
| **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) |
| **Performance** | Highest at scale | Excellent, PyTorch-native |

**Recommendation**:
- Start with **AutoModel** for quick prototyping and HF model compatibility
- Move to **Megatron Bridge** when scaling to 100+ GPUs or when you need advanced parallelism
- Use **both**: prototype with AutoModel, scale with Megatron Bridge!
## 🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

- Setting up your development environment
- Code style and testing guidelines
- Submitting pull requests
- Reporting issues

For questions or discussions, please open an issue on GitHub.

## Acknowledgements

NeMo DFM builds upon the excellent work of:

- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - Advanced model parallelism
- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations