## 🚀 Megatron DiT

### 📋 Overview
An open-source implementation of [Diffusion Transformers (DiTs)](https://github.com/facebookresearch/DiT) for training text-to-image/video models with [EDMPipeline](https://arxiv.org/abs/2206.00364). The implementation builds on [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) for both scalability and efficiency. Various parallelization techniques, such as tensor, sequence, and context parallelism, are currently supported.

---

### 📦 Dataset Preparation
This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. Datasets should be in a WebDataset-compatible format (typically sharded `.tar` archives). Energon supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (e.g., text-image, text-video). Set `dataset.path` to your WebDataset location or shard pattern. See the Megatron-Energon documentation for format details and advanced options.
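
For orientation, the sketch below shows what a prepared dataset directory might look like. The directory and shard names are purely illustrative; `dataset.path` simply points at the directory that holds your shards and, after running `energon prepare`, the `.nv-meta` metadata folder.

```bash
# Hypothetical layout -- directory and shard names are illustrative only.
$ ls /data/my_webdataset
.nv-meta/           # metadata written by `energon prepare` (split info, .info.yaml, dataset.yaml)
rank0-000000.tar    # WebDataset shards containing the packed samples
rank1-000000.tar
...
# In your config (or as a command-line override), set dataset.path to this directory:
#   dataset.path: /data/my_webdataset
```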

#### 🦋 Dataset Preparation Example

As an example, you can use the [butterfly-dataset](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) available on Hugging Face.

The script below packs the Hugging Face dataset into WebDataset format, which Energon requires.
```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
    examples/megatron/recipes/dit/prepare_energon_dataset_butterfly.py
```

If you already have the T5 model or video tokenizer downloaded, you can point to them with the optional arguments `--t5_cache_dir` and `--tokenizer_cache_dir`.

```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
    examples/megatron/recipes/dit/prepare_energon_dataset_butterfly.py \
    --t5_cache_dir $t5_cache_dir \
    --tokenizer_cache_dir $tokenizer_cache_dir
```

Then you need to run `energon prepare $dataset_path` and choose `CrudeWebdataset` as the sample type:

```bash
energon prepare ./
Found 8 tar files in total. The first and last ones are:
- rank0-000000.tar
- rank7-000000.tar
If you want to exclude some of them, cancel with ctrl+c and specify an exclude filter in the command line.
Please enter a desired train/val/test split like "0.5, 0.2, 0.3" or "8,1,1": 1,0,0
Saving info to /opt/datasets/butterfly_webdataset_new/.nv-meta/.info.yaml
Sample 0, keys:
 - json
 - pickle
 - pth
Json content of sample 0 of rank0-000000.tar:
{
  "image_height": 1,
  "image_width": 512
}
Sample 1, keys:
 - json
 - pickle
 - pth
Json content of sample 1 of rank0-000000.tar:
{
  "image_height": 1,
  "image_width": 512
}
Found the following part types in the dataset: pth, json, pickle
Do you want to create a dataset.yaml interactively? [Y/n]: y
The following sample types are available:
0. CaptioningSample
1. ImageClassificationSample
2. ImageSample
3. InterleavedSample
4. MultiChoiceVQASample
5. OCRSample
6. Sample
7. SimilarityInterleavedSample
8. TextSample
9. VQASample
10. VidQASample
11. Crude sample (plain dict for cooking)
Please enter a number to choose a class: 11
CrudeWebdataset does not need a field map. You will need to provide a `Cooker` for your dataset samples in your `TaskEncoder`.
Furthermore, you might want to add `subflavors` in your meta dataset specification.
Done
```

---

### 🐳 Build Container

Please follow the instructions in the [container](https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container) section of the main README.

---

### 🏋️ Pretraining

Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
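
As a sketch, these two requirements translate into overrides like the ones below, applied on top of the pretraining command described later in this section. The exact key names (`train.micro_batch_size`, `model.qkv_format`) are assumptions based on the structure of the example config, so verify them against `dit_pretrain_example.yaml`.

```bash
# Sketch only: config key names are assumed -- check them against the example YAML.
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $NUM_GPUS examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
    train.micro_batch_size=1 \
    model.qkv_format=thd
```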
|
For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at the `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details, see the [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documentation.
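
For illustration, these knobs could be set as command-line overrides alongside the rest of the dataset options. The `dataset.*` namespacing and the values below are assumptions made for the sketch, so confirm the exact keys in the example config.

```bash
# Sketch only: key names and values are illustrative, not verified defaults.
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $NUM_GPUS examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
    dataset.task_encoder_seq_length=8192 \
    dataset.packing_buffer_size=100
```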
|
Multiple parallelism techniques including tensor, sequence, and context parallelism are supported and can be configured based on your computational requirements.
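
As an illustration, parallelism degrees are typically selected through model-level options such as the ones below. The option names follow standard Megatron conventions (`tensor_model_parallel_size`, `sequence_parallel`, `context_parallel_size`) but are assumptions here, so check them against the Megatron-Bridge configuration documentation referenced below.

```bash
# Sketch: tensor parallelism across 2 GPUs with sequence and context parallelism enabled.
# Option names are assumed from Megatron conventions -- verify against your config.
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $NUM_GPUS examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
    model.tensor_model_parallel_size=2 \
    model.sequence_parallel=true \
    model.context_parallel_size=2
```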
|
The model architecture can be customized through parameters such as `num_layers` and `num_attention_heads`. A comprehensive list of configuration options is available in the [Megatron-Bridge documentation](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docs/megatron-lm-to-megatron-bridge.md).

**Note:** If using the `wandb_project` and `wandb_exp_name` arguments, ensure the `WANDB_API_KEY` environment variable is set.
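
For example, export the key in the shell that launches training (the placeholder value is yours to fill in):

```bash
# Make the key visible to the training process so metrics and samples can be logged to Wandb.
export WANDB_API_KEY="your-wandb-api-key"
```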
|
|
**Note:** During validation, the model generates one sample per GPU at the start of each validation round. These samples are saved to a `validation_generation` folder within `checkpoint_dir` and are also logged to Wandb if the `WANDB_API_KEY` environment variable is configured. To decode the generated latent samples, the model requires access to the video tokenizer used during dataset preparation. Specify the VAE artifacts location with the `vae_cache_folder` argument; otherwise, they will be downloaded during the first validation round.
|
#### Pretraining Script Example
First, copy the example config file and update it with your own settings:
|
```bash
cp examples/megatron/recipes/dit/conf/dit_pretrain_example.yaml examples/megatron/recipes/dit/conf/my_config.yaml
# Edit my_config.yaml to set:
# - model.vae_cache_folder: Path to the VAE cache folder
# - dataset.path: Path to your dataset folder
# - checkpoint.save and checkpoint.load: Path to the checkpoint folder
# - train.global_batch_size: Set to a value divisible by the number of GPUs
# - logger.wandb_exp_name: Your experiment name
```
|
Then run:
|
```bash
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $NUM_GPUS examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml
```
|
You can still override any config values from the command line:
|
```bash
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
    train.train_iters=20000 \
    model.num_layers=32
```
|
**Note:** If you dedicate 100% of the data to training, you need to pass `dataset.use_train_split_for_val=true` to use a subset of the training data for validation.
|
```bash
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
    dataset.use_train_split_for_val=true
```
|
#### 🧪 Quick Start with Mock Dataset
|
If you want to run the code without having a dataset ready (for example, for performance measurement), you can pass the `--mock` flag to activate a mock dataset.
|
```bash
uv run --group megatron-bridge python -m torch.distributed.run \
    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
    --config-file examples/megatron/recipes/dit/conf/dit_pretrain.yaml \
    --mock
```
|
---

### 🎬 Inference
|
Once training completes, you can run inference using [inference_dit_model.py](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py). The script requires your trained model checkpoint (`--checkpoint_path`) and a path to save generated videos (`--video_save_path`). You can pass two optional arguments, `--t5_cache_dir` and `--tokenizer_cache_dir`, to avoid re-downloading artifacts you already have locally.
|
```bash
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
    examples/megatron/recipes/dit/inference_dit_model.py \
    --t5_cache_dir $artifact_dir \
    --tokenizer_cache_dir $tokenizer_cache_dir \
    --tokenizer_model Cosmos-0.1-Tokenizer-CV4x8x8 \
    --checkpoint_path $checkpoint_dir \
    --num_video_frames 10 \
    --height 240 \
    --width 416 \
    --video_save_path $save_path \
    --prompt "$prompt"
```
|
---
|
### ⚡ Parallelism Support
|
The table below shows current parallelism support for different model sizes:
|
| Model | Data Parallel | Tensor Parallel | Sequence Parallel | Context Parallel |
|---|---|---|---|---|
| **DiT-S (330M)** | TBD | TBD | TBD | TBD |
| **DiT-L (450M)** | TBD | TBD | TBD | TBD |
| **DiT-XL (700M)** | ✅ | ✅ | ✅ | ✅ |