diff --git a/docs/megatron/models/dit/README.md b/docs/megatron/models/dit/README.md
new file mode 100644
index 00000000..98e556e9
--- /dev/null
+++ b/docs/megatron/models/dit/README.md
@@ -0,0 +1,184 @@
+## 🚀 Megatron DiT
+
+### 📋 Overview
+An open-source implementation of [Diffusion Transformers (DiTs)](https://github.com/facebookresearch/DiT) for training text-to-image/video models with [EDMPipeline](https://arxiv.org/abs/2206.00364). The implementation is based on [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), bringing both scalability and efficiency. Various parallelization techniques such as tensor, sequence, and context parallelism are currently supported.
+
+---
+
+### 📦 Dataset Preparation
+This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. Datasets should be in a WebDataset-compatible format (typically sharded `.tar` archives). Energon efficiently supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (e.g., text-image, text-video). Set `dataset.path` to your WebDataset location or shard pattern. See the Megatron-Energon documentation for format details and advanced options.
+
+#### 🦋 Dataset Preparation Example
+
+As an example, you can use the [butterfly-dataset](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) available on Hugging Face.
+
+The script below packs the Hugging Face dataset into the WebDataset format that Energon requires:
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+    examples/megatron/recipes/dit/prepare_energon_dataset_butterfly.py
+```
+
+If you already have the T5 model or the video tokenizer downloaded, you can point to them with the optional arguments `--t5_cache_dir` and `--tokenizer_cache_dir`:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+    examples/megatron/recipes/dit/prepare_energon_dataset_butterfly.py \
+    --t5_cache_dir $t5_cache_dir \
+    --tokenizer_cache_dir $tokenizer_cache_dir
+```
+
+Then run `energon prepare $dataset_path` and choose `CrudeWebdataset` as the sample type:
+
+```bash
+energon prepare ./
+Found 8 tar files in total. The first and last ones are:
+- rank0-000000.tar
+- rank7-000000.tar
+If you want to exclude some of them, cancel with ctrl+c and specify an exclude filter in the command line.
+Please enter a desired train/val/test split like "0.5, 0.2, 0.3" or "8,1,1": 1,0,0
+Saving info to /opt/datasets/butterfly_webdataset_new/.nv-meta/.info.yaml
+Sample 0, keys:
+ - json
+ - pickle
+ - pth
+Json content of sample 0 of rank0-000000.tar:
+{
+  "image_height": 1,
+  "image_width": 512
+}
+Sample 1, keys:
+ - json
+ - pickle
+ - pth
+Json content of sample 1 of rank0-000000.tar:
+{
+  "image_height": 1,
+  "image_width": 512
+}
+Found the following part types in the dataset: pth, json, pickle
+Do you want to create a dataset.yaml interactively? [Y/n]: y
+The following sample types are available:
+0. CaptioningSample
+1. ImageClassificationSample
+2. ImageSample
+3. InterleavedSample
+4. MultiChoiceVQASample
+5. OCRSample
+6. Sample
+7. SimilarityInterleavedSample
+8. TextSample
+9. VQASample
+10. VidQASample
+11. Crude sample (plain dict for cooking)
+Please enter a number to choose a class: 11
+CrudeWebdataset does not need a field map.
+You will need to provide a `Cooker` for your dataset samples in your `TaskEncoder`.
+Furthermore, you might want to add `subflavors` in your meta dataset specification.
+Done
+```
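+
+Optionally, you can peek inside one of the shards to confirm that every sample carries the part types reported above (`json`, `pickle`, `pth`). This is just a convenience check; the shard name below is the example from the listing above.
+
+```bash
+# List the first members of a shard; each sample key should appear once per part type.
+tar -tf rank0-000000.tar | head -n 12
+```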
+
+---
+
+### 🐳 Build Container
+
+Please follow the instructions in the [container](https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container) section of the main README.
+
+---
+
+### 🏋️ Pretraining
+
+Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
+
+For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create the different buckets. You can look at the `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details, see the [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documentation.
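+
+To build intuition for how these two knobs interact, here is a simplified, self-contained packing sketch. It is **not** the actual `DiffusionTaskEncoderWithSequencePacking` logic (which lives in `select_samples_to_pack` / `pack_selected_samples` and packs real samples rather than bare lengths); it only illustrates how samples drawn from a buffer of `packing_buffer_size` samples are greedily grouped into buckets whose total length stays within `task_encoder_seq_length`.
+
+```python
+from typing import List
+
+
+def pack_lengths(seq_lens: List[int], max_seq_length: int) -> List[List[int]]:
+    """Greedy first-fit-decreasing packing of per-sample sequence lengths
+    into buckets whose total stays within max_seq_length."""
+    buckets: List[List[int]] = []
+    totals: List[int] = []
+    for length in sorted(seq_lens, reverse=True):
+        for i, total in enumerate(totals):
+            if total + length <= max_seq_length:
+                buckets[i].append(length)
+                totals[i] += length
+                break
+        else:
+            # No existing bucket has room; open a new one.
+            buckets.append([length])
+            totals.append(length)
+    return buckets
+
+
+# A buffer of packing_buffer_size samples (here 8, lengths are illustrative)
+# packed against a task_encoder_seq_length of 16 tokens.
+print(pack_lengths([7, 3, 9, 2, 5, 6, 4, 8], max_seq_length=16))
+# -> [[9, 7], [8, 6, 2], [5, 4, 3]]
+```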
+
+Multiple parallelism techniques, including tensor, sequence, and context parallelism, are supported and can be configured based on your computational requirements.
+
+The model architecture can be customized through parameters such as `num_layers` and `num_attention_heads`. A comprehensive list of configuration options is available in the [Megatron-Bridge documentation](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docs/megatron-lm-to-megatron-bridge.md).
+
+**Note:** If using the `wandb_project` and `wandb_exp_name` arguments, ensure the `WANDB_API_KEY` environment variable is set.
+
+**Note:** During validation, the model generates one sample per GPU at the start of each validation round. These samples are saved to a `validation_generation` folder within `checkpoint_dir` and are also logged to Wandb if the `WANDB_API_KEY` environment variable is configured. To decode the generated latent samples, the model requires access to the video tokenizer used during dataset preparation. Specify the VAE artifacts location with the `vae_cache_folder` argument; otherwise, they will be downloaded during the first validation round.
+
+#### Pretraining script example
+First, copy the example config file and update it with your own settings:
+
+```bash
+cp examples/megatron/recipes/dit/conf/dit_pretrain_example.yaml examples/megatron/recipes/dit/conf/my_config.yaml
+# Edit my_config.yaml to set:
+# - model.vae_cache_folder: Path to VAE cache folder
+# - dataset.path: Path to your dataset folder
+# - checkpoint.save and checkpoint.load: Path to checkpoint folder
+# - train.global_batch_size: Set to a value divisible by NUM_GPUS
+# - logger.wandb_exp_name: Your experiment name
+```
+
+Then run:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/my_config.yaml
+```
+
+You can still override any config values from the command line:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
+    train.train_iters=20000 \
+    model.num_layers=32
+```
+
+**Note:** If you dedicate 100% of the data to training, you need to pass `dataset.use_train_split_for_val=true` to use a subset of the training data for validation:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
+    dataset.use_train_split_for_val=true
+```
+
+#### 🧪 Quick Start with Mock Dataset
+
+If you want to run the code without having the dataset ready (for performance measurement purposes, for example), you can pass the `--mock` flag to activate a mock dataset:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/dit_pretrain.yaml \
+    --mock
+```
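+
+For a short smoke test, `--mock` can be combined with the usual command-line overrides; the values below are only illustrative:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/dit_pretrain.yaml \
+    --mock \
+    train.train_iters=100 \
+    logger.log_interval=10
+```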
+
+### 🎬 Inference
+
+Once training completes, you can run inference using [inference_dit_model.py](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py). The script requires your trained model checkpoint (`--checkpoint_path`) and a path to save the generated videos (`--video_save_path`). You can pass the two optional arguments `--t5_cache_dir` and `--tokenizer_cache_dir` to avoid re-downloading the T5 and tokenizer artifacts.
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+    examples/megatron/recipes/dit/inference_dit_model.py \
+    --t5_cache_dir $artifact_dir \
+    --tokenizer_cache_dir $tokenizer_cache_dir \
+    --tokenizer_model Cosmos-0.1-Tokenizer-CV4x8x8 \
+    --checkpoint_path $checkpoint_dir \
+    --num_video_frames 10 \
+    --height 240 \
+    --width 416 \
+    --video_save_path $save_path \
+    --prompt "$prompt"
+```
+
+---
+
+### ⚡ Parallelism Support
+
+The table below shows current parallelism support for different model sizes:
+
+| Model | Data Parallel | Tensor Parallel | Sequence Parallel | Context Parallel |
+|---|---|---|---|---|
+| **DiT-S (330M)** | TBD | TBD | TBD | TBD |
+| **DiT-L (450M)** | TBD | TBD | TBD | TBD |
+| **DiT-XL (700M)** | ✅ | ✅ | ✅ | ✅ |
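+
+For example, tensor, sequence, and context parallelism can be enabled together through the usual config overrides. The degrees below are only illustrative; pick values that fit your GPU count and memory budget:
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run \
+    --nproc-per-node $num_gpus examples/megatron/recipes/dit/pretrain_dit_model.py \
+    --config-file examples/megatron/recipes/dit/conf/my_config.yaml \
+    model.tensor_model_parallel_size=2 \
+    model.sequence_parallel=true \
+    model.context_parallel_size=2
+```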
diff --git a/examples/megatron/recipes/dit/README.md b/examples/megatron/recipes/dit/README.md
deleted file mode 100644
index baef566f..00000000
--- a/examples/megatron/recipes/dit/README.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# DiT (Diffusion Transformer) Model Setup
-
-This guide provides instructions for setting up and running the DiT model on the butterfly dataset.
-
-## Overview
-
-Megatron-LM and Megatron-Bridge coming with the docker image are not compatible with DiT model. This setup guide will walk you through configuring the environment properly.
-
-## Setup Instructions
-
-### 1. Clone Required Repositories
-
-The following repositories need to be cloned with specific commit hashes:
-
-#### Megatron-LM
-```bash
-git clone https://github.com/NVIDIA/Megatron-LM.git
-cd Megatron-LM
-git checkout aecce9e95624ddfedbd2bd3ce599e36cd96da065
-cd ..
-```
-
-#### Megatron-Bridge
-```bash
-git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
-cd Megatron-Bridge
-git checkout 83f90524f9bc467e8f864e2f9bd1da1246594ab9
-cd ..
-```
-
-#### DFM Repository
-```bash
-git clone https://github.com/NVIDIA-NeMo/DFM.git
-cd DFM
-git checkout dit_debug
-cd ..
-```
-
-### 2. Dataset Location
-
-The butterfly webdataset is accesible on eos clusters in the path below:
-```
-/home/snorouzi/code/butterfly_webdataset
-```
-
-## Docker Setup
-
-Run the following Docker command to start the container with all necessary volume mounts:
-
-```bash
-sudo docker run --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -w /opt/dfm --rm \
-    -v ${DATA_PATH}/butterfly_webdataset:/opt/VFM/butterfly_webdataset \
-    -v ${CODE_PATH}/Megatron-LM:/opt/megatron-lm \
-    -v ${CODE_PATH}/Megatron-Bridge/:/opt/Megatron-Bridge/ \
-    -v ${CODE_PATH}/DFM:/opt/dfm \
-    -it nvcr.io/nvidian/nemo:25.09.rc6
-```
-
-**Note:** Set the `DATA_PATH` and `CODE_PATH` environment variables to point to your local directories before running this command.
-
-## Installation
-
-Once inside the container, install the required Python packages:
-
-```bash
-pip install --upgrade transformers
-pip install imageio==2.24
-pip install imageio[ffmpeg]
-```
-
-## Running the Model
-
-Execute the DiT model training with the following command:
-
-```bash
-torchrun --nproc-per-node 2 examples/megatron/recipes/dit/pretrain_dit_model.py --dataset_path "/opt/VFM/butterfly_webdataset"
-```
diff --git a/examples/megatron/recipes/dit/conf/dit_pretrain.yaml b/examples/megatron/recipes/dit/conf/dit_pretrain.yaml
new file mode 100644
index 00000000..a8e0d6ab
--- /dev/null
+++ b/examples/megatron/recipes/dit/conf/dit_pretrain.yaml
@@ -0,0 +1,41 @@
+# DiT Pretraining Configuration
+# This file contains all the configuration parameters for DiT pretraining
+# You can override any of these values via command line using Hydra-style syntax
+
+# Model configuration
+model:
+  tensor_model_parallel_size: 1
+  sequence_parallel: false
+  context_parallel_size: 1
+  qkv_format: thd  # Must be 'thd' for sequence packing
+  num_attention_heads: 16
+  vae_cache_folder: null  # Set to your VAE cache folder path
+
+# Dataset configuration
+dataset:
+  path: DATASET_FOLDER  # Set to your dataset folder path
+  task_encoder_seq_length: 15360
+  packing_buffer_size: 100
+  num_workers: 20
+
+# Checkpoint configuration
+checkpoint:
+  save: "dfm_experiment"  # Set to your checkpoint folder path
+  load: "dfm_experiment"  # Set to your checkpoint folder path (same as save for resuming)
+  load_optim: true
+  save_interval: 1000
+
+# Training configuration
+train:
+  eval_interval: 1000
+  train_iters: 10000
+  eval_iters: 32
+  global_batch_size: 8  # Set this to match NUM_GPUS or your desired batch size
+  micro_batch_size: 1  # Must be 1 for sequence packing
+
+# Logger configuration
+logger:
+  log_interval: 10
+  # remove wandb_project and wandb_exp_name to disable wandb logging
+  wandb_project: "DiT"
+  wandb_exp_name: "dfm_experiment"  # Set to your experiment name