Conversation

Contributor

@sajadn sajadn commented Nov 18, 2025

  • add dit readme.

Signed-off-by: sajadn <[email protected]>

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@abhinavg4 abhinavg4 left a comment

Left a bunch of comments. Also, please check these comments: https://github.com/NVIDIA-NeMo/DFM/pull/38/files

Signed-off-by: Sajad Norouzi <[email protected]>
@sajadn sajadn requested a review from abhinavg4 November 20, 2025 23:21
abhinavg4
abhinavg4 previously approved these changes Nov 21, 2025

@abhinavg4 abhinavg4 left a comment

Great documentation. Loved it. Left a few minor comments.

The following repositories need to be cloned with specific commit hashes:

#### Megatron-LM
The script below prepares the dataset to be compatible with Energon.
Contributor

Nit: it packs the HF dataset into WebDataset format. "Format compatible with Energon" sounds somewhat mystical.
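To illustrate the suggested wording: the WebDataset format is just a set of tar shards in which files sharing a basename ("key") form one sample, e.g. `000001.jpg` plus `000001.json`. A minimal, hypothetical sketch of such a packing step using only the standard library (the repository's actual packing script may differ):

```python
import io
import json
import tarfile

def pack_samples_to_shard(samples, shard_path):
    """Write (image_bytes, metadata) pairs into one WebDataset-style tar shard.

    In WebDataset, all files sharing a basename ("key") belong to one sample.
    """
    with tarfile.open(shard_path, "w") as tar:
        for idx, (image_bytes, metadata) in enumerate(samples):
            key = f"{idx:06d}"
            for ext, payload in (
                (".jpg", image_bytes),
                (".json", json.dumps(metadata).encode()),
            ):
                info = tarfile.TarInfo(name=key + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical usage with dummy bytes standing in for real images:
samples = [(b"\xff\xd8fake-jpeg", {"caption": f"sample {i}"}) for i in range(3)]
pack_samples_to_shard(samples, "shard-000000.tar")
```

`energon prepare` can then be pointed at a directory of such shards to generate its dataset metadata.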

```
git checkout dit_debug
cd ..
energon prepare ./
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
```
Contributor

Remove this print; it's not needed.

Contributor Author

Do you want me to get rid of all of the prints from `energon prepare ./`?

Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
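The two constraints in the paragraph above can be captured in a small sanity check. The dictionary keys mirror the hyperparameter names mentioned in the README; the `validate_packing_config` helper and the concrete values are purely illustrative, not part of DFM:

```python
def validate_packing_config(cfg: dict) -> None:
    """Check the two invariants sequence packing imposes (per the README)."""
    if cfg["micro_batch_size"] != 1:
        raise ValueError("sequence packing requires micro_batch_size == 1")
    if cfg["qkv_format"] != "thd":
        raise ValueError("Transformer Engine needs qkv_format == 'thd' for packing")

# Illustrative values; tune the sequence length and buffer size for your data.
config = {
    "micro_batch_size": 1,            # one packed sequence per micro-batch
    "qkv_format": "thd",              # tells Transformer Engine packing is enabled
    "task_encoder_seq_length": 8192,  # max packed sequence length (example value)
    "packing_buffer_size": 100,       # samples buffered before packing (example value)
}
validate_packing_config(config)
```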
For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters.
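To make the role of these parameters concrete, here is a hedged first-fit sketch of what selecting buffered samples into length-bounded packs can look like; the real `select_samples_to_pack` logic in `DiffusionTaskEncoderWithSequencePacking` may differ:

```python
def first_fit_pack(lengths, max_seq_length):
    """Group sample lengths into packs whose totals stay within max_seq_length.

    First-fit: place each sample into the first pack with room; otherwise
    open a new pack. `lengths` plays the role of the packing buffer, and
    `max_seq_length` plays the role of task_encoder_seq_length.
    """
    packs, totals = [], []
    for length in lengths:
        if length > max_seq_length:
            raise ValueError(f"sample of length {length} exceeds max_seq_length")
        for i, total in enumerate(totals):
            if total + length <= max_seq_length:
                packs[i].append(length)
                totals[i] += length
                break
        else:
            packs.append([length])
            totals.append(length)
    return packs

# Example: a buffer of five sample lengths packed to a max sequence length of 10.
print(first_fit_pack([4, 7, 3, 5, 2], max_seq_length=10))  # -> [[4, 3, 2], [7], [5]]
```

A larger buffer gives the packer more candidates to combine, so packs tend to be fuller, at the cost of more memory during data loading.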
Contributor

Great section. Could you also provide a link to the Energon documentation for these params?

@sajadn sajadn enabled auto-merge (squash) November 21, 2025 15:57
Signed-off-by: Sajad Norouzi <[email protected]>
@abhinavg4
Contributor

/ok to test 045f421

@@ -1,77 +1,184 @@
# DiT (Diffusion Transformer) Model Setup
Contributor

Please put this under docs/megatron/models/DiT/dit.md

Signed-off-by: Sajad Norouzi <[email protected]>

@abhinavg4 abhinavg4 left a comment

Looks good

@abhinavg4
Contributor

/ok to test 7a832f6

@sajadn sajadn merged commit 2489a8e into main Dec 1, 2025
15 checks passed
lbliii pushed a commit that referenced this pull request Dec 3, 2025
* Add DiT Readme.

Signed-off-by: sajadn <[email protected]>

* Update DiT readme.

Signed-off-by: Sajad Norouzi <[email protected]>

* Minor wording update.

Signed-off-by: Sajad Norouzi <[email protected]>

---------

Signed-off-by: sajadn <[email protected]>
Signed-off-by: Sajad Norouzi <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>