
Commit 78456b7

updates
Signed-off-by: Lawrence Lane <[email protected]>
1 parent a16f130

3 files changed: 16 additions & 18 deletions


docs/tutorials/fine-tuning-pretrained-models.md

Lines changed: 4 additions & 4 deletions
@@ -102,7 +102,7 @@ python dfm/src/automodel/utils/data/preprocess_resize.py \
 - `--height/--width`: Target resolution (both must be specified together)
 - `--center-crop`: Crop to exact size after aspect-preserving resize
 - `--device`: Device to use (`cuda` or `cpu`, default: `cuda` if available)
-- `--stochastic`: Use stochastic encoding instead of deterministic (may cause flares)
+- `--stochastic`: Use stochastic encoding instead of deterministic (can cause flares)
 - `--no-memory-optimization`: Disable Wan's built-in memory optimization

 **Output:** Creates `.meta` files containing:
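
As a rough illustration, the options above might be combined as in the following minimal sketch. The resolution values are placeholders, and the input/output dataset arguments of the script are omitted because this hunk does not show them.

```bash
# Sketch only: 480x832 is a placeholder resolution, and the dataset
# input/output arguments of preprocess_resize.py are omitted here.
python dfm/src/automodel/utils/data/preprocess_resize.py \
  --height 480 --width 832 \
  --center-crop \
  --device cuda
# Add --stochastic only if the occasional flare noted above is acceptable.
```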
@@ -199,7 +199,7 @@ flow_matching: # Flow-matching training settings
   timestep_sampling: "uniform" # Strategy for sampling timesteps
   flow_shift: 3.0 # Scalar shift applied to the target flow

-fsdp: # Distributed training (e.g., FSDP) configuration
+fsdp: # Distributed training (for example, FSDP) configuration
   dp_size: 8 # Total data-parallel replicas (single node: 8 GPUs)

 checkpoint: # Checkpointing behavior
@@ -253,8 +253,8 @@ fsdp: # Overrides for multi-node runs

 | Model | Parameters | Parallelization | Status |
 |-------|------------|-----------------|--------|
-| Wan 2.1 T2V 1.3B | 1.3B | FSDP2 via Automodel + DDP | ✅ |
-| Wan 2.1 T2V 14B | 14B | FSDP2 via Automodel + DDP | ✅ |
+| Wan 2.1 T2V 1.3B | 1.3B | FSDP2 using Automodel + DDP | ✅ |
+| Wan 2.1 T2V 14B | 14B | FSDP2 using Automodel + DDP | ✅ |
 | FLUX | TBD | TBD | 🔄 In Progress |

 ---

docs/tutorials/text-to-video-training.md

Lines changed: 8 additions & 10 deletions
@@ -11,7 +11,7 @@ content_type: "tutorial"

 # Text-to-Video Training

-Comprehensive guide for training large-scale text-to-video generation models using WAN 2.1 architecture. This approach uses Megatron-Core and Megatron-Bridge for scalable training with advanced parallelism strategies (data, tensor, sequence, and context parallelism) and optimized kernels (e.g., Transformer Engine fused attention).
+Comprehensive guide for training large-scale text-to-video generation models using WAN 2.1 architecture. This approach uses Megatron-Core and Megatron-Bridge for scalable training with advanced parallelism strategies (data, tensor, sequence, and context parallelism) and optimized kernels (for example, Transformer Engine fused attention).

 **Use case**: Train production-scale text-to-video models with full control over distributed training parallelism.

@@ -54,15 +54,15 @@ uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node
 # 4) Use Energon to process shards and create its metadata/spec
 energon prepare "${DATASET_PATH}"
 # In the interactive prompts:
-# - Enter a train/val/test split, e.g., "8,1,1"
+# - Enter a train/val/test split, for example, "8,1,1"
 # - When asked for the sample type, choose: "Crude sample (plain dict for cooking)"
 ```

 What gets produced:
 - Each shard contains:
   - pth: contain WAN video latents
   - pickle: contain text embeddings
-  - json: contain useful side-info (text caption, sizes, processing choices, etc.)
+  - json: contain useful side-info (text caption, sizes, processing choices, and so on)
 - Energon writes a `.nv-meta` directory with dataset info and a `dataset.yaml` you can version/control.

 You're ready to launch training. In the training config, we will point the WAN config (or CLI overrides) to the processed data output directory as `dataset.path=${DATASET_PATH}`.
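
As a concrete illustration of the steps above, the following minimal sketch assumes `DATASET_PATH` points at the directory of processed WebDataset shards; the path itself is hypothetical, and only `energon prepare`, the `.nv-meta` output, and the `dataset.path` override come from the tutorial text.

```bash
# Hypothetical shard location; substitute your own processed-data directory.
export DATASET_PATH=/data/wan_webdataset

# Build Energon's metadata/spec interactively (split, sample type), as described above.
energon prepare "${DATASET_PATH}"

# Energon writes its dataset info here; keep it under version control if you like.
ls "${DATASET_PATH}/.nv-meta"

# At training time, point the WAN config (or a CLI override) at the same directory:
#   dataset.path=${DATASET_PATH}
```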
@@ -71,9 +71,7 @@ You're ready to launch training. In the training config, we will point the WAN c

 ## Build Container

-Please follow the instructions in the container section of the main README:
-
-- DFM container guide: https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container
+Follow the instructions in the [container section](https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container) of the main README.

 ---

@@ -87,13 +85,13 @@ Multiple parallelism techniques including tensor, sequence, and context parallel

 Wan training is driven by `examples/megatron/recipes/wan/pretrain_wan.py`, which supports both a YAML config file and CLI overrides.

-The script exposes a `--training-mode` with `pretrain` and `finetune` presets for flow-matching hyperparameters as a starting point for experiments. This presets specify that pretraining uses noisier, biased sampling (e.g., logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (e.g., uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality.
+The script exposes a `--training-mode` with `pretrain` and `finetune` presets for flow-matching hyperparameters as a starting point for experiments. This presets specify that pretraining uses noisier, biased sampling (for example, logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (for example, uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality.

 **Notes**: If you use `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY`.

 ### Pretraining Script Example

-We provide example scripts for running 1.3B and 14B model sizes on mock dataset (see `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf`). From these starting points, users can set their own configuration by copy one of the example override configs and update it with your settings (e.g., with actual processed data path, and specific configurations based on available hardware, etc.). Users can learn more about arguments detail at [Megatron-Bridge docs](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docs/megatron-lm-to-megatron-bridge.md).
+We provide example scripts for running 1.3B and 14B model sizes on mock dataset (see `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf`). From these starting points, users can set their own configuration by copy one of the example override configs and update it with your settings (for example, with actual processed data path, and specific configurations based on available hardware, and so on). Users can learn more about arguments detail at [Megatron-Bridge docs](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docs/megatron-lm-to-megatron-bridge.md).

 ```bash
 cp examples/megatron/recipes/wan/conf/wan_1_3B.yaml examples/megatron/recipes/wan/conf/my_wan.yaml
@@ -141,7 +139,7 @@ uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node
   --mock
 ```

-You may adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing behavior (`number_packed_samples`) in `WanMockDataModuleConfig` (see `dfm/src/megatron/recipes/wan/wan.py`) to simulate different data scenarios.
+You can adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing behavior (`number_packed_samples`) in `WanMockDataModuleConfig` (see `dfm/src/megatron/recipes/wan/wan.py`) to simulate different data scenarios.

 ---
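
For context, the pieces shown in the hunks above might be combined into a launch like the following minimal sketch. It assumes a single node with 8 GPUs and uses only flags named in this diff (`--training-mode`, `--mock`); how the copied `my_wan.yaml` is supplied to `pretrain_wan.py` is not shown in these hunks, so it is left out here.

```bash
# Minimal sketch, not the tutorial's exact command. Assumes 8 GPUs on one node;
# only flags named in the surrounding text are used, and the mechanism for
# passing my_wan.yaml to the script is intentionally omitted.
cp examples/megatron/recipes/wan/conf/wan_1_3B.yaml examples/megatron/recipes/wan/conf/my_wan.yaml

uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node 8 \
  examples/megatron/recipes/wan/pretrain_wan.py \
  --training-mode pretrain \
  --mock
```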

@@ -178,7 +176,7 @@ The table below shows current parallelism support for different model sizes:

 ## References

-Wan Team. (2025). Wan: Open and advanced large-scale video generative models (WAN 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/
+Wan Team. (2025). [Wan: Open and advanced large-scale video generative models (WAN 2.1)](https://github.com/Wan-Video/Wan2.1/). GitHub.

 ---

docs/tutorials/training-from-scratch.md

Lines changed: 4 additions & 4 deletions
@@ -23,7 +23,7 @@ For a quick start guide, see [Megatron Workflow](../get-started/megatron.md). Th

 ## Dataset Preparation

-This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon efficiently supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (e.g., text-image, text-video). Set `dataset.path` to your WebDataset location or shard pattern. See the Megatron-Energon documentation for format details and advanced options.
+This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon efficiently supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (for example, text-image, text-video). Set `dataset.path` to your WebDataset location or shard pattern. See the Megatron-Energon documentation for format details and advanced options.

 ### Dataset Preparation Example

@@ -98,13 +98,13 @@ Done

 ## Build Container

-Please follow the instructions in the [container](https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container) section of the main README.
+Follow the instructions in the [container](https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container) section of the main README.

 ---

 ## Pretraining

-Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
+After you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.

 For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details you can look at [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documentation.

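To keep the packing-related settings above in one place, here is a minimal sketch. The key names and the two required values (`micro_batch_size=1`, `qkv_format=thd`) come from the text; the `key=value` override style mirrors `dataset.path` elsewhere in these tutorials, and any section prefixes plus the example values for the Energon knobs are assumptions to adjust for your recipe config.

```bash
# Sketch only: key names and required values are from the tutorial text;
# prefixes/grouping and the two example Energon values are assumptions.
PACKING_OVERRIDES=(
  micro_batch_size=1              # required when sequence packing is enabled
  qkv_format=thd                  # tells Transformer Engine that packing is enabled
  task_encoder_seq_length=8192    # max packed sequence length passed to the model (example value)
  packing_buffer_size=100         # samples buffered when forming packs (example value)
)
# Append "${PACKING_OVERRIDES[@]}" to your training launch command's overrides.
```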
@@ -172,7 +172,7 @@ uv run --group megatron-bridge python -m torch.distributed.run \

 ## Inference

-Once training completes, you can run inference using [inference_dit_model.py](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py). The script requires your trained model checkpoint (`--checkpoint_path`) and a path to save generated videos (`--video_save_path`). You can pass two optional arguments, `--t5_cache_dir` and `--tokenizer_cache_dir`, to avoid re-downloading artifacts if they are already downloaded.
+After training completes, you can run inference using [inference_dit_model.py](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py). The script requires your trained model checkpoint (`--checkpoint_path`) and a path to save generated videos (`--video_save_path`). You can pass two optional arguments, `--t5_cache_dir` and `--tokenizer_cache_dir`, to avoid re-downloading artifacts if they are already downloaded.

 ```bash
 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
