---
description: "Comprehensive guide for fine-tuning pretrained models with automatic parallelism and advanced configuration options"
categories: ["tutorials", "automodel"]
tags: ["training", "recipe", "automodel", "wan", "advanced"]
personas: ["mle-focused", "data-scientist-focused"]
difficulty: "intermediate"
content_type: "tutorial"
---

(tutorial-fine-tuning-pretrained-models)=

# Fine-Tuning Pretrained Models

Comprehensive guide for fine-tuning pretrained diffusion models with automatic parallelism and distributed training support. This approach uses the NeMo Automodel backend, which handles parallelism automatically, making it ideal for quick prototyping and fine-tuning workflows.

**Currently Supported:** WAN 2.1 Text-to-Video (1.3B and 14B models)

:::{note}
For a quick start guide, see [Automodel Workflow](../get-started/automodel.md). This tutorial provides detailed configuration options and advanced topics.
:::

---

## Quick Start

### 1. Docker Setup

```bash
# Build image
docker build -f docker/Dockerfile.ci -t dfm-training .

# Run container
docker run --gpus all -it \
  -v $(pwd):/workspace \
  -v /path/to/data:/data \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  dfm-training bash

# Inside container: Initialize submodules
export UV_PROJECT_ENVIRONMENT=
git submodule update --init --recursive 3rdparty/
```
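
Before moving on to data preparation, it can help to confirm that the GPUs are actually visible inside the container. A minimal check, assuming PyTorch is installed in the training image:

```python
import torch

# Expect True and the number of GPUs passed through via --gpus all.
print(torch.cuda.is_available(), torch.cuda.device_count())
```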

### 2. Prepare Data

We provide two ways to prepare your dataset:

- Start with raw videos: Place your `.mp4` files in a folder and use our data-preparation scripts to scan the videos and generate a `meta.json` entry for each sample (including `width`, `height`, `start_frame`, `end_frame`, and a caption). If you already have captions, you can also include a per-video `<video>.jsonl` file; the scripts will pick up the text automatically. The final dataset layout is shown below.
- Bring your own `meta.json`: If you already have annotations, create `meta.json` yourself following the schema shown below (a minimal builder sketch follows the schema).

**Create video dataset:**

The following example uses two video files purely for demonstration; real training datasets will contain many more.

```
<your_video_folder>/
├── video1.mp4
├── video2.mp4
└── meta.json
```

**meta.json format:**

```json
[
  {
    "file_name": "video1.mp4",
    "width": 1280,
    "height": 720,
    "start_frame": 0,
    "end_frame": 121,
    "vila_caption": "A detailed description of the video1.mp4 contents..."
  },
  {
    "file_name": "video2.mp4",
    "width": 1280,
    "height": 720,
    "start_frame": 0,
    "end_frame": 12,
    "vila_caption": "A detailed description of the video2.mp4 contents..."
  }
]
```
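
The repository's data-preparation scripts are the supported way to generate `meta.json` from raw videos. If you prefer to build it yourself, the following is only a minimal sketch of the same idea; it assumes OpenCV is available and that each optional per-video `<video>.jsonl` stores its text under a `caption` key (a hypothetical layout used here for illustration).

```python
import json
from pathlib import Path

import cv2


def build_meta(video_folder: str) -> None:
    entries = []
    for video in sorted(Path(video_folder).glob("*.mp4")):
        # Read basic video properties for the meta.json entry.
        cap = cv2.VideoCapture(str(video))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()

        # Optional per-video caption file; the "caption" field name is an assumption.
        caption = ""
        caption_file = video.with_suffix(".jsonl")
        if caption_file.exists():
            first_line = caption_file.read_text().splitlines()[0]
            caption = json.loads(first_line).get("caption", "")

        entries.append(
            {
                "file_name": video.name,
                "width": width,
                "height": height,
                "start_frame": 0,
                "end_frame": frame_count,
                "vila_caption": caption,
            }
        )

    with open(Path(video_folder) / "meta.json", "w") as f:
        json.dump(entries, f, indent=2)


build_meta("<your_video_folder>")
```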

**Preprocess videos to .meta files:**

The preprocessing script converts each source video into a single `.meta` file that preserves the full temporal sequence as latents. Training can sample temporal windows/clips from the sequence on the fly.

```bash
python dfm/src/automodel/utils/data/preprocess_resize.py \
  --video_folder <your_video_folder> \
  --output_folder ./processed_meta \
  --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --height 480 \
  --width 720 \
  --center-crop
```

**Key arguments:**
- `--video_folder`: Path to folder containing videos and `meta.json`
- `--output_folder`: Path where `.meta` files will be saved
- `--model`: Wan2.1 model ID (default: `Wan-AI/Wan2.1-T2V-14B-Diffusers`)
- `--height/--width`: Target resolution (both must be specified together)
- `--center-crop`: Crop to exact size after aspect-preserving resize
- `--device`: Device to use (`cuda` or `cpu`, default: `cuda` if available)
- `--stochastic`: Use stochastic encoding instead of deterministic (may cause flares)
- `--no-memory-optimization`: Disable Wan's built-in memory optimization

**Output:** Creates `.meta` files containing:
- Encoded video latents (normalized)
- Text embeddings (from UMT5)
- First frame as JPEG
- Metadata
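
As an optional sanity check before training (not part of the repository scripts), you can confirm that one output file was produced per `meta.json` entry; this assumes the outputs use the `.meta` extension described above and the folders from the preprocessing command.

```python
import json
from pathlib import Path

# Hypothetical paths; substitute your own folders.
entries = json.loads(Path("<your_video_folder>/meta.json").read_text())
produced = list(Path("./processed_meta").glob("*.meta"))

print(f"{len(produced)} .meta files for {len(entries)} meta.json entries")
```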

### 3. Train

**Single-node (8 GPUs):**
```bash
export UV_PROJECT_ENVIRONMENT=

uv run --group automodel --with . \
  torchrun --nproc-per-node=8 \
  examples/automodel/finetune/finetune.py \
  -c examples/automodel/finetune/wan2_1_t2v_flow.yaml
```

**Multi-node with SLURM:**
```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node 1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export NUM_GPUS=8

# Per-rank UV cache to avoid conflicts
unset UV_PROJECT_ENVIRONMENT
mkdir -p /opt/uv_cache/${SLURM_JOB_ID}_${SLURM_PROCID}
export UV_CACHE_DIR=/opt/uv_cache/${SLURM_JOB_ID}_${SLURM_PROCID}

uv run --group automodel --with . \
  torchrun \
  --nnodes=$SLURM_NNODES \
  --nproc-per-node=$NUM_GPUS \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  examples/automodel/finetune/finetune.py \
  -c examples/automodel/finetune/wan2_1_t2v_flow_multinode.yaml
```

### 4. Validate

Use this step for a quick qualitative check of a trained checkpoint. The validation script:
- Reads prompts from `.meta` files in `--meta_folder` (uses `metadata.vila_caption`; latents are ignored).
- Loads the `WanPipeline` and, if provided, restores weights from `--checkpoint` (prefers `ema_shadow.pt`, then `consolidated_model.bin`, then sharded FSDP `model/*.distcp`).
- Generates short videos for each prompt with the specified settings (`--guidance_scale`, `--num_inference_steps`, `--height/--width`, `--num_frames`, `--fps`, `--seed`) and writes them to `--output_dir`.
- Is intended for qualitative comparison across checkpoints and does not compute quantitative metrics.

```bash
uv run --group automodel --with . \
  python examples/automodel/generate/wan_validate.py \
  --meta_folder <your_meta_folder> \
  --guidance_scale 5 \
  --checkpoint ./checkpoints/step_1000 \
  --num_samples 10
```

**Note:** You can use `--checkpoint ./checkpoints/LATEST` to automatically use the most recent checkpoint.
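
If you want to resolve the most recent checkpoint yourself (for example, when scripting validation runs), a small helper like the sketch below can pick the highest-numbered `step_*` directory. This is illustrative only; it is not part of the validation script and simply assumes the `step_<N>` naming shown above.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir: str) -> Path:
    # Sort step_* directories by their numeric suffix and return the newest one.
    steps = sorted(
        Path(checkpoint_dir).glob("step_*"),
        key=lambda p: int(p.name.split("_")[-1]),
    )
    if not steps:
        raise FileNotFoundError(f"No step_* checkpoints in {checkpoint_dir}")
    return steps[-1]


print(latest_checkpoint("./checkpoints"))
```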

---

## Configuration

### Fine-tuning Config (`wan2_1_t2v_flow.yaml`)

Note: The inline configuration below is provided for quick reference. The canonical, up-to-date files are maintained in the repository: [examples/automodel/](../../examples/automodel/), [examples/automodel/finetune/wan2_1_t2v_flow.yaml](../../examples/automodel/finetune/wan2_1_t2v_flow.yaml), and [examples/automodel/finetune/wan2_1_t2v_flow_multinode.yaml](../../examples/automodel/finetune/wan2_1_t2v_flow_multinode.yaml).

```yaml
model: # Base pretrained model to fine-tune
  pretrained_model_name_or_path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers # HF repo or local path

step_scheduler: # Global training schedule
  global_batch_size: 8 # Effective batch size across all GPUs
  local_batch_size: 1 # Per-GPU batch size
  num_epochs: 10 # Number of passes over the dataset
  ckpt_every_steps: 100 # Save a checkpoint every N steps

data: # Data input configuration
  dataloader: # DataLoader parameters
    meta_folder: "<your_processed_meta_folder>" # Folder containing .meta files
    num_workers: 2 # Worker processes per rank

optim: # Optimizer/training hyperparameters
  learning_rate: 5e-6 # Base learning rate

flow_matching: # Flow-matching training settings
  timestep_sampling: "uniform" # Strategy for sampling timesteps
  flow_shift: 3.0 # Scalar shift applied to the target flow

fsdp: # Distributed training (e.g., FSDP) configuration
  dp_size: 8 # Total data-parallel replicas (single node: 8 GPUs)

checkpoint: # Checkpointing behavior
  enabled: true # Enable periodic checkpoint saving
  checkpoint_dir: "./checkpoints" # Output directory for checkpoints
```
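
The two batch-size settings are related through the data-parallel size. Under the usual convention that gradient accumulation covers any gap between the per-step and global batch sizes (an assumption about the trainer, not taken from its source), the values above imply no accumulation on a single 8-GPU node:

```python
# Assumed relationship: global_batch_size = local_batch_size * dp_size * grad_accum_steps
global_batch_size = 8
local_batch_size = 1
dp_size = 8

grad_accum_steps = global_batch_size // (local_batch_size * dp_size)
print(grad_accum_steps)  # 1 -> one micro-batch per GPU per optimizer step
```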

### Multi-node Config Differences

```yaml
fsdp: # Overrides for multi-node runs
  dp_size: 16 # Total data-parallel replicas (2 nodes × 8 GPUs)
  dp_replicate_size: 2 # Number of replicated groups across nodes
```
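
`dp_replicate_size` typically splits the data-parallel group into replicated groups with sharding inside each group (the common hybrid-sharding layout). Under that assumption, which is not verified against the Automodel source, the two-node configuration above shards parameters across the 8 GPUs of each node:

```python
# Assumption: dp_size = dp_replicate_size * dp_shard_size (hybrid sharded data parallel).
dp_size = 16           # 2 nodes x 8 GPUs
dp_replicate_size = 2  # one replica group per node

dp_shard_size = dp_size // dp_replicate_size
print(dp_shard_size)   # 8 -> each replica is sharded across one node's GPUs
```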

### Pretraining vs Fine-tuning

| Setting | Fine-tuning | Pretraining |
|---------|-------------|-------------|
| `learning_rate` | 5e-6 | 5e-5 |
| `weight_decay` | 0.01 | 0.1 |
| `flow_shift` | 3.0 | 2.5 |
| `logit_std` | 1.0 | 1.5 |
| Dataset size | 100s-1000s | 10K+ |

---

## Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU model | A100 40GB | A100 80GB / H100 |
| GPU count | 4 | 8+ |
| RAM | 128 GB | 256 GB+ |
| Storage | 500 GB SSD | 2 TB NVMe |

---

## Features

- ✅ **Flow Matching**: Pure flow matching training
- ✅ **Distributed**: FSDP2 + Tensor Parallelism
- ✅ **Mixed Precision**: BF16 by default
- ✅ **WandB**: Automatic logging
- ✅ **Checkpointing**: Consolidated and sharded formats
- ✅ **Multi-node**: SLURM and torchrun support

---

## Supported Models

| Model | Parameters | Parallelization | Status |
|-------|------------|-----------------|--------|
| Wan 2.1 T2V 1.3B | 1.3B | FSDP2 via Automodel + DDP | ✅ |
| Wan 2.1 T2V 14B | 14B | FSDP2 via Automodel + DDP | ✅ |
| FLUX | TBD | TBD | 🔄 In Progress |

---

## Advanced

**Custom parallelization:**
```yaml
fsdp:
  tp_size: 2 # Tensor parallel
  dp_size: 4 # Data parallel
```
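
When combining tensor and data parallelism, the product of the two sizes generally has to match the number of GPUs in the job. A quick pre-launch check, based on that assumption rather than on anything the config loader is known to enforce:

```python
# Assumed constraint: tp_size * dp_size must equal the total number of GPUs.
tp_size = 2
dp_size = 4
num_gpus = 8  # e.g. torchrun --nproc-per-node=8 on a single node

assert tp_size * dp_size == num_gpus, "tp_size * dp_size must cover all GPUs"
```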

**Checkpoint cleanup:**
```python
from pathlib import Path
import shutil

def cleanup_old_checkpoints(checkpoint_dir, keep_last_n=3):
    # Sort step_* directories numerically so step_1000 sorts after step_200
    # (a plain lexicographic sort would mis-order them).
    checkpoints = sorted(
        Path(checkpoint_dir).glob("step_*"),
        key=lambda p: int(p.name.split("_")[-1]),
    )
    # Remove everything except the keep_last_n most recent checkpoints.
    for old_ckpt in checkpoints[:-keep_last_n]:
        shutil.rmtree(old_ckpt)
```
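
For example, pointing it at the `checkpoint_dir` from the configuration above:

```python
# Keep only the three most recent step_* checkpoints.
cleanup_old_checkpoints("./checkpoints", keep_last_n=3)
```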

---

## Related Documentation

- [Automodel Quick Start](../get-started/automodel.md) - Quick start guide
- [Training Paradigms](../about/concepts/training-paradigms.md) - Understanding Automodel vs Megatron approaches
- [Performance Benchmarks](../reference/performance.md) - Training throughput metrics