# Wan 2.1 – Partial Convergence Comparison
### Diffusers (Automodel path) vs. Megatron-Core (Megatron-Bridge path)

---

## 1. Experiment Overview
- Goal: Compare two training paths for Wan 2.1:
  **(1) [Diffusers](https://huggingface.co/docs/diffusers/en/index) implementation + [Automodel](https://github.com/NVIDIA-NeMo/Automodel/tree/diffusion) training path** vs. **(2) [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) implementation + [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) training path**
- Two-stage training:
  - **Stage 1:** Text → Image - Learn to connect textual embeddings with visual concepts.
  - **Stage 2:** Text → Video - Learn visual motion that aligns with the prompts.
- Dataset: 3,000 videos; frames extracted from the videos are used for the text-to-image training stage.


## 2. Dataset

### Stage 1 (Text-to-Image)
- Extract 40 frames per video → **120k images**
- Resolution: **240 × 416**
- Each frame uses the same caption as its parent video (see the data-prep sketch below).
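
The data-prep step is not spelled out here, so the snippet below is only a minimal sketch of one way to do it: evenly spaced frame sampling with `decord`, a 416 × 240 output size, and caption side-car files. The library choice, file layout, and function name are assumptions, not the actual pipeline.

```python
# Hypothetical Stage 1 data prep: sample 40 evenly spaced frames per video
# and reuse the parent video's caption for every extracted frame.
import os
import numpy as np
from decord import VideoReader  # any video decoder works; decord is one option
from PIL import Image

def extract_frames(video_path: str, caption: str, out_dir: str, num_frames: int = 40):
    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)  # evenly spaced frames
    for i, idx in enumerate(indices):
        frame = Image.fromarray(vr[int(idx)].asnumpy())
        frame = frame.resize((416, 240))  # width x height; assumes 240 is the short side
        frame.save(os.path.join(out_dir, f"frame_{i:03d}.png"))
        # Same caption as the parent video, stored next to the frame.
        with open(os.path.join(out_dir, f"frame_{i:03d}.txt"), "w") as f:
            f.write(caption)
```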

### Stage 2 (Text-to-Video)
- Full videos → **3,000 videos**
- Resolution: **240 × 416**, duration 4–8 seconds.

**Note**: This experiment is a partial convergence test and only demonstrates the model's ability to reconstruct images and videos from the input prompts. With only 3,000 videos, the model cannot generalize to generate novel content. Such generalization can be achieved with a larger training dataset and increased training resources.

## 3. Training Setup

### Stage 1
- Global batch size: 2560 images
- Learning rate: 10k-step warmup → 5e-5 constant (see the schedule sketch below)
- Hardware: 10 nodes (80 GPUs)

| Path | Parallelism | Notes |
|------|-------------|-------|
| Megatron-Core | TP=1, PP=1, CP=1 | Sequence packing (32 samples/pack) |
| Automodel | FSDP | micro_batch_size = 32 |
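
The warmup shape is not specified above, so the following is a minimal sketch of a linear-warmup-then-constant schedule matching the "10k-step warmup → 5e-5 constant" setting; the function name and the linear-warmup assumption are ours.

```python
def lr_at_step(step: int, peak_lr: float = 5e-5, warmup_steps: int = 10_000) -> float:
    """Warm up linearly to peak_lr over warmup_steps, then hold it constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# lr_at_step(5_000)  -> 2.5e-05 (mid-warmup)
# lr_at_step(50_000) -> 5e-05   (constant phase)
```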

### Stage 2
- Global batch size: 80 videos
- Learning rate: 5e-5 constant
- Hardware: 10 nodes (80 GPUs)

| Path | Parallelism | Notes |
|------|-------------|-------|
| Megatron-Core | TP=1, PP=1, CP=1 | micro_batch_size = 1 |
| Automodel | FSDP | micro_batch_size = 1 |
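
Both stages run with TP=PP=CP=1 on 80 GPUs, so the data-parallel size is 80. The small check below (helper name is ours) confirms that the global batch sizes above are consistent with one micro-batch, or one 32-sample pack, per GPU per optimizer step.

```python
def grad_accum_steps(global_batch: int, micro_batch: int, data_parallel: int) -> int:
    """Micro-batches each data-parallel rank accumulates per optimizer step."""
    assert global_batch % (micro_batch * data_parallel) == 0
    return global_batch // (micro_batch * data_parallel)

print(grad_accum_steps(2560, 32, 80))  # Stage 1: 1 (one 32-sample micro-batch/pack per GPU)
print(grad_accum_steps(80, 1, 80))     # Stage 2: 1 (one video per GPU per step)
```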


## 4. Results
#### Stage 1 — Loss vs. Steps
<img src="./medias/training_curves/lm_loss_text2image_3kvids.png" width="700">

#### Stage 2 — Loss vs. Steps
<img src="./medias/training_curves/lm_loss_text2video_3kvids.png" width="700">

**Note**: Training loss is smoothed by averaging over 50 steps.
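
The exact smoothing used for the plots is not specified; a simple trailing moving average like the one below reproduces the same kind of 50-step averaging.

```python
import numpy as np

def smooth(loss_per_step: np.ndarray, window: int = 50) -> np.ndarray:
    """Moving average of the raw loss over `window` consecutive steps."""
    kernel = np.ones(window) / window
    return np.convolve(loss_per_step, kernel, mode="valid")
```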


The training curves for both stages fall within similar value ranges, although they do not match exactly. This is expected given differences in the implementations and training-loop setups.

One important caveat: in the current Megatron-Core implementation, the same diffusion timestep is applied to all samples within a pack at each step, rather than a separate timestep per sample. As a result, the Megatron-Core training loss fluctuates more than the Automodel loss, especially at the beginning of training.
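
The contrast can be illustrated with the minimal sketch below, assuming a 1,000-step diffusion schedule and the 32-sample packs from Stage 1; variable names are illustrative and this is not the actual code of either path.

```python
import torch

samples_per_pack = 32       # Stage 1 packing size from the table above
num_train_timesteps = 1000  # assumed diffusion schedule length

# Automodel / Diffusers-style: every sample draws its own timestep, so each
# step's loss averages over many noise levels and is comparatively smooth.
t_per_sample = torch.randint(0, num_train_timesteps, (samples_per_pack,))

# Current Megatron-Core packing path: one timestep is drawn per pack and shared
# by all samples in it, so each step's loss reflects a single noise level and
# fluctuates more, especially early in training.
t_per_pack = torch.randint(0, num_train_timesteps, (1,)).expand(samples_per_pack)
```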