Commit 1d8891e

Authored by huvunvidia (Huy Vu2) and abhinavg4
Report Mcore vs AutoModel (#72)
* report for public version
* fix image size
* Update report.md for Wan 2.1 convergence comparison, correcting formatting and ensuring clarity in experiment overview and caveats regarding training loss fluctuations between Diffusers and Megatron-Core implementations.

---------

Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
1 parent b867706 commit 1d8891e

File tree

3 files changed: +62 -0 lines changed
Two binary image files added (193 KB and 223 KB): the training-curve plots referenced in report.md below.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# Wan 2.1 – Partial Convergence Comparison

### Diffusers (Automodel path) vs. Megatron-Core (Megatron-Bridge path)

---

## 1. Experiment Overview

- Goal: Compare two training paths for Wan 2.1:

  **(1) [Diffusers](https://huggingface.co/docs/diffusers/en/index) implementation + [Automodel](https://github.com/NVIDIA-NeMo/Automodel/tree/diffusion) training path** vs. **(2) [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) implementation + [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) training path**

- Two-stage training:
  - **Stage 1:** Text → Image - learn to connect textual embeddings with visual concepts.
  - **Stage 2:** Text → Video - learn visual motion that aligns with the prompts.

- Dataset: 3,000 videos; frames extracted from these videos are used for the text-to-image training stage.
## 2. Dataset

### Stage 1 (Text-to-Image)

- Extract 40 frames per video → **120k images** (a frame-extraction sketch follows this list)
- Resolution: **240 × 416**
- Each frame uses the same caption as its parent video.
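
The report does not specify how the frames were extracted, so the following is only a minimal sketch of the data-prep step described above, assuming OpenCV and hypothetical `videos/`, `captions.json`, and `frames/` paths: 40 evenly spaced frames are sampled from each of the 3,000 videos (≈120k images), resized to 240 × 416, and paired with the parent video's caption.

```python
# Illustrative sketch only: the actual extraction tool, file layout, and caption
# source used for the experiment are not described in the report.
import json
from pathlib import Path

import cv2  # pip install opencv-python

FRAMES_PER_VIDEO = 40
TARGET_SIZE = (416, 240)  # (width, height) -> 240 x 416 frames


def extract_frames(video_path: Path, out_dir: Path, caption: str) -> None:
    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the clip.
    indices = [int(i * total / FRAMES_PER_VIDEO) for i in range(FRAMES_PER_VIDEO)]
    for j, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.resize(frame, TARGET_SIZE)
        stem = f"{video_path.stem}_{j:03d}"
        cv2.imwrite(str(out_dir / f"{stem}.jpg"), frame)
        # Every extracted frame inherits the caption of its parent video.
        (out_dir / f"{stem}.txt").write_text(caption)
    cap.release()


if __name__ == "__main__":
    captions = json.loads(Path("captions.json").read_text())  # {video_name: caption}
    out_dir = Path("frames")
    out_dir.mkdir(exist_ok=True)
    for video in sorted(Path("videos").glob("*.mp4")):
        extract_frames(video, out_dir, captions[video.name])
```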

### Stage 2 (Text-to-Video)

- Full videos → **3,000 videos**
- Resolution: **240 × 416**, duration 4–8 seconds.

**Note**: This experiment is a partial convergence test and only demonstrates the model's ability to reconstruct images and videos from the input prompts. With only 3,000 videos, the model cannot generalize to generate novel content; such generalization requires a larger training dataset and more training resources.

## 3. Training Setup

### Stage 1

- Global batch size: 2560 images
- Learning rate: 10k-step warmup → 5e-5 constant (a schedule sketch follows the table below)
- Hardware: 10 nodes (80 GPUs)

| Path | Parallelism | Notes |
|------|-------------|-------|
| Megatron-Core | TP=1, PP=1, CP=1 | Sequence packing (32 samples/pack) |
| Automodel | FSDP | micro_batch_size = 32 |
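
The report gives the schedule only as "warmup 10k → 5e-5 constant"; the linear-from-zero warmup shape below is an assumption, so treat this as a sketch of the described schedule rather than the exact scheduler used by either path.

```python
# Sketch of the Stage 1 learning-rate schedule: linear warmup over the first
# 10k steps (assumed shape), then constant at the 5e-5 peak.
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 10_000


def lr_at_step(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step / WARMUP_STEPS)
    return PEAK_LR


assert math.isclose(lr_at_step(5_000), 2.5e-5)  # halfway through warmup
assert lr_at_step(50_000) == PEAK_LR            # constant after warmup
```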

### Stage 2

- Global batch size: 80 videos
- Learning rate: 5e-5 constant
- Hardware: 10 nodes (80 GPUs)

| Path | Parallelism | Notes |
|------|-------------|-------|
| Megatron-Core | TP=1, PP=1, CP=1 | micro_batch_size = 1 |
| Automodel | FSDP | micro_batch_size = 1 |
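
As a quick sanity check on the numbers above, the global batch sizes are consistent with plain data parallelism over all 80 GPUs, assuming one micro-batch per GPU per optimizer step (the report does not state whether gradient accumulation is used):

```python
# Relating global batch size to GPU count and per-GPU micro-batch size, assuming
# pure data parallelism and no gradient accumulation (an assumption, not stated
# in the report).
GPUS = 10 * 8  # 10 nodes x 8 GPUs = 80 GPUs

# Stage 1, Automodel path: 80 GPUs x 32 images per micro-batch = 2560 images.
assert GPUS * 32 == 2560

# Stage 2, both paths: 80 GPUs x 1 video per micro-batch = 80 videos.
assert GPUS * 1 == 80
```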

## 4. Results

#### Stage 1 — Loss vs. Steps

<img src="./medias/training_curves/lm_loss_text2image_3kvids.png" width="700">

#### Stage 2 — Loss vs. Steps

<img src="./medias/training_curves/lm_loss_text2video_3kvids.png" width="700">

**Note**: The training loss is smoothed with a 50-step moving average (see the sketch below).
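
The exact smoothing code used to produce the plots is not included in the report; a simple trailing 50-step moving average, as the note describes, looks like this:

```python
# 50-step trailing moving average used here only to illustrate the smoothing
# described in the note; the actual plotting code is not part of the report.
from collections import deque


def smooth(losses, window: int = 50):
    buf = deque(maxlen=window)
    out = []
    for loss in losses:
        buf.append(loss)
        out.append(sum(buf) / len(buf))
    return out


print([round(v, 3) for v in smooth([1.0, 0.8, 0.6, 0.4], window=2)])  # [1.0, 0.9, 0.7, 0.5]
```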

The training curves for both stages have similar value ranges, although they do not match exactly. This is expected given the differences in implementation and training-loop setup.

One important caveat: in the current Megatron-Core implementation, the same diffusion time step is applied to all samples within a pack at each training step, rather than a different time step per sample. As a result, the training loss for Megatron-Core fluctuates more than for Automodel, especially at the beginning of training.
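
To make this caveat concrete, the difference is roughly the following (a schematic sketch, not the actual Megatron-Core or Automodel sampling code):

```python
# Schematic illustration of per-sample vs. per-pack diffusion time-step sampling.
# With a shared time step, a single "hard" or "easy" noise level moves the loss
# of the whole pack, which makes the per-step training loss noisier.
import torch

num_samples_in_pack = 32
num_train_timesteps = 1000

# Per-sample: each of the 32 packed samples draws its own time step.
t_per_sample = torch.randint(0, num_train_timesteps, (num_samples_in_pack,))

# Per-pack (current Megatron-Core behavior described above): one shared time step.
t_shared = torch.randint(0, num_train_timesteps, (1,)).expand(num_samples_in_pack)

print(t_per_sample.unique().numel())  # typically many distinct values
print(t_shared.unique().numel())      # always 1
```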
