Summary
Request support for video generation models in Megatron Core, specifically Diffusion Transformer (DiT) based architectures like Alibaba's Wan, with optimizations from Transformer Engine including custom FP8 attention kernels.
Motivation
Video generation is a rapidly growing area, with models like Wan 2.1, CogVideoX, and OpenSora demonstrating impressive results. However:
- Incomplete DiT architecture support — Megatron Core has some DiT foundations (conditional embedder gradient sync, bidirectional attention) but lacks key components like adaLN-Zero conditioning (a sketch follows this list)
- Missing video-specific optimizations — Video generation requires handling 3D spatiotemporal tensors with specialized 3D patch embeddings
- Transformer Engine FP8 untapped — TE has custom FP8 attention kernels (`fp8_dpa`, `fp8_mha`) that could significantly accelerate video DiT training but aren't integrated for this use case
- No reference implementation — While the infrastructure exists, there is no end-to-end video generation model example
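For concreteness, here is a minimal PyTorch sketch of the adaLN-Zero conditioning mentioned above, following the DiT paper's formulation. The `AdaLNZero` name and tensor shapes are illustrative, not an existing Megatron Core API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLNZero(nn.Module):
    """adaLN-Zero (DiT): regress shift/scale/gate from a conditioning vector,
    zero-initializing the projection so each block starts as the identity."""

    def __init__(self, hidden_size: int, cond_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.proj = nn.Linear(cond_size, 3 * hidden_size)
        nn.init.zeros_(self.proj.weight)  # the "Zero" in adaLN-Zero
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        # x: [batch, tokens, hidden]; cond: [batch, cond_size], e.g. timestep + pooled text
        shift, scale, gate = self.proj(F.silu(cond)).chunk(3, dim=-1)
        modulated = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # Caller applies the residual as: x + gate * sublayer(modulated)
        return modulated, gate.unsqueeze(1)
```

In a DiT block this modulation runs twice (before attention and before the MLP), so the projection regresses six vectors per block; the zero init makes each residual branch a no-op at initialization, which the DiT paper reports is important for training stability.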
Current State
Megatron Core has relevant foundations but no video generation support:
- MIMO framework (`megatron/core/models/mimo/`) — Experimental multimodal support with video understanding (Video-LLaVA), not generation
- Transformer Engine integration — `TEDotProductAttention` with FP8 support, but not adapted for DiT architectures (a usage sketch follows this list)
- Energon dataloader — Video decoding support exists for understanding tasks
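As a rough sketch of what wiring this up could look like, assuming a recent Transformer Engine release that exposes `fp8_dpa` on the `DelayedScaling` recipe and FP8-capable hardware; head counts and shapes below are arbitrary:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling FP8 recipe; fp8_dpa=True requests FP8 dot-product attention.
recipe = DelayedScaling(fp8_format=Format.HYBRID, fp8_dpa=True)

# Head count / head dim chosen arbitrarily for this sketch.
attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)

# Default qkv_format is "sbhd": [seq_len, batch, heads, head_dim].
q = torch.randn(1024, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = attn(q, k, v)
```

The same context manager would wrap the DiT attention blocks during training; validating this path on video-length token sequences is where the acceleration claim in this issue would be confirmed.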
Ask
- Adaptive LayerNorm (adaLN-Zero) — Implement the core DiT conditioning mechanism (sketched above)
- 3D patch embedding for video — Extend the vision patch embedding to handle the temporal dimension (a sketch follows this list)
- Wan 2.1 reference implementation — End-to-end example with:
  - Text-to-video generation
  - Image-to-video generation
  - Training scripts with distributed parallelism
- Transformer Engine FP8 optimization for DiT — Validate and document FP8 training: `fp8_dpa` for video self-attention
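For reference, a minimal sketch of what the 3D patchify could look like; `PatchEmbed3D` and its default sizes are hypothetical, not an existing Megatron Core module:

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Tile a [B, C, T, H, W] video latent into non-overlapping
    (t, h, w) patches and project each patch to a token embedding."""

    def __init__(self, in_channels: int = 16, hidden_size: int = 1152,
                 patch_size: tuple = (1, 2, 2)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [B, C, T, H, W] -> [B, hidden, T', H', W'] -> [B, T'*H'*W', hidden]
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)
```

The existing 2D vision patch embedding corresponds to a temporal patch size of 1, so extending the current module is likely cleaner than adding a parallel one.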
References
- Wan 2.1 — Alibaba's open video generation model
- DiT Paper — Scalable Diffusion Models with Transformers
- Will support diffusion models? #1592