Feature Request: Video Generation Model Support (Wan, DiT architectures) #2796

@sbhavani

Summary

Request support for video generation models in Megatron Core, specifically Diffusion Transformer (DiT) based architectures like Alibaba's Wan, with optimizations from Transformer Engine including custom FP8 attention kernels.

Motivation

Video generation is a rapidly growing area with models like Wan 2.1, CogVideoX, and OpenSora demonstrating impressive results. However:

  • Incomplete DiT architecture support — Megatron Core has some DiT foundations (conditional embedder gradient sync, bidirectional attention) but lacks key components like adaLN-Zero conditioning
  • Missing video-specific optimizations — Video generation requires handling 3D spatiotemporal tensors with specialized 3D patch embeddings
  • Transformer Engine FP8 untapped — TE has custom FP8 attention kernels (fp8_dpa, fp8_mha) that could significantly accelerate video DiT training but aren't integrated for this use case
  • No reference implementation — While infrastructure exists, there's no end-to-end video generation model example
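
To make the sequence-length pressure behind the video-specific points above concrete, here is a rough sketch of how many tokens a video DiT attends over after non-overlapping 3D patchification. The clip dimensions and the (1, 2, 2) patch size are illustrative assumptions, not values taken from Megatron Core or from any particular model config, and `video_token_count` is a hypothetical helper.

```python
from math import prod


def video_token_count(frames, height, width, patch=(1, 2, 2)):
    """Number of tokens after non-overlapping 3D patchification.

    frames/height/width are the latent-video dimensions seen by the DiT
    (the values used below are hypothetical); patch is (p_t, p_h, p_w).
    """
    dims = (frames, height, width)
    for d, p in zip(dims, patch):
        if d % p != 0:
            raise ValueError(f"dimension {d} is not divisible by patch size {p}")
    # One token per (p_t x p_h x p_w) block of the latent volume.
    return prod(d // p for d, p in zip(dims, patch))


# Hypothetical latent clip: 21 frames of 60 x 104 latents.
tokens = video_token_count(21, 60, 104)

# Full self-attention scores every query against every key, per head.
attention_entries = tokens * tokens
```

With these assumed dimensions the clip becomes 32,760 tokens, so each attention head scores roughly 1.07 billion query-key pairs. That quadratic growth is why attention-side optimizations are a natural fit for video DiT training.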

Current State

Megatron Core has relevant foundations but no video generation support:

  • MIMO framework (megatron/core/models/mimo/) — Experimental multimodal with video understanding (Video-LLaVA), not generation
  • Transformer Engine integration — TEDotProductAttention with FP8 support, but not adapted for DiT architectures
  • Energon dataloader — Video decoding support exists for understanding tasks

Ask

  1. Adaptive LayerNorm (adaLN-Zero) — Implement the core DiT conditioning

  2. 3D patch embedding for video — Extend vision patch embedding to handle temporal dimension

  3. Wan 2.1 reference implementation — End-to-end example with:

    • Text-to-video generation
    • Image-to-video generation
    • Training scripts with distributed parallelism
  4. Transformer Engine FP8 optimization for DiT — Validate and document FP8 training:

    • fp8_dpa for video self-attention
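
For reference, item 1 could look roughly like the sketch below. It is a minimal NumPy illustration of adaLN-Zero conditioning, not a proposed Megatron Core API: `AdaLNZeroBlock`, its parameter names, and the single random linear map standing in for the attention/MLP sublayer are all assumptions made for the example. A full DiT block would typically predict a separate shift, scale, and gate for each of its two sublayers; this sketch shows only one.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis with no learned scale or bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def silu(x):
    return x / (1.0 + np.exp(-x))


class AdaLNZeroBlock:
    """Toy residual block with adaLN-Zero conditioning (illustrative only)."""

    def __init__(self, hidden, rng):
        # Conditioning projection is zero-initialized ("Zero"), so shift,
        # scale, and gate all start at 0 and the block starts as identity.
        self.w_mod = np.zeros((hidden, 3 * hidden))
        self.b_mod = np.zeros(3 * hidden)
        # Hypothetical stand-in for the attention or MLP sublayer.
        self.w_sub = rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)

    def __call__(self, x, cond):
        # x: (batch, tokens, hidden); cond: (batch, hidden),
        # e.g. a combined timestep and text embedding.
        mod = silu(cond) @ self.w_mod + self.b_mod
        shift, scale, gate = np.split(mod, 3, axis=-1)
        # Modulate the normalized activations per sample.
        h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        h = h @ self.w_sub
        # Gated residual: a zero gate means the sublayer contributes nothing.
        return x + gate[:, None, :] * h


rng = np.random.default_rng(0)
block = AdaLNZeroBlock(hidden=8, rng=rng)
x = rng.standard_normal((2, 5, 8))
cond = rng.standard_normal((2, 8))
out = block(x, cond)
```

Because the modulation projection starts at zero, `out` equals `x` at initialization. The conditioning only begins to influence the block once `w_mod` and `b_mod` receive gradient updates.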
