
[ROADMAP][Updated on Jan 26] Megatron Core MoE Roadmap #1729

@yanring

Description

The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.

🎉 This roadmap is based on the dev branch; please see the details in its README.


Model Support

  • ✅ DeepSeek
  • ✅ Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • ✅ (🚀 New!) Qwen3-Next
  • ✅ Mixtral

Core MoE Functionality

  • ✅ Token dropless MoE - advanced routing without token dropping
  • ✅ Top-K Router with flexible K selection
  • ✅ Load-balancing losses for even expert utilization (see the router sketch after this list)
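
The router items above can be reduced to a short sketch. Below is a minimal, illustrative top-K router with a Switch-Transformer-style auxiliary load-balancing loss; the class name, arguments, and coefficient are hypothetical and this is not the Megatron Core router implementation.

```python
import torch
import torch.nn.functional as F


class SketchTopKRouter(torch.nn.Module):
    """Minimal top-K router with an auxiliary load-balancing loss (illustrative)."""

    def __init__(self, hidden_size, num_experts, top_k=2, aux_loss_coeff=1e-2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_coeff = aux_loss_coeff

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, hidden_size]
        logits = self.gate(hidden_states)                       # [T, E]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # [T, K]

        # Aux loss: penalize the correlation between the fraction of tokens
        # routed to each expert and the mean router probability for that
        # expert, which pushes the router toward uniform expert utilization.
        dispatch_mask = F.one_hot(topk_idx, self.num_experts).sum(dim=1).float()  # [T, E]
        tokens_per_expert = dispatch_mask.mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = self.aux_loss_coeff * self.num_experts * torch.sum(
            tokens_per_expert * mean_probs)

        return topk_probs, topk_idx, aux_loss


# Usage: route 16 tokens of width 1024 across 8 experts, top-2.
router = SketchTopKRouter(hidden_size=1024, num_experts=8, top_k=2)
weights, indices, aux_loss = router(torch.randn(16, 1024))
```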

Advanced Parallelism

  • ✅ Expert Parallel (EP) with 3D parallelism integration (token dispatch sketched after this list)
  • ✅ Full parallelism combo: EP + DP + TP + PP + SP support
  • ✅ Context Parallel (CP) for long-sequence MoE training
  • ✅ Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • ✅ (🚀 New!) Megatron FSDP/HSDP with full expert parallel support
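
Expert parallelism revolves around all-to-all token exchange between EP ranks. The following is a minimal sketch of that dispatch step using plain `torch.distributed` (NCCL backend, EP process group already initialized); it is not the Megatron Core token dispatcher, which additionally handles permutation fusion, capacity, and the combine step.

```python
import torch
import torch.distributed as dist


def dispatch_to_experts(tokens, dest_rank, ep_group=None):
    """Send each token to the EP rank hosting its expert.

    tokens:    [num_local_tokens, hidden]
    dest_rank: [num_local_tokens] destination EP rank per token (int64)
    """
    ep_size = dist.get_world_size(group=ep_group)

    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens_sorted = tokens[order]

    # Exchange per-rank token counts so every rank knows its receive sizes.
    send_counts = torch.bincount(dest_rank, minlength=ep_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=ep_group)

    # Variable-size all-to-all of the token payloads.
    recv_tokens = tokens.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(
        recv_tokens, tokens_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
        group=ep_group,
    )
    # `order` is needed later to un-permute the combined outputs.
    return recv_tokens, order
```

The combine step after the expert MLPs is the mirror image: an all-to-all with the split sizes swapped, followed by un-permuting with `order`.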

Optimizations

  • ✅ Memory-efficient token permutation
  • ✅ Fine-grained recomputation (mla, moe, mlp, moe_act, norm)
  • ✅ GroupedGEMM and gradient accumulation fusion (the permute-and-grouped-GEMM pattern is sketched after this list)
  • ✅ DP/PP/TP/EP communication overlapping
  • ✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
  • ✅ cuDNN fused attention and FlashAttention integration
  • ✅ (🚀 New!) 1F1B EP A2A Overlap - hiding expert-parallel communication with the 1F1B pipeline schedule
  • ✅ (🚀 New!) Muon and layer-wise distributed optimizer
  • ✅ (🚀 New!) Pipeline-aware fine-grained activation offloading ([Dev] feat(moe): Fine-grained activation offloading #1912)
  • ✅ (🚀 New!) Production-ready CUDA Graph support for MoE
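
For the permutation and GroupedGEMM items above, the underlying pattern is: permute tokens so each expert's inputs are contiguous, run one GEMM per expert, then un-permute. The reference loop below shows that pattern in plain PyTorch; actual GroupedGEMM kernels replace the Python loop with a single fused launch, and the function and tensor names here are illustrative rather than Megatron Core APIs.

```python
import torch


def expert_mlp_reference(tokens, expert_idx, w1, w2):
    """tokens: [T, H]; expert_idx: [T]; w1: [E, H, F]; w2: [E, F, H]."""
    num_experts = w1.shape[0]

    # Permute: group tokens by expert so each expert sees a contiguous slice.
    order = torch.argsort(expert_idx)
    permuted = tokens[order]
    counts = torch.bincount(expert_idx, minlength=num_experts).tolist()

    # One GEMM pair per expert over variable-sized groups; this loop is the
    # part a GroupedGEMM kernel fuses into a single launch.
    outputs, start = [], 0
    for e, count in enumerate(counts):
        chunk = permuted[start:start + count]
        outputs.append(torch.nn.functional.gelu(chunk @ w1[e]) @ w2[e])
        start += count
    out = torch.cat(outputs)

    # Un-permute back to the original token order.
    unpermuted = torch.empty_like(out)
    unpermuted[order] = out
    return unpermuted
```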

Precision Support

  • ✅ GroupedGEMM including FP8/MXFP8 support
  • ✅ FP8 weights with BF16 optimizer states
  • ✅ Full FP8 training support (a Transformer Engine example follows this list)
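
The FP8 items above build on NVIDIA Transformer Engine. A minimal fp8_autocast example is shown below for context; the layer size and recipe settings are illustrative, and Megatron Core's actual FP8/MXFP8 integration (including FP8 parameter storage with BF16 optimizer states) is configured through its training arguments rather than hand-written autocast blocks like this.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# A single TE linear layer with BF16 parameters (requires a GPU with FP8 support).
layer = te.Linear(1024, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

# HYBRID recipe: E4M3 for forward tensors, E5M2 for gradients, delayed scaling.
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16,
                        amax_compute_algo="max")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(inp)

out.float().sum().backward()  # weight gradients are kept in higher precision
```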

Optimized Expert Parallel Communication Support

  • ✅ DeepEP support for H100 and B200
  • ✅ (🚀 New!) HybridEP for GB200

Developer Experience

  • ✅ MoE Model Zoo with pre-training best practices
  • ✅ MCore2HF converter for ecosystem compatibility in megatron-bridge
  • ✅ Distributed checkpointing support
  • ✅ Runtime upcycling support for efficient model scaling
  • ✅ Layer-wise logging for detailed monitoring

Next Release Roadmap (MCore v0.17)

Performance & Kernel Optimizations

Long Context & Context Parallel

Model & Architecture

Advanced Functionality

CUDA Graph Enhancements

Ongoing Long-term Features


v0.16 Update Highlights

Performance & Memory

CUDA Graph

Model & Parallelism

Fine-grained Activation Offloading Enhancement

Megatron-FSDP

Communication

Optimizer

Critical Bug Fixes


Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides
  • Bug fixes

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
