The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
This roadmap is based on the dev branch; please see its README for details.
## Model Support
- ✅ DeepSeek
  - ✅ DeepSeek-V2
  - ✅ DeepSeek-V3, including MTP
  - ✅ DeepSeek-V3.2 ([dev] DeepSeek V3.2 support #2154)
- ✅ Qwen
  - ✅ Qwen2-57B-A14B
  - ✅ Qwen3-235B-A22B
  - ✅ (New!) Qwen3-Next
- ✅ Mixtral
## Core MoE Functionality
- ✅ Token dropless MoE - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Auxiliary load-balancing losses for balanced expert utilization (see the sketch after this list)
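For orientation, here is a minimal, hedged sketch of the top-K routing and auxiliary load-balancing loss described above; the function and argument names (`topk_route`, `aux_loss_coeff`, etc.) are illustrative and not the Megatron Core API.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, router_weight, top_k=2, aux_loss_coeff=1e-2):
    """Minimal top-k routing with a Switch/GShard-style load-balancing loss.

    hidden:        [num_tokens, hidden_size]
    router_weight: [hidden_size, num_experts]
    """
    logits = hidden @ router_weight                        # [tokens, experts]
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)       # flexible K selection

    # Load-balancing loss: fraction of tokens dispatched to each expert times
    # the mean routing probability for that expert, summed over experts.
    num_experts = probs.size(-1)
    dispatch = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # [tokens, experts]
    tokens_per_expert = dispatch.mean(dim=0)
    mean_probs = probs.mean(dim=0)
    aux_loss = aux_loss_coeff * num_experts * (tokens_per_expert * mean_probs).sum()
    return topk_idx, topk_probs, aux_loss
```

In a dropless MoE, every selected token is dispatched to its experts (no capacity-based dropping), and the auxiliary loss is what keeps the resulting expert loads roughly even.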
## Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration (a process-group layout sketch follows this list)
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long-sequence MoE training
- ✅ Parallel Folding - Heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (New!) Megatron FSDP/HSDP with full expert parallel support
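As referenced in the EP bullet above, here is a rough sketch of how expert-, tensor-, and data-parallel ranks can be laid out as orthogonal process groups; the mesh sizes and dimension names below are illustrative assumptions, not Megatron Core's actual parallel-state initialization.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

def build_moe_mesh():
    """Illustrative 8-GPU layout: 2-way DP x 2-way EP x 2-way TP.

    Assumes torch.distributed is already initialized (e.g. via torchrun).
    Megatron Core builds richer groups (PP, CP, expert-TP, ...) internally.
    """
    mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "ep", "tp"))
    dp_group = mesh.get_group("dp")   # gradient reduction / FSDP sharding
    ep_group = mesh.get_group("ep")   # expert all-to-all dispatch and combine
    tp_group = mesh.get_group("tp")   # tensor-parallel collectives inside experts
    return mesh, dp_group, ep_group, tp_group
```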
## Optimizations
- ✅ Memory-efficient token permutation (see the sketch after this list)
- ✅ Fine-grained recomputation (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and gradient accumulation fusion
- ✅ DP/PP/TP/EP communication overlapping
- ✅ Advanced fusions for the router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused attention and FlashAttention integration
- ✅ (New!) 1F1B EP A2A overlap - Hiding expert-parallel communication with the 1F1B pipeline schedule
- ✅ (New!) Muon and layer-wise distributed optimizer
- ✅ (New!) Pipeline-aware fine-grained activation offloading ([Dev] feat(moe): Fine-grained activation offloading #1912)
- ✅ (New!) Production-ready CUDA Graph support for MoE
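The memory-efficient token permutation above refers to regrouping tokens by their assigned expert so that all expert MLPs can run as one grouped GEMM; below is a deliberately simple, unfused sketch of the permute/unpermute step (top-1 routing, illustrative names only).

```python
import torch

def permute_tokens(tokens, expert_idx, num_experts):
    """Sort tokens so that each expert sees a contiguous slice.

    tokens:     [num_tokens, hidden]
    expert_idx: [num_tokens] expert assignment (top-1 for simplicity)
    """
    order = torch.argsort(expert_idx, stable=True)
    permuted = tokens[order]                                       # contiguous per expert
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)
    return permuted, order, tokens_per_expert

def unpermute_tokens(expert_output, order):
    """Scatter expert outputs back to the original token order."""
    output = torch.empty_like(expert_output)
    output[order] = expert_output
    return output
```

The production path fuses this permutation with probability scaling and avoids intermediate copies, which is where the memory savings come from.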
## Precision Support
- ✅ GroupedGEMM with FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states (a minimal sketch of the master-weight pattern follows this list)
- ✅ Full FP8 training support
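A minimal sketch of the "low-precision weights, higher-precision master copy" pattern behind the FP8-weights bullet, using plain PyTorch dtypes purely for illustration; real FP8 training in Megatron Core goes through Transformer Engine recipes with tensorwise or blockwise scaling, not a bare dtype cast like this.

```python
import torch

# Assumed setup: the forward/backward pass consumes an FP8 storage copy of the
# weight, while AdamW keeps its states on a BF16 master copy.
master_w = torch.nn.Parameter(torch.empty(4096, 4096, dtype=torch.bfloat16, device="cuda"))
torch.nn.init.normal_(master_w, std=0.02)
fp8_w = master_w.detach().to(torch.float8_e4m3fn)     # storage copy used by the model

optimizer = torch.optim.AdamW([master_w], lr=1e-4)

def optimizer_step(grad: torch.Tensor) -> None:
    master_w.grad = grad.to(torch.bfloat16)   # gradient applied to the BF16 master
    optimizer.step()                          # optimizer states stay in BF16
    optimizer.zero_grad(set_to_none=True)
    with torch.no_grad():
        fp8_w.copy_(master_w)                 # refresh the FP8 copy for the next step
```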
## Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200
- ✅ (New!) HybridEP for GB200
## Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore2HF converter for ecosystem compatibility in megatron-bridge
- ✅ Distributed checkpointing support
- ✅ Runtime upcycling support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
## Next Release Roadmap (MCore v0.17)
### Performance & Kernel Optimizations
- Split-K Indexer Kernels - Avoid materializing the [seqlen_q, seqlen_k] tensor with split-K kernels ([WIP Feat] Split-K Indexer Kernels #2869, draft)
- Absorbed MLA - MLA computation optimization for DSA (Add absorbed-mla & fused dsa #3044)
- A2A Overlap Refinement - Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 ([Dev](moe): Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 #2730)
- HybridEP preprocess optimization
### Long Context & Context Parallel
- Hybrid CP Part 2 - Enhanced hybrid data x context parallelism ([Dev] feat: Dynamic CP (part 2) #2000)
- THD Format E2E Support - End-to-end THD format support (Add E2E support for THD format #2924, draft)
### Model & Architecture
- Manifold Hyper Connection (mHC) - New architecture feature for improved loss convergence ([dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection (mHC) #2943, draft)
- GDN THD Support - Packed sequence support for gated delta net ([dev] feat(moe): Support packed sequence for gated delta net (GDN) #2644)
- GDN Refinement - Refine the gated delta net implementation ([dev] perf(moe): Refine gated delta net implementation #3040)
### Advanced Functionality
- Router replay support for RL training (in progress) (feat: add routing replay for Mcore #2693)
- Megatron FSDP performance optimization for MoE training
### CUDA Graph Enhancements
- MoE ECHO - Elastic Cloning for sync-free, full-CUDA-graph dropless MoE training (MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368, draft)
- Paged Stashing - Dynamic tensor support for dropless MoE with CUDA graphs (Paged Stashing #2690, draft)
- CUDA Graph + Offloading - Support CUDA graph capture with offloading modules (Support CUDA Graph capture offloading modules #2437, draft)
- Optimizer CUDA Graph - Enable CUDA graphs for the Adam optimizer (Enable Optimizer CUDA graph for ADAM optimizer #2931, draft); a basic capture/replay sketch follows this list
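For context on the items above, this is the standard PyTorch CUDA Graph capture-and-replay pattern (static buffers, warm-up on a side stream, then replaying the captured graph each step); the MoE work in these PRs layers dropless routing, offloading, and optimizer steps on top of this basic mechanism.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(16, 1024, device="cuda")

# Warm up a few iterations on a side stream before capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one iteration. Every tensor touched here must keep a fixed address
# and shape, which is exactly what dynamic token counts in dropless MoE break.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer, then launch the whole graph
# with a single CPU call, eliminating per-kernel launch overhead.
for _ in range(10):
    static_input.copy_(torch.randn(16, 1024, device="cuda"))
    graph.replay()
```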
## Ongoing Long-term Features
- E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
- Sync-free, full-iteration CUDA Graph MoE training
  - MoE ECHO for dropless MoE load balancing (MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368)
  - Paged Stashing for dynamic tensor support (Paged Stashing #2690)
- CPU overhead optimizations for Blackwell performance
- MLA optimizations
  - Absorbed MLA (Add absorbed-mla & fused dsa #3044)
  - MLA CP 2.0 - MLA CP enhancement for longer-sequence training
- THD and long context
  - THD Format E2E Support - End-to-end THD format support (Add E2E support for THD format #2924)
  - Dynamic Context Parallel for imbalanced long-sequence training
- Megatron FSDP performance optimization for MoE training
- Kernel fusions and optimizations for MoE models from TE (MoE training optimization TransformerEngine#2438)
- New architecture support
  - Manifold Hyper Connection (mHC) ([dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection (mHC) #2943)
## v0.16 Update Highlights
### Performance & Memory
- Fused Linear and Cross Entropy - Fuse lm_head and cross entropy to avoid materializing intermediate logits, reducing memory ([Dev] Feature: linear cross entropy fusion #2256); a chunked sketch of the idea follows this list
- Optimizer State Offloading - Offload optimizer states and master weights to CPU for significant GPU memory savings ([Dev] [Reapply] Optimizer State and Master Weight Offloading #2987)
- MTP Standalone Stages - Support placing MTP layers in standalone pipeline stages for better VPP balance ([Dev] feat(moe): Support placing MTP layers into standalone stages #1916)
- DeepSeek V3.2 Support - Performance optimizations for DeepSeek V3.2 ([dev] DeepSeek V3.2 support #2154)
- DeepSeek-V3 pre-training performance guide (~960 TFLOPS/GPU on 256 GB200 GPUs) ([Dev] A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200 #1996)
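The fused linear + cross-entropy item above avoids materializing the full [tokens, vocab] logits tensor; a rough way to see the memory effect is to chunk the tokens so only a slice of logits exists at any time. This is only a sketch of the idea with illustrative names; the PR implements a proper fused kernel, and under plain autograd each chunk's logits are still saved for backward.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss(hidden, lm_head_weight, labels, chunk_size=4096):
    """Cross-entropy over the vocabulary without a full [tokens, vocab] logits tensor.

    hidden:         [num_tokens, hidden_size]
    lm_head_weight: [vocab_size, hidden_size]
    labels:         [num_tokens]
    """
    total_loss = hidden.new_zeros((), dtype=torch.float32)
    num_tokens = hidden.size(0)
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = F.linear(h, lm_head_weight)          # only [chunk, vocab] is live here
        total_loss = total_loss + F.cross_entropy(logits.float(), y, reduction="sum")
    return total_loss / num_tokens
```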
### CUDA Graph
- cuda_graph_scope refactoring ([Dev] feat(MoE): Refactor cuda_graph_scope - part2 #2353, [Dev] TE cudagraph recompute #2694)
- Partial CUDA Graph for EP Overlap - Release CPU pressure within a selected scope for EP A2A overlap ([Dev](Reapply) Partial CUDA Graph support for EP Overlap #2810)
- TE cudagraph input memory optimization - Reuse the static input memory buffer across microbatches ([Dev] Optimize TE cudagraph input memory #2391)
- CUDA Graph compatibility with FP8 params (tensorwise & blockwise) ([DEV] Make CUDA graph compatible with FP8 params (tensorwise & blockwise) #2087)
- NVFP4 MoE CUDA Graph support with 128 zero padding ([DEV][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support #2654)
### Model & Parallelism
- Qwen3-Next Enhancements - QK layernorm weight decay support and Gated Delta Net CP for long context ([dev] feat(moe): Support apply wd to qk layernorm for Qwen3-Next #2825, [Dev] Feat(moe): Gated delta net context parallel (CP) #2614)
- Hybrid Data x Context Parallelism - New parallelism strategy combining DP and CP ([Dev] Hybrid Data x Context Parallelism Feature #2054)
- Router Replay - Deterministic routing mechanism for debugging and RL training (feat: add routing replay for Mcore #2693); see the record/replay sketch after this list
- Fake Distributed Process Group - Skip all distributed communication ops with `--fake-process-group` for profiling ([DEV] Add support of fake distributed process group #2254)
- Remove padding tokens from the MoE routing loss - Improve aux-loss correctness and efficiency ([Dev] Remove calculation of padding token in moe routing loss #2121)
- Context parallel support for the eager attention implementation ([Community][Dev] feat(moe): Adding context parallel support to eager attention implementation #1859)
- Packed sequence support in the MTP module ([Dev] Support packed seq in MTP #2043)
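The Router Replay item above records routing decisions so they can be reapplied deterministically (e.g. when recomputing log-probabilities during RL training); here is a toy record/replay sketch with illustrative names, not the MCore implementation.

```python
import torch
import torch.nn.functional as F

class ReplayableRouter(torch.nn.Module):
    """Toy top-k router that can record its expert choices and replay them."""

    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.recorded_idx = None

    def forward(self, hidden, replay=False):
        probs = F.softmax(self.gate(hidden), dim=-1)
        if replay and self.recorded_idx is not None:
            topk_idx = self.recorded_idx              # reuse recorded routing decisions
        else:
            topk_idx = probs.topk(self.top_k, dim=-1).indices
            self.recorded_idx = topk_idx.detach()     # record for a later replay pass
        topk_probs = probs.gather(-1, topk_idx)       # gradients still flow to the gate
        return topk_idx, topk_probs
```

Replaying the recorded indices keeps token-to-expert assignments identical across forward passes even if the gate weights have drifted, which is what makes the mechanism useful for RL log-prob recomputation and debugging.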
### Fine-grained Activation Offloading Enhancement
- OOP Refactoring - Object-oriented redesign of fine-grained activation offloading ([Dev] feat(moe): code refactor for fine grained activation offloading #2905); a minimal offloading sketch follows this list
- Fix accuracy mismatch when offloading and recomputing the same module ([Dev] fix(offloading): Accuracy mismatch when offloading and recomputing same module #2123)
- Bug fix for fine-grained activation offloading in evaluate() ([Dev] [fix] Bug fix for fine-grained activation offloading in evaluate() #3041)
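As background for the fine-grained activation offloading items above, PyTorch exposes a coarse version of the idea through saved-tensor hooks that park activations in pinned CPU memory between forward and backward; Megatron Core's implementation is selective per module and pipeline-aware, but the underlying mechanism looks roughly like this sketch.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Tensors saved for backward are moved to pinned host memory and copied back
# on demand, trading PCIe/NVLink traffic for GPU activation memory.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)
    loss = y.square().mean()
loss.backward()
```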
### Megatron-FSDP
- FP8 Params Support - MXFP8/blockwise FP8 params for Megatron-FSDP ([dev] Reapply fsdp mxfp8 #2828)
- HSDP Support - Hybrid Sharded Data Parallel with EP submesh registration (Fix HSDP Registering Device Mesh #2388)
- Megatron-FSDP user guide documentation ([Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397)
### Communication
- Hybrid-EP Upgrade - Latest Hybrid-EP with kernel optimizations for EP64 and NVL8+IB ([Dev] Use the latest Hybrid-EP #2424)
- HybridEP memory-overhead reduction for 1F1B A2A overlap ([Dev] fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap #2201)
### Optimizer
- LayerWise DistOpt - LayerWiseDistributedOptimizer with torch_dist checkpoint format and Muon support ([Dev] Support LayerWiseDistributedOptimizer with torch_dist checkpoint format #1928, [DEV] Update emerging optimizers #2261)
### Critical Bug Fixes
- Megatron-FSDP hang fix - Resolve a hang caused by non-deterministic reduce-scatter ([Dev] fix(megatron-fsdp): Resolve hang caused by non-deterministic reduce-scatter #2252)
- EP overlap correctness - Fix missing final layernorm in EP overlap ([Dev] Fix ep overlap missing final layernorm #2691)
- Hybrid-EP hotfix - Fix a bug in the hybrid-ep backend of the flex-dispatcher ([DEV] [HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher #2287)
- CUDA RNG Tracker - Fix the RNG tracker to use expert-parallel-rng correctly ([Dev] Fix CUDA RNG Tracker #2640)
## Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
- Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution