The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
This roadmap is based on the dev branch; please see its README for details.
## Model Support
- ✅ DeepSeek
  - ✅ DeepSeek-V2
  - ✅ DeepSeek-V3, including MTP
  - ✅ DeepSeek-V3.2 ([dev] DeepSeek V3.2 support #2154)
- ✅ Qwen
  - ✅ Qwen2-57B-A14B
  - ✅ Qwen3-235B-A22B
  - ✅ (New!) Qwen3-Next
- ✅ Mixtral
## Core MoE Functionality
- ✅ Token dropless MoE - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Auxiliary load-balancing losses for balanced expert utilization (see the sketch after this list)
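For orientation, here is a minimal, hedged sketch of the top-K routing and auxiliary load-balancing loss described above; the function and argument names (`topk_route`, `aux_loss_coeff`, etc.) are illustrative and not the Megatron Core API.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, router_weight, top_k=2, aux_loss_coeff=1e-2):
    """Minimal top-k routing with a Switch/GShard-style load-balancing loss.

    hidden:        [num_tokens, hidden_size]
    router_weight: [hidden_size, num_experts]
    """
    logits = hidden @ router_weight                        # [tokens, experts]
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)       # flexible K selection

    # Load-balancing loss: fraction of tokens dispatched to each expert times
    # the mean routing probability for that expert, summed over experts.
    num_experts = probs.size(-1)
    dispatch = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # [tokens, experts]
    tokens_per_expert = dispatch.mean(dim=0)
    mean_probs = probs.mean(dim=0)
    aux_loss = aux_loss_coeff * num_experts * (tokens_per_expert * mean_probs).sum()
    return topk_idx, topk_probs, aux_loss
```

In a dropless MoE, every selected token is dispatched to its experts (no capacity-based dropping), and the auxiliary loss is what keeps the resulting expert loads roughly even.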
## Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration (a process-group layout sketch follows this list)
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long-sequence MoE training
- ✅ Parallel Folding - Heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (New!) Megatron FSDP/HSDP with full expert parallel support
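As referenced in the EP bullet above, here is a rough sketch of how expert-, tensor-, and data-parallel ranks can be laid out as orthogonal process groups; the mesh sizes and dimension names below are illustrative assumptions, not Megatron Core's actual parallel-state initialization.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

def build_moe_mesh():
    """Illustrative 8-GPU layout: 2-way DP x 2-way EP x 2-way TP.

    Assumes torch.distributed is already initialized (e.g. via torchrun).
    Megatron Core builds richer groups (PP, CP, expert-TP, ...) internally.
    """
    mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "ep", "tp"))
    dp_group = mesh.get_group("dp")   # gradient reduction / FSDP sharding
    ep_group = mesh.get_group("ep")   # expert all-to-all dispatch and combine
    tp_group = mesh.get_group("tp")   # tensor-parallel collectives inside experts
    return mesh, dp_group, ep_group, tp_group
```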
## Optimizations
- ✅ Memory-efficient token permutation (see the sketch after this list)
- ✅ Fine-grained recomputation (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and gradient accumulation fusion
- ✅ DP/PP/TP/EP communication overlapping
- ✅ Advanced fusions for the router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused attention and FlashAttention integration
- ✅ (New!) 1F1B EP A2A overlap - Hiding expert-parallel communication with the 1F1B pipeline schedule
- ✅ (New!) Muon and layer-wise distributed optimizer
- ✅ (New!) Pipeline-aware fine-grained activation offloading ([Dev] feat(moe): Fine-grained activation offloading #1912)
- ✅ (New!) Production-ready CUDA Graph support for MoE
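The memory-efficient token permutation above refers to regrouping tokens by their assigned expert so that all expert MLPs can run as one grouped GEMM; below is a deliberately simple, unfused sketch of the permute/unpermute step (top-1 routing, illustrative names only).

```python
import torch

def permute_tokens(tokens, expert_idx, num_experts):
    """Sort tokens so that each expert sees a contiguous slice.

    tokens:     [num_tokens, hidden]
    expert_idx: [num_tokens] expert assignment (top-1 for simplicity)
    """
    order = torch.argsort(expert_idx, stable=True)
    permuted = tokens[order]                                       # contiguous per expert
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)
    return permuted, order, tokens_per_expert

def unpermute_tokens(expert_output, order):
    """Scatter expert outputs back to the original token order."""
    output = torch.empty_like(expert_output)
    output[order] = expert_output
    return output
```

The production path fuses this permutation with probability scaling and avoids intermediate copies, which is where the memory savings come from.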
## Precision Support
- ✅ GroupedGEMM with FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states (a minimal sketch of the master-weight pattern follows this list)
- ✅ Full FP8 training support
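A minimal sketch of the "low-precision weights, higher-precision master copy" pattern behind the FP8-weights bullet, using plain PyTorch dtypes purely for illustration; real FP8 training in Megatron Core goes through Transformer Engine recipes with tensorwise or blockwise scaling, not a bare dtype cast like this.

```python
import torch

# Assumed setup: the forward/backward pass consumes an FP8 storage copy of the
# weight, while AdamW keeps its states on a BF16 master copy.
master_w = torch.nn.Parameter(torch.empty(4096, 4096, dtype=torch.bfloat16, device="cuda"))
torch.nn.init.normal_(master_w, std=0.02)
fp8_w = master_w.detach().to(torch.float8_e4m3fn)     # storage copy used by the model

optimizer = torch.optim.AdamW([master_w], lr=1e-4)

def optimizer_step(grad: torch.Tensor) -> None:
    master_w.grad = grad.to(torch.bfloat16)   # gradient applied to the BF16 master
    optimizer.step()                          # optimizer states stay in BF16
    optimizer.zero_grad(set_to_none=True)
    with torch.no_grad():
        fp8_w.copy_(master_w)                 # refresh the FP8 copy for the next step
```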
## Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200
- ✅ (New!) HybridEP for GB200
## Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore2HF converter for ecosystem compatibility in megatron-bridge
- ✅ Distributed checkpointing support
- ✅ Runtime upcycling support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
## Next Release Roadmap (MCore v0.17)
### Performance & Kernel Optimizations
- Split-K Indexer Kernels - Avoid materializing the [seqlen_q, seqlen_k] tensor with split-K kernels ([WIP Feat] Split-K Indexer Kernels #2869, draft)
- Absorbed MLA - MLA computation optimization for DSA (Add absorbed-mla & fused dsa #3044)
- A2A Overlap Refinement - Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 ([Dev](moe): Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 #2730)
- HybridEP preprocess optimization
### Long Context & Context Parallel
- Hybrid CP Part 2 - Enhanced hybrid data x context parallelism ([Dev] feat: Dynamic CP (part 2) #2000)
- THD Format E2E Support - End-to-end THD format support (Add E2E support for THD format #2924, draft)
### Model & Architecture
- Manifold Hyper Connection (mHC) - New architecture feature for improved loss convergence ([dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection (mHC) #2943, draft)
- GDN THD Support - Packed sequence support for gated delta net ([dev] feat(moe): Support packed sequence for gated delta net (GDN) #2644)
- GDN Refinement - Refine the gated delta net implementation ([dev] perf(moe): Refine gated delta net implementation #3040)
### Advanced Functionality
- Router replay support for RL training (in progress) (feat: add routing replay for Mcore #2693)
- Megatron FSDP performance optimization for MoE training
### CUDA Graph Enhancements
- MoE ECHO - Elastic Cloning for sync-free, full-CUDA-graph dropless MoE training (MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368, draft)
- Paged Stashing - Dynamic tensor support for dropless MoE with CUDA graphs (Paged Stashing #2690, draft)
- CUDA Graph + Offloading - Support CUDA graph capture with offloading modules (Support CUDA Graph capture offloading modules #2437, draft)
- Optimizer CUDA Graph - Enable CUDA graphs for the Adam optimizer (Enable Optimizer CUDA graph for ADAM optimizer #2931, draft); a basic capture/replay sketch follows this list
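For context on the items above, this is the standard PyTorch CUDA Graph capture-and-replay pattern (static buffers, warm-up on a side stream, then replaying the captured graph each step); the MoE work in these PRs layers dropless routing, offloading, and optimizer steps on top of this basic mechanism.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(16, 1024, device="cuda")

# Warm up a few iterations on a side stream before capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one iteration. Every tensor touched here must keep a fixed address
# and shape, which is exactly what dynamic token counts in dropless MoE break.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer, then launch the whole graph
# with a single CPU call, eliminating per-kernel launch overhead.
for _ in range(10):
    static_input.copy_(torch.randn(16, 1024, device="cuda"))
    graph.replay()
```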
## Ongoing Long-term Features
- E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
- Sync-free, full-iteration CUDA Graph MoE training
  - MoE ECHO for dropless MoE load balancing (MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368)
  - Paged Stashing for dynamic tensor support (Paged Stashing #2690)
- CPU overhead optimizations for Blackwell performance
- MLA optimizations
  - Absorbed MLA (Add absorbed-mla & fused dsa #3044)
  - MLA CP 2.0 - MLA CP enhancement for longer-sequence training
- THD and long context
  - THD Format E2E Support - End-to-end THD format support (Add E2E support for THD format #2924)
  - Dynamic Context Parallel for imbalanced long-sequence training
- Megatron FSDP performance optimization for MoE training
- Kernel fusions and optimizations for MoE models from TE (MoE training optimization TransformerEngine#2438)
- New architecture support
  - Manifold Hyper Connection (mHC) ([dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection (mHC) #2943)
## v0.16 Update Highlights
### Performance & Memory
- Fused Linear and Cross Entropy - Fuse lm_head and cross entropy to avoid materializing intermediate logits, reducing memory ([Dev] Feature: linear cross entropy fusion #2256); a chunked sketch of the idea follows this list
- Optimizer State Offloading - Offload optimizer states and master weights to CPU for significant GPU memory savings ([Dev] [Reapply] Optimizer State and Master Weight Offloading #2987)
- MTP Standalone Stages - Support placing MTP layers in standalone pipeline stages for better VPP balance ([Dev] feat(moe): Support placing MTP layers into standalone stages #1916)
- DeepSeek V3.2 Support - Performance optimizations for DeepSeek V3.2 ([dev] DeepSeek V3.2 support #2154)
- DeepSeek-V3 pre-training performance guide (~960 TFLOPS/GPU on 256 GB200 GPUs) ([Dev] A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200 #1996)
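The fused linear + cross-entropy item above avoids materializing the full [tokens, vocab] logits tensor; a rough way to see the memory effect is to chunk the tokens so only a slice of logits exists at any time. This is only a sketch of the idea with illustrative names; the PR implements a proper fused kernel, and under plain autograd each chunk's logits are still saved for backward.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss(hidden, lm_head_weight, labels, chunk_size=4096):
    """Cross-entropy over the vocabulary without a full [tokens, vocab] logits tensor.

    hidden:         [num_tokens, hidden_size]
    lm_head_weight: [vocab_size, hidden_size]
    labels:         [num_tokens]
    """
    total_loss = hidden.new_zeros((), dtype=torch.float32)
    num_tokens = hidden.size(0)
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = F.linear(h, lm_head_weight)          # only [chunk, vocab] is live here
        total_loss = total_loss + F.cross_entropy(logits.float(), y, reduction="sum")
    return total_loss / num_tokens
```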
### CUDA Graph
- cuda_graph_scope refactoring ([Dev] feat(MoE): Refactor cuda_graph_scope - part2 #2353, [Dev] TE cudagraph recompute #2694)
- Partial CUDA Graph for EP Overlap - Release CPU pressure within a selected scope for EP A2A overlap ([Dev](Reapply) Partial CUDA Graph support for EP Overlap #2810)
- TE cudagraph input memory optimization - Reuse the static input memory buffer across microbatches ([Dev] Optimize TE cudagraph input memory #2391)
- CUDA Graph compatibility with FP8 params (tensorwise & blockwise) ([DEV] Make CUDA graph compatible with FP8 params (tensorwise & blockwise) #2087)
- NVFP4 MoE CUDA Graph support with 128 zero padding ([DEV][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support #2654)
### Model & Parallelism
- Qwen3-Next Enhancements - QK layernorm weight decay support and Gated Delta Net CP for long context ([dev] feat(moe): Support apply wd to qk layernorm for Qwen3-Next #2825, [Dev] Feat(moe): Gated delta net context parallel (CP) #2614)
- Hybrid Data x Context Parallelism - New parallelism strategy combining DP and CP ([Dev] Hybrid Data x Context Parallelism Feature #2054)
- Router Replay - Deterministic routing mechanism for debugging and RL training (feat: add routing replay for Mcore #2693); see the record/replay sketch after this list
- Fake Distributed Process Group - Skip all distributed communication ops with `--fake-process-group` for profiling ([DEV] Add support of fake distributed process group #2254)
- Remove padding tokens from the MoE routing loss - Improve aux-loss correctness and efficiency ([Dev] Remove calculation of padding token in moe routing loss #2121)
- Context parallel support for the eager attention implementation ([Community][Dev] feat(moe): Adding context parallel support to eager attention implementation #1859)
- Packed sequence support in the MTP module ([Dev] Support packed seq in MTP #2043)
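The Router Replay item above records routing decisions so they can be reapplied deterministically (e.g. when recomputing log-probabilities during RL training); here is a toy record/replay sketch with illustrative names, not the MCore implementation.

```python
import torch
import torch.nn.functional as F

class ReplayableRouter(torch.nn.Module):
    """Toy top-k router that can record its expert choices and replay them."""

    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.recorded_idx = None

    def forward(self, hidden, replay=False):
        probs = F.softmax(self.gate(hidden), dim=-1)
        if replay and self.recorded_idx is not None:
            topk_idx = self.recorded_idx              # reuse recorded routing decisions
        else:
            topk_idx = probs.topk(self.top_k, dim=-1).indices
            self.recorded_idx = topk_idx.detach()     # record for a later replay pass
        topk_probs = probs.gather(-1, topk_idx)       # gradients still flow to the gate
        return topk_idx, topk_probs
```

Replaying the recorded indices keeps token-to-expert assignments identical across forward passes even if the gate weights have drifted, which is what makes the mechanism useful for RL log-prob recomputation and debugging.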
### Fine-grained Activation Offloading Enhancement
- OOP Refactoring - Object-oriented redesign of fine-grained activation offloading ([Dev] feat(moe): code refactor for fine grained activation offloading #2905); a minimal offloading sketch follows this list
- Fix accuracy mismatch when offloading and recomputing the same module ([Dev] fix(offloading): Accuracy mismatch when offloading and recomputing same module #2123)
- Bug fix for fine-grained activation offloading in evaluate() ([Dev] [fix] Bug fix for fine-grained activation offloading in evaluate() #3041)
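As background for the fine-grained activation offloading items above, PyTorch exposes a coarse version of the idea through saved-tensor hooks that park activations in pinned CPU memory between forward and backward; Megatron Core's implementation is selective per module and pipeline-aware, but the underlying mechanism looks roughly like this sketch.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Tensors saved for backward are moved to pinned host memory and copied back
# on demand, trading PCIe/NVLink traffic for GPU activation memory.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)
    loss = y.square().mean()
loss.backward()
```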
### Megatron-FSDP
- FP8 Params Support - MXFP8/blockwise FP8 params for Megatron-FSDP ([dev] Reapply fsdp mxfp8 #2828)
- HSDP Support - Hybrid Sharded Data Parallel with EP submesh registration (Fix HSDP Registering Device Mesh #2388)
- Megatron-FSDP user guide documentation ([Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397)
### Communication
- Hybrid-EP Upgrade - Latest Hybrid-EP with kernel optimizations for EP64 and NVL8+IB ([Dev] Use the latest Hybrid-EP #2424)
- HybridEP memory-overhead reduction for 1F1B A2A overlap ([Dev] fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap #2201)
### Optimizer
- LayerWise DistOpt - LayerWiseDistributedOptimizer with torch_dist checkpoint format and Muon support ([Dev] Support LayerWiseDistributedOptimizer with torch_dist checkpoint format #1928, [DEV] Update emerging optimizers #2261)
### Critical Bug Fixes
- Megatron-FSDP hang fix - Resolve a hang caused by non-deterministic reduce-scatter ([Dev] fix(megatron-fsdp): Resolve hang caused by non-deterministic reduce-scatter #2252)
- EP overlap correctness - Fix missing final layernorm in EP overlap ([Dev] Fix ep overlap missing final layernorm #2691)
- Hybrid-EP hotfix - Fix a bug in the hybrid-ep backend of the flex-dispatcher ([DEV] [HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher #2287)
- CUDA RNG Tracker - Fix the RNG tracker to use expert-parallel-rng correctly ([Dev] Fix CUDA RNG Tracker #2640)
## Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
- Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution