Megatron MoE Model Zoo

Production-ready training recipes for state-of-the-art MoE models — DeepSeek-V3, Qwen3, and Mixtral — built on the 🚀 Megatron-Core DEV branch.

✅ Performance-tuned configs for H100, B200, and GB200 clusters
✅ Model-specific best practices for training MoE models
✅ One-command launch with sensible defaults
✅ Dry-run mode to validate arguments before submitting jobs

Best Practices

Ready-to-run scripts with optimized configurations:

| Model | Hardware | Scripts |
| --- | --- | --- |
| DeepSeek-V3 | H100, B200, GB200 | best_practice/DeepSeekV3/ |
| Qwen3 | H100 | best_practice/Qwen3/ |
| Mixtral | H100 | best_practice/Mixtral/ |

See best_practice/ for detailed guides.

Quick Start

Prerequisites

Install yq for YAML processing (one-time setup):

mkdir -p ~/.local/bin && wget -qO ~/.local/bin/yq https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 && chmod +x ~/.local/bin/yq
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| MEGATRON_PATH | Path to Megatron-LM | /path/to/Megatron-LM |
| CONTAINER_IMAGE | Container image path | /path/to/image.sqsh |
| CLUSTER | Name of the cluster; used to load cluster-specific settings such as data paths | EOS, CW |
| WANDB_API_KEY | (Optional) WandB API key | From wandb.ai/authorize |
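
For example, you might export these in your shell before launching (a minimal sketch; the paths and cluster name below are placeholders, not recommended values):

export MEGATRON_PATH=/path/to/Megatron-LM
export CONTAINER_IMAGE=/path/to/image.sqsh
export CLUSTER=EOS
export WANDB_API_KEY=...  # optional, from wandb.ai/authorize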

Container

Dockerfile: dockers/Dockerfile (also available: B200.Dockerfile, GB200.Dockerfile)
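
One possible way to build and package the image for the SLURM launcher (a sketch assuming Docker and enroot are available on the build host; the image tag and output path are placeholders):

# Build the training image from the provided Dockerfile
docker build -f dockers/Dockerfile -t megatron-moe:latest .

# Import it as a .sqsh image that CONTAINER_IMAGE can point to (enroot assumed)
enroot import -o /path/to/image.sqsh dockerd://megatron-moe:latest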

Performance Benchmarking

Supported Models

Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, Qwen2-57B-A14B, Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-Next-80B-A3B

Launch

Basic launch:

MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh

With custom or overridden parameters:

MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 RUN_TIME=00:60:00 NNODES=64 \
  bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm

💡 Tip: Dry Run Mode — Preview the generated SLURM script and training command without submitting to the cluster:

DRY_RUN=1 MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh

This is highly recommended before submitting jobs to verify configurations.

Configuration

Runtime configs (runtime_configs/benchmarking/runtime.conf):

  • Parallelism: TP, PP, EP, CP, VPP, PP_FIRST, PP_LAST, LAYERS_PER_VP
  • Batch sizes: MBS, GBS
  • Training: NNODES, RUN_TIME, NUM_LAYERS, SEQ_LEN
  • MoE: MOE_TOKEN_DISPATCHER, MOE_GROUPED_GEMM

Cluster configs (cluster_configs/benchmarking/template.conf):

  • Slurm: ACCOUNT, PARTITION, RUN_NAME, CONTAINER_MOUNTS
  • Paths: OUTPUT_PATH, DATA_PATH, TOKENIZER_MODEL, LOAD_PATH
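
A hedged sketch of how these config files might be filled in; the KEY=VALUE style mirrors the environment-variable overrides above, and every value below is a placeholder rather than a recommended setting:

# runtime_configs/benchmarking/runtime.conf (hypothetical values)
TP=2            # tensor parallel size
PP=8            # pipeline parallel size
EP=64           # expert parallel size
VPP=1           # virtual pipeline stages
MBS=1           # micro-batch size
GBS=2048        # global batch size
NNODES=64
SEQ_LEN=4096
MOE_GROUPED_GEMM=true

# cluster_configs/benchmarking/<your_cluster>.conf (hypothetical values)
ACCOUNT=my_account
PARTITION=batch
CONTAINER_MOUNTS=/path/to/data:/data
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/dataset
TOKENIZER_MODEL=/path/to/tokenizer.model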

Job Monitoring

watch -n 1 squeue -u $USER
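
To inspect a specific job, standard SLURM tooling can be used (the log path below is a placeholder that depends on OUTPUT_PATH and RUN_NAME in your cluster config):

# Show the state of a job once it leaves the queue
sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode

# Follow the training log as it is written
tail -f /path/to/output/<run_name>.log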

Checkpoint Conversion

For HF↔MCore conversion, consider MBridge or Megatron-Bridge.

DeepSeek-V3

1. Download and convert to BF16:

git lfs install && git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
cd DeepSeek-V3 && python inference/fp8_cast_bf16.py --input-fp8-hf-path /input/fp8/path --output-bf16-hf-path /output/bf16/path

2. Convert to Megatron legacy checkpoint:

MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh

3. Convert to distributed checkpoint:

MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/ckpt \
  bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/dist/ckpt --ckpt-convert-format torch_dist --no-save-optim

Storage requirements: legacy checkpoint ~3.4 TB, distributed checkpoint ~1.4 TB

References

  • Design Docs - Implementation details for MTP, VPP, EP overlapping, etc.