MT-Megatron is a Python patch of Megatron-LM. Key features include:

- Simple CUDA Compatibility: Uses torch_musa to replace PyTorch CUDA functions with identical APIs (see the sketch after this list).
- Ready-to-Use Training Scripts: Provides launch scripts for training large-scale models such as DeepSeek, Llama, etc.
- Proven Scalability & Stability: Extensively tested on Moore Threads' large-scale GPU clusters (thousands of GPUs), ensuring high MFU and long-term reliability.
- Optimized Performance: Enhanced with MT-TransformerEngine for additional acceleration, such as per-block FP8, MoE recompute, zero bubble, etc.
- Portable Cross-Platform Compatibility: Requires only minor adaptations to run on other GPU backends.
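As a quick illustration of the torch_musa compatibility layer, the snippet below moves a tensor and a module onto an MTGPU using the familiar PyTorch device API. It is a minimal sketch, assuming torch_musa is installed and registers the "musa" device; it is not code from this repository.

```python
import torch
import torch_musa  # assumed to register the "musa" device, mirroring torch.cuda

# Same PyTorch API as with CUDA; only the device string changes.
x = torch.randn(4, 4, device="musa")
model = torch.nn.Linear(4, 4).to("musa")
y = model(x)
print(y.device)  # expected: musa:0
```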
You can create a directory named `train_dev`, and use the commands below to clone MT-Megatron-LM, MT-TransformerEngine, and Megatron-LM into `train_dev`.
Note:

- In this repository, we provide an official Megatron-LM commit ID as a stable version. Using this version ensures stability with the example models.
- Since the official Megatron-LM evolves rapidly, we cannot maintain full development and adaptation support for every version, including the latest. Therefore, we encourage external developers to experiment with Megatron-LM's daily main branch or newer releases for further customization. Note that MT-Megatron-LM is not limited to Moore Threads' GPUs; it also supports other GPU backends.
```bash
# create the working directory
mkdir -p train_dev
cd train_dev

# clone MT-Megatron-LM
git clone https://github.com/MooreThreads/MT-MegatronLM.git

# clone and install MT-TransformerEngine
git clone https://github.com/MooreThreads/MT-TransformerEngine
pushd MT-TransformerEngine
bash install.sh
popd

# clone Megatron-LM and check out the stable commit
git clone https://github.com/NVIDIA/Megatron-LM
pushd Megatron-LM
git checkout -b dev/musa fdfcef87
popd
```
In the directory of the model you want to launch, e.g., examples/deepseek-v3, create a hostfile containing the IP addresses of all GPU nodes participating in distributed training. The launch script will read the IPs from the hostfile, establish SSH connections to each node, and finally initiate training using torchrun (see the sketch after the launch commands below).
```
node1-ip
node2-ip
...
```
```bash
cd examples/deepseek-v2
bash run_deepseekv2.sh
```

```bash
cd examples/llama3
bash dist_run_pretrain_megatron_llama3_musa.sh
```
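Under the hood, the per-model launch scripts implement the hostfile → SSH → torchrun flow described above. The sketch below only illustrates that flow and is not the repository's actual launcher; HOSTFILE, NPROC_PER_NODE, and TRAIN_CMD are hypothetical placeholders for values the real scripts set per model.

```python
# Illustrative multi-node launch flow: read the hostfile, then start torchrun
# on every node over SSH. All names below are placeholders, not repo code.
import subprocess

HOSTFILE = "hostfile"              # one node IP per line
NPROC_PER_NODE = 8                 # GPUs per node (assumption)
TRAIN_CMD = "pretrain_gpt.py ..."  # placeholder: real script and args come from the example's run script

with open(HOSTFILE) as f:
    nodes = [line.strip() for line in f if line.strip()]

master_addr = nodes[0]  # first node in the hostfile acts as the rendezvous master

for rank, node in enumerate(nodes):
    torchrun_cmd = (
        f"torchrun --nnodes {len(nodes)} --nproc_per_node {NPROC_PER_NODE} "
        f"--node_rank {rank} --master_addr {master_addr} --master_port 29500 "
        f"{TRAIN_CMD}"
    )
    subprocess.Popen(["ssh", node, torchrun_cmd])  # launch each node asynchronously
```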
| Model List | Availability |
|---|---|
| Llama3 | ✔ |
| DeepSeek-V3 | ✔ |
| DeepSeek-V2 | ✔ |
| Mixtral | ✔ |
We will share our experience from training on clusters with thousands of GPUs in this repo.

If you run into any problems when training large models with MT-Megatron, please open an issue.

We welcome contributions of any kind: code, model implementations, and documentation!

Initial development leveraged code from FlagScale; thanks to their team.