MT-Megatron is a Python patch of Megatron-LM. Key features include:

- Simple CUDA Compatibility: Uses torch_musa to replace PyTorch CUDA functions with identical APIs (see the sketch after this list).
- Ready-to-Use Training Scripts: Provides launch scripts for training large-scale models such as DeepSeek, Llama, etc.
- Proven Scalability & Stability: Extensively tested on Moore Threads' large-scale GPU clusters (thousands of GPUs), ensuring high MFU and long-term reliability.
- Optimized Performance: Enhanced with MT-TransformerEngine for additional acceleration, such as per-block FP8, MoE recompute, zero bubble, etc.
- Portable Cross-Platform Compatibility: Requires only minor adaptations to run on other GPU backends.
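As a quick illustration of the torch_musa compatibility layer, the snippet below moves a tensor and a module onto an MTGPU using the familiar PyTorch device API. It is a minimal sketch, assuming torch_musa is installed and registers the "musa" device; it is not code from this repository.

```python
import torch
import torch_musa  # assumed to register the "musa" device, mirroring torch.cuda

# Same PyTorch API as with CUDA; only the device string changes.
x = torch.randn(4, 4, device="musa")
model = torch.nn.Linear(4, 4).to("musa")
y = model(x)
print(y.device)  # expected: musa:0
```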
You can create a directory named `train_dev`, and use the commands below to clone MT-Megatron-LM, MT-TransformerEngine, and Megatron-LM into `train_dev`.
Note:

- In this repository, we provide an official Megatron-LM commit ID as a stable version. Using this version ensures stability with the example models.
- Since the official Megatron-LM evolves rapidly, we cannot maintain full development and adaptation support for every version, including the latest. Therefore, we encourage external developers to experiment with Megatron-LM's daily main branch or newer releases for further customization. Note that MT-Megatron-LM is not limited to Moore Threads' GPUs; it also supports other GPU backends.
```bash
# create the working directory
mkdir -p train_dev
cd train_dev

# clone MT-Megatron-LM
git clone https://github.com/MooreThreads/MT-MegatronLM.git

# clone and install MT-TransformerEngine
git clone https://github.com/MooreThreads/MT-TransformerEngine
pushd MT-TransformerEngine
bash install.sh
popd

# clone Megatron-LM and check out the stable commit
git clone https://github.com/NVIDIA/Megatron-LM
pushd Megatron-LM
git checkout -b dev/musa fdfcef87
popd
```
In the directory of the model you want to launch, e.g., examples/deepseek-v3, create a hostfile containing the IP addresses of all GPU nodes participating in distributed training. The launch script will read the IPs from the hostfile, establish SSH connections to each node, and finally initiate training using torchrun (see the sketch after the launch commands below).
```
node1-ip
node2-ip
...
```
```bash
cd examples/deepseek-v2
bash run_deepseekv2.sh
```

```bash
cd examples/llama3
bash dist_run_pretrain_megatron_llama3_musa.sh
```
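Under the hood, the per-model launch scripts implement the hostfile → SSH → torchrun flow described above. The sketch below only illustrates that flow and is not the repository's actual launcher; HOSTFILE, NPROC_PER_NODE, and TRAIN_CMD are hypothetical placeholders for values the real scripts set per model.

```python
# Illustrative multi-node launch flow: read the hostfile, then start torchrun
# on every node over SSH. All names below are placeholders, not repo code.
import subprocess

HOSTFILE = "hostfile"              # one node IP per line
NPROC_PER_NODE = 8                 # GPUs per node (assumption)
TRAIN_CMD = "pretrain_gpt.py ..."  # placeholder: real script and args come from the example's run script

with open(HOSTFILE) as f:
    nodes = [line.strip() for line in f if line.strip()]

master_addr = nodes[0]  # first node in the hostfile acts as the rendezvous master

for rank, node in enumerate(nodes):
    torchrun_cmd = (
        f"torchrun --nnodes {len(nodes)} --nproc_per_node {NPROC_PER_NODE} "
        f"--node_rank {rank} --master_addr {master_addr} --master_port 29500 "
        f"{TRAIN_CMD}"
    )
    subprocess.Popen(["ssh", node, torchrun_cmd])  # launch each node asynchronously
```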
| Model List | Availability |
|---|---|
| Llama3 | ✔ |
| DeepSeek-V3 | ✔ |
| DeepSeek-V2 | ✔ |
| Mixtral | ✔ |
We will share our experience from training on clusters with thousands of GPUs in this repo.

If you run into any problems when training large models with MT-Megatron, please open an issue.

We welcome contributions of any kind: code, model implementations, and documentation!

Initial development leveraged code from FlagScale; thanks to their team.