MooreThreads/MT-MegatronLM


Introduction

MT-Megatron is a Python patch of Megatron-LM. Key features include:

  • Simple CUDA Compatibility: torch_musa replaces PyTorch CUDA functions with identical APIs (see the sketch after this list).

  • Ready-to-Use Training Scripts: Launch scripts are provided for large-scale models such as DeepSeek and Llama.

  • Proven Scalability & Stability: Extensively tested on Moore Threads' large-scale GPU clusters (thousands of GPUs), ensuring high MFU and long-term reliability.

  • Optimized Performance: Enhanced with MT-TransformerEngine for additional acceleration, such as per-block FP8, MoE recompute, and zero-bubble pipeline scheduling.

  • Portable Cross-Platform Compatibility: Requires only minor adaptations to run on other GPU backends.
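
As a quick illustration of the drop-in compatibility, a torch_musa installation can typically be verified from the shell as follows (a minimal sketch, assuming torch_musa exposes a torch.musa namespace analogous to torch.cuda):

# sanity check: torch_musa should make MUSA devices visible through the torch.musa namespace
python -c "import torch, torch_musa; print(torch.musa.is_available(), torch.musa.device_count())"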

Getting started

1. Prepare the code

Create a directory named train_dev and run the commands below inside it to clone MT-MegatronLM, MT-TransformerEngine, and Megatron-LM.

Note:

  1. In this repository, we pin an official Megatron-LM commit ID as a stable version. Using this commit ensures the example models run reliably.

  2. Since the official Megatron-LM evolves rapidly, we cannot maintain full development and adaptation support for every version, including the latest. Therefore, we encourage external developers to experiment with Megatron-LM’s daily main branch or newer releases for further customization. Note that MT-Megatron-LM is not limited to Moore Threads' GPUs; it also supports other GPU backends.

# clone MT-MegatronLM
git clone https://github.com/MooreThreads/MT-MegatronLM.git

# clone and install MT-TransformerEngine
git clone https://github.com/MooreThreads/MT-TransformerEngine
pushd MT-TransformerEngine
bash install.sh
popd

# clone Megatron-LM and check out the pinned stable commit
git clone https://github.com/NVIDIA/Megatron-LM
pushd Megatron-LM
git checkout -b dev/musa fdfcef87
popd
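
As a quick sanity check, you can confirm that the Megatron-LM working tree is on the pinned commit:

# optional: verify that Megatron-LM is at the pinned stable commit
git -C Megatron-LM log -1 --oneline   # the hash should start with fdfcef87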

2. Edit the hostfile

In the directory of the model you want to launch, e.g., examples/deepseek-v3, create a hostfile containing the IP addresses of all GPU nodes participating in distributed training. The launch script reads the IPs from the hostfile, establishes an SSH connection to each node, and then initiates training with torchrun.

node1-ip
node2-ip
...
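
Since the launch script connects to every node over SSH, it can help to first verify passwordless SSH access from the launch node. A minimal check, assuming the hostfile is in the current directory:

# optional: confirm each node in the hostfile is reachable over SSH
while read -r host; do
  ssh -o BatchMode=yes "$host" hostname || echo "cannot reach $host"
done < hostfile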

3. Launch Multi-Node Training

DeepSeekV2

cd examples/deepseek-v2
bash run_deepseekv2.sh

Llama3

cd examples/llama3
bash dist_run_pretrain_megatron_llama3_musa.sh
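
For long multi-node runs, it can also be convenient to keep a copy of the launcher output in a log file, for example (the log file name is only illustrative):

# example: keep a timestamped copy of the launcher output
bash run_deepseekv2.sh 2>&1 | tee "train_$(date +%Y%m%d_%H%M%S).log"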

Supported Models

  • Llama3
  • DeepSeek-V3
  • DeepSeek-V2
  • Mixtral

Future Plans

We will share our experience from training on clusters with thousands of GPUs in this repository.

Community

Issue Reporting

If you run into any problems while training large models with MT-Megatron, please open an issue.

Contributions

Contributions of any kind are welcome: code, model implementations, and documentation!

Acknowledgements

Initial development leveraged code from FlagScale; many thanks to their team.
