
Commit 2c3e3c9: add Qwen-3 (#578)

1 parent 164f377 commit 2c3e3c9

File tree

17 files changed: +2727 −75 lines changed

README.md

Lines changed: 2 additions & 0 deletions

@@ -3,6 +3,7 @@
 |             | Megatron-LM-Dense | Megatron-Core-Dense | Megatron-Core-MoE | MegaBlocks-MoE |
 |:------------|:-----------------:|:-------------------:|:-----------------:|:--------------:|
+| Qwen3 | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen3/README.md#Megatron-Core模型训练流程) | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen3/README.md#Megatron-Core模型训练流程) | N/A |
 | QwQ | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwq/README.md#Megatron-Core模型训练流程) | N/A | N/A |
 | Qwen2.5-VL | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen2_5_vl/README.md#Megatron-Core模型训练流程) | N/A | N/A |
 | Moonlight | N/A | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/moonlight/README.md#Megatron-Core-MoE模型训练流程) | N/A |
@@ -25,6 +26,7 @@ English | [简体中文](./README_zh-CN.md)
 Pai-Megatron-Patch (https://github.com/alibaba/Pai-Megatron-Patch) is a deep learning training toolkit that lets developers train and evaluate LLMs & VLMs easily with the Megatron framework. As LLMs continue to develop, model structures and scales evolve rapidly. Although these models can be built conveniently with the Transformers or DeepSpeed training frameworks, training efficiency is comparatively low, and the problem becomes even more severe once the model scale exceeds 10 billion parameters. The primary objective of Pai-Megatron-Patch is to use GPU compute effectively for LLMs, enabling convenient training of commonly used LLMs with all the acceleration techniques provided by Megatron-LM.

 What's New:
+- **Support all Qwen3 training with torch_dist checkpoints** [🔥🔥 2025.04.29]
 - **[Experimental] Support distributed checkpoint conversion for large LLMs** [🔥🔥 2025.04.16]
 - **Upgrade DeepSeek-V3 SFT with a fully Mcore implementation.** [🔥🔥 2025.03.31]
 - **Support training QwQ using Megatron-Core.** [🔥🔥 2025.03.27]

README_zh-CN.md

Lines changed: 2 additions & 0 deletions

@@ -2,6 +2,7 @@
 |             | Megatron-LM-Dense | Megatron-Core-Dense | Megatron-Core-MoE | MegaBlocks-MoE |
 |:------------|:-----------------:|:-------------------:|:-----------------:|:--------------:|
+| Qwen3 | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen3/README.md#Megatron-Core模型训练流程) | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen3/README.md#Megatron-Core模型训练流程) | N/A |
 | QwQ | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwq/README.md#Megatron-Core模型训练流程) | N/A | N/A |
 | Qwen2.5-VL | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen2_5_vl/README.md#Megatron-Core模型训练流程) | N/A | N/A |
 | Moonlight | N/A | N/A | [ReadMe](https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/moonlight/README.md#Megatron-Core-MoE模型训练流程) | N/A |
@@ -45,6 +46,7 @@ Pai-Megatron-Patch bridges open-source LLMs and the Megatron training acceleration engine
 - [Alibaba Cloud PAI wins double championships in FewCLUE few-shot learning with large models](https://developer.aliyun.com/article/788081?spm=a2c6h.12873639.article-detail.17.11c5383cHpFZks&tlog=yuekan_8)

 What's New:
+- **Support training and fine-tuning the full Qwen3 series with the torch_dist checkpoint format** [🔥🔥 2025.04.29]
 - **[Experimental] Distributed MG/HF checkpoint conversion for very large models** [🔥🔥 2025.04.16]
 - **Upgrade and refine the DeepSeek-V3 training and fine-tuning workflow** [🔥🔥 2025.03.31]
 - **Support training QwQ with the Megatron-Core framework** [🔥🔥 2025.03.27]

examples/qwen3/README.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# Best Practices for Qwen3 MoE Models with Pai-Megatron-Patch

## Table of Contents
* [Installation](#installation)
* [Dataset and Model Download](#dataset-and-model-download)
* [Megatron-Core Training Workflow](#megatron-core-training-workflow)
* [Checkpoint Format Conversion](#megatron-core-checkpoint-format-conversion)
* [Continued Pretraining](#pretraining-example)
* [Instruction Fine-Tuning](#instruction-fine-tuning-example)
* [Downstream Task Evaluation](#downstream-task-evaluation)
* [Checkpoint Conversion for Evaluation](#checkpoint-conversion-for-evaluation)
* [Running the Evaluation Toolkit](#running-the-evaluation-toolkit)
## Installation

On the Alibaba Cloud PAI platform, use the dedicated image: `dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pai-megatron-patch:25.04`

Clone Pai-Megatron-Patch:
```bash
git clone --recurse-submodules https://github.com/alibaba/Pai-Megatron-Patch.git
cd Pai-Megatron-Patch
```
Qwen3-MoE now supports FlashAttention-3 acceleration, which is only available on Hopper-architecture GPUs. To use FA3 on H-series cards, run the following inside the DSW container and save the image:
```bash
pip install "git+https://github.com/Dao-AILab/flash-attention.git#egg=flashattn-hopper&subdirectory=hopper"
python_path=`python -c "import site; print(site.getsitepackages()[0])"`
mkdir -p $python_path/flashattn_hopper
wget -P $python_path/flashattn_hopper https://raw.githubusercontent.com/Dao-AILab/flash-attention/main/hopper/flash_attn_interface.py
```
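After installing, you can sanity-check that the interface file landed where the `wget` above put it (a minimal sketch, not part of the repo):

```python
# Illustrative post-install check: the commands above copy
# flash_attn_interface.py into a flashattn_hopper directory under
# site-packages; verify the file landed where expected.
import os
import site

pkg_dir = os.path.join(site.getsitepackages()[0], "flashattn_hopper")
interface = os.path.join(pkg_dir, "flash_attn_interface.py")
print("FA3 interface present:", os.path.isfile(interface))
```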
## Dataset and Model Download

```bash
cd /mnt
mkdir qwen-ckpts
cd qwen-ckpts
git clone https://www.modelscope.cn/Qwen/Qwen3-30B-A3B.git

# Download the datasets into /mnt/qwen-datasets, the path the training
# commands below expect.
cd /mnt
mkdir qwen-datasets
cd qwen-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/mmap_qwen3_datasets_text_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/mmap_qwen3_datasets_text_document.idx

wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/datasets/alpaca_zh-train-general.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/datasets/alpaca_zh-valid-general.json
```
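Megatron's idxmap (mmap) datasets ship as a `.bin`/`.idx` pair sharing one prefix, and the training commands below take that prefix without any extension. A small sketch (hypothetical helper, not from the repo) for checking a prefix before launching a long job:

```python
import os
import tempfile

def mmap_dataset_ready(prefix: str) -> bool:
    """True when both halves of a Megatron mmap dataset pair exist."""
    return all(os.path.isfile(prefix + ext) for ext in (".bin", ".idx"))

# Demo against a throwaway directory; the real prefix in this guide is
# /mnt/qwen-datasets/mmap_qwen3_datasets_text_document.
with tempfile.TemporaryDirectory() as d:
    prefix = os.path.join(d, "mmap_qwen3_datasets_text_document")
    open(prefix + ".bin", "w").close()
    print(mmap_dataset_ready(prefix))  # False: .idx is still missing
    open(prefix + ".idx", "w").close()
    print(mmap_dataset_ready(prefix))  # True: both files present
```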
## Megatron-Core Training Workflow

### Megatron-Core Checkpoint Format Conversion

Qwen3 training has been upgraded to the `torch_dist` checkpoint format. The conversion script takes the following parameters:
```
MODEL_SIZE=$1  # model size: 0.6B, 1.7B, 4B, 8B, 14B, 32B, A3B, A22B
LOAD_DIR=$2    # path to the source checkpoint
SAVE_DIR=$3    # path to save the converted checkpoint
MG2HF=$4       # conversion direction: true (mcore to hf), false (hf to mcore)
USE_CUDA=$5    # whether to convert on GPU (recommended: true)
PR=$6          # conversion precision: fp32, bf16, fp16
HF_DIR=$7      # path to the HF checkpoint (required for mcore2hf)
```
For example, convert the downloaded checkpoint to the MCore format with the script below.
```bash
cd /workspace/Pai-Megatron-Patch/toolkits/distributed_checkpoints_convertor
bash scripts/qwen3/run_8xH20.sh \
A3B \
/mnt/qwen-ckpts/Qwen3-30B-A3B \
/mnt/qwen-ckpts/Qwen3-30B-A3B-to-mcore \
false \
true \
bf16
```

If you need a customized conversion script, see the distributed checkpoint convertor toolkit.
### Megatron-Core Pretraining and Instruction Fine-Tuning

For Qwen3 MoE, pretraining and fine-tuning are unified in the `run_mcore_qwen3.sh` script; the meaning of some parameters differs between the two use cases.

#### Unified Pretraining & Fine-Tuning Command

The script takes the following parameters:
```bash
ENV=$1                         # runtime environment: dsw for single-node training, dlc for multi-node training
MODEL_SIZE=$2                  # model size: 0.6B, 1.7B, 4B, 8B, 14B, 32B, A3B, A22B
BATCH_SIZE=$3                  # samples per data-parallel rank per iteration (micro batch size)
GLOBAL_BATCH_SIZE=$4           # total samples per iteration across all data-parallel ranks
LR=$5                          # learning rate
MIN_LR=$6                      # minimum learning rate
SEQ_LEN=$7                     # sequence length
PAD_LEN=$8                     # padding length
PR=${9}                        # training precision: fp16, bf16, fp8
TP=${10}                       # tensor parallel size
PP=${11}                       # pipeline parallel size
CP=${12}                       # context parallel size
ETP=${13}                      # expert tensor parallel size
EP=${14}                       # expert parallel size
SP=${15}                       # enable sequence parallelism: true, false
DO=${16}                       # enable Megatron's ZeRO-1 memory-saving distributed optimizer: true, false
FL=${17}                       # prefer Flash Attention: true, false
SFT=${18}                      # run supervised fine-tuning: true, false
AC=${19}                       # activation checkpointing mode: sel, full, offload, false
OPTIMIZER_OFFLOAD=${20}        # optimizer offload: false, or a decimal in 0~1 giving the offload fraction
SAVE_INTERVAL=${21}            # checkpoint save interval
DATASET_PATH=${22}             # training dataset path
VALID_DATASET_PATH=${23}       # validation dataset path
PRETRAIN_CHECKPOINT_PATH=${24} # pretrained model path
TRAIN_TOKENS_OR_ITERS=${25}    # number of training tokens or iterations
WARMUP_TOKENS_OR_ITERS=${26}   # number of warmup tokens or iterations
OUTPUT_BASEPATH=${27}          # output path for training logs and checkpoints
```
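Since the launcher takes 27 positional arguments, it is easy to drop or transpose one. As an illustration only (this helper is not part of Pai-Megatron-Patch), the argument order above can be captured in a dict-to-list builder that refuses to emit an incomplete command line:

```python
# Hypothetical helper: assemble the 27 positional arguments for
# run_mcore_qwen3.sh from named settings, so no argument is silently skipped.
ARG_ORDER = [
    "ENV", "MODEL_SIZE", "BATCH_SIZE", "GLOBAL_BATCH_SIZE", "LR", "MIN_LR",
    "SEQ_LEN", "PAD_LEN", "PR", "TP", "PP", "CP", "ETP", "EP", "SP", "DO",
    "FL", "SFT", "AC", "OPTIMIZER_OFFLOAD", "SAVE_INTERVAL", "DATASET_PATH",
    "VALID_DATASET_PATH", "PRETRAIN_CHECKPOINT_PATH",
    "TRAIN_TOKENS_OR_ITERS", "WARMUP_TOKENS_OR_ITERS", "OUTPUT_BASEPATH",
]

def build_args(settings):
    missing = [name for name in ARG_ORDER if name not in settings]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    # Shell booleans are lowercase true/false.
    fmt = lambda v: str(v).lower() if isinstance(v, bool) else str(v)
    return [fmt(settings[name]) for name in ARG_ORDER]

# Settings of the pretraining example below:
args = build_args(dict(
    ENV="dlc", MODEL_SIZE="A3B", BATCH_SIZE=1, GLOBAL_BATCH_SIZE=8,
    LR="1e-5", MIN_LR="1e-6", SEQ_LEN=128, PAD_LEN=128, PR="bf16",
    TP=4, PP=2, CP=1, ETP=1, EP=4, SP=True, DO=True, FL=True, SFT=False,
    AC="sel", OPTIMIZER_OFFLOAD=False, SAVE_INTERVAL=100000,
    DATASET_PATH="/mnt/qwen-datasets/mmap_qwen3_datasets_text_document",
    VALID_DATASET_PATH="/mnt/qwen-datasets/mmap_qwen3_datasets_text_document",
    PRETRAIN_CHECKPOINT_PATH="/mnt/qwen-ckpts/Qwen3-30B-A3B-to-mcore",
    TRAIN_TOKENS_OR_ITERS=10000, WARMUP_TOKENS_OR_ITERS=100,
    OUTPUT_BASEPATH="/mnt/logs/output_mcore_qwen3_pretrain",
))
print(" ".join(args[:4]))  # dlc A3B 1 8
```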
#### Pretraining Example

Use the following command to launch continued pretraining of Qwen3.
Note: when `AC` is `offload` or `full`, the `MP_AC_LAYERS` environment variable controls how many TransformerLayers are checkpointed or offloaded at a time (default: `1`).

```bash
cd /workspace/Pai-Megatron-Patch/examples/qwen3
sh run_mcore_qwen3.sh \
dlc \
A3B \
1 \
8 \
1e-5 \
1e-6 \
128 \
128 \
bf16 \
4 \
2 \
1 \
1 \
4 \
true \
true \
true \
false \
sel \
false \
100000 \
/mnt/qwen-datasets/mmap_qwen3_datasets_text_document \
/mnt/qwen-datasets/mmap_qwen3_datasets_text_document \
/mnt/qwen-ckpts/Qwen3-30B-A3B-to-mcore \
10000 \
100 \
/mnt/logs/output_mcore_qwen3_pretrain
```
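The parallel sizes in this example follow the usual Megatron-LM arithmetic: TP × PP × CP must divide the world size, and the global batch size must be divisible by micro-batch × data-parallel size. A quick consistency check (standard arithmetic, not a script from the repo):

```python
# Illustrative consistency check for Megatron parallelism settings.
def check_parallelism(world_size, tp, pp, cp, micro_batch, global_batch):
    assert world_size % (tp * pp * cp) == 0, "TP*PP*CP must divide world size"
    dp = world_size // (tp * pp * cp)  # data-parallel size
    assert global_batch % (micro_batch * dp) == 0, \
        "global batch must be divisible by micro-batch * DP"
    # Returns (data-parallel size, gradient-accumulation steps).
    return dp, global_batch // (micro_batch * dp)

# The example above on one 8-GPU node: TP=4, PP=2, CP=1, MBS=1, GBS=8.
print(check_parallelism(8, 4, 2, 1, 1, 8))  # (1, 8)
```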
#### Instruction Fine-Tuning Example

To build an idxmap-format dataset for fine-tuning, see the [sft_data_preprocessing toolkit](https://github.com/alibaba/Pai-Megatron-Patch/tree/main/toolkits/sft_data_preprocessing).
Once the fine-tuning dataset is ready, set the SFT switch to `true` to run instruction fine-tuning:

```bash
cd /workspace/Pai-Megatron-Patch/examples/qwen3
sh run_mcore_qwen3.sh \
dlc \
A3B \
1 \
8 \
1e-5 \
1e-6 \
128 \
128 \
bf16 \
4 \
2 \
1 \
1 \
4 \
true \
true \
true \
true \
sel \
false \
100000 \
/mnt/qwen-datasets/path_to_your_dataset \
/mnt/qwen-datasets/path_to_your_dataset \
/path/to/pretraining/checkpoint \
10000 \
100 \
/workspace/output_mcore_qwen3_finetune
```
By setting the `MP_DATASET_TYPE` environment variable, the script can also run instruction fine-tuning on a JSON-format dataset:
```bash
export MP_DATASET_TYPE="raw"
cd /workspace/Pai-Megatron-Patch/examples/qwen3
sh run_mcore_qwen3.sh \
dlc \
A3B \
1 \
8 \
1e-5 \
1e-6 \
128 \
128 \
bf16 \
4 \
2 \
1 \
1 \
4 \
true \
true \
true \
true \
sel \
false \
100000 \
/mnt/qwen-datasets/alpaca_zh-train-general.json \
/mnt/qwen-datasets/alpaca_zh-valid-general.json \
/mnt/qwen-ckpts/Qwen3-30B-A3B-to-mcore \
10000 \
100 \
/workspace/output_mcore_qwen3_finetune
```
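The exact raw-JSON schema expected with `MP_DATASET_TYPE="raw"` is defined by the repo's data pipeline; assuming an Alpaca-style array of records (as files like `alpaca_zh-train-general.json` conventionally are), a quick peek before training might look like:

```python
# Hypothetical sanity check: inspect a raw JSON SFT dataset before launching
# a job. We only assume the file holds a JSON array of records; the field
# names below are illustrative, not the pipeline's required schema.
import json
import tempfile

def peek(path, n=2):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    print(f"{len(records)} records; first has keys {sorted(records[0])}")
    return records[:n]

# Demo with a throwaway Alpaca-style file:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"instruction": "翻译成英文", "input": "你好", "output": "Hello"}], f)
head = peek(f.name)
```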
## Downstream Task Evaluation

### Checkpoint Conversion for Evaluation

To run inference-based evaluation, convert the Megatron-Core checkpoint saved after training/fine-tuning back to the HuggingFace format:

```bash
cd /workspace/Pai-Megatron-Patch/toolkits/distributed_checkpoints_convertor
bash scripts/qwen3/run_8xH20.sh \
A3B \
/mnt/qwen-ckpts/Qwen3-30B-A3B-to-mcore \
/mnt/qwen-ckpts/Qwen3-30B-A3B-mcore-to-hf \
true \
true \
bf16 \
/mnt/qwen-ckpts/Qwen3-30B-A3B
```
### Running the Evaluation Toolkit

Download the evaluation data:
```bash
# In container
cd /workspace

wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/evaluation-datasets/evaluate.tgz
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/evaluation-datasets/cmmlu.tgz
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/evaluation-datasets/ceval.tgz

tar -xvzf cmmlu.tgz
tar -xvzf ceval.tgz
tar -xvzf evaluate.tgz
```
Run the following command to evaluate the converted model:
```bash
cd /workspace/Pai-Megatron-Patch/LM-Evaluation-Harness-240310
accelerate launch --main_process_port 29051 -m lm_eval \
--model hf \
--model_args pretrained=/mnt/qwen-ckpts/Qwen3-30B-A3B-mcore-to-hf,trust_remote_code=True \
--tasks cmmlu,ceval-valid \
--batch_size 16
```
