PyTorch Framework Supported Models

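The PyTorch framework groups supported models into three categories by architecture: dense models, sparse models, and state space models. See the lists below for details.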

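Table fields:

Model: the model name.

Download link: where to download the model weights; the link text shows the model size or variant and points to a model repository such as Hugging Face.

Script location: path to the model's training scripts in this project, for locating and running the model quickly. See also the script environment variable description.

Sequence length: the maximum supported text sequence length.

Training backend: the implementation the model supports, one of Legacy, Mcore, or FSDP2. Legacy is the historical implementation, Mcore is the currently recommended mainstream Megatron implementation, and FSDP2 is the distributed training implementation officially recommended by PyTorch.

Cluster size: the recommended cluster configuration for training, written as "nodes×cards per node" (for example, 2x8 means 2 nodes with 8 cards each).

Supported version: the last release in which the model is maintained. A blank cell means the model has been maintained from its introduction through the current master branch.

Contributor: the source of the model contribution.

Certification: 【Pass】 has passed official testing; 【Test】 is under internal testing. Please report any problems via issues.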
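
The field definitions above can be read mechanically. The following is a minimal sketch, not part of this project: the names `ClusterSize`, `parse_cluster_size`, and `is_maintained_on_master` are hypothetical helpers introduced for illustration. It assumes the cluster size cell uses either `x` or `×` as the separator, as the tables below do.

```python
from dataclasses import dataclass


@dataclass
class ClusterSize:
    """Cluster size cell: number of nodes and cards per node."""
    nodes: int
    cards_per_node: int

    @property
    def total_cards(self) -> int:
        # Total accelerator cards across the whole cluster.
        return self.nodes * self.cards_per_node


def parse_cluster_size(text: str) -> ClusterSize:
    """Parse a cell such as '2x8' or '2×8' into a ClusterSize."""
    normalized = text.strip().lower().replace("×", "x")
    nodes, sep, cards = normalized.partition("x")
    if not sep or not nodes.isdigit() or not cards.isdigit():
        raise ValueError(f"expected '<nodes>x<cards>', got {text!r}")
    return ClusterSize(int(nodes), int(cards))


def is_maintained_on_master(supported_version: str) -> bool:
    """A blank supported-version cell means the model is still maintained."""
    return supported_version.strip() == ""
```

For example, `parse_cluster_size("12x8").total_cards` gives the number of cards recommended for a row whose cluster size is 12x8.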

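Dense Models

A dense model is the conventional deep learning architecture: connections between neurons are dense, with most or all neurons in each layer connected to every neuron in the next layer. The structure is simple and training is relatively straightforward, but the parameter count is large and, because every parameter takes part in each forward pass, the compute cost is high.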

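Model	Download link (size/variant)	Script location	Sequence length	Training backend	Cluster size	Supported version	Contributor	Certification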
Aquila	7B	aquila	2K	Legacy	1x8	2.0.0	【GTS】	【Pass】
Aquila2	7B	aquila2	2K	Legacy	1x8	2.0.0	【GTS】	【Pass】
Aquila2	34B	aquila2	4K	Legacy	2x8	2.0.0	【GTS】	【Pass】
Baichuan	7B	baichuan	4K	Legacy	1x8	2.0.0	【GTS】	【Pass】
Baichuan	13B	baichuan	4K	Legacy	1x8	2.0.0	【GTS】	【Pass】
Baichuan2	7B	baichuan2	4K	Legacy	1x8	2.0.0	【Ascend】	【Pass】
Baichuan2	13B	baichuan2	4K	Mcore	1x8		【Ascend】	【Pass】
Bloom	7B1	bloom	2K	Legacy	1x8	2.0.0	【Ascend】	【Pass】
Bloom	176B	bloom	2K	Legacy	12x8	2.0.0	【Ascend】	【Pass】
ChatGLM3	6B	chatglm3	8K	Mcore	1x8		【Ascend】	【Pass】
			32K	Mcore	1x8		【Ascend】	【Pass】
			64K	Mcore	2x8		【Ascend】	【Pass】
GLM4	9B	glm4	8K	Mcore	1x8	2.2.0	【GTS】	【Pass】
GLM4	9B	glm4	32K	Mcore	2x8	2.2.0	【GTS】	【Pass】
CodeLlama	34B	codellama	4K	Mcore	2x8	2.2.0	【GTS】	【Pass】
InternLM	7B	intern	2K	Legacy	1x8	2.0.0	【Ascend】	【Pass】
InternLM	65B	intern	2K	Legacy	4x8	2.0.0	【Ascend】	【Pass】
InternLM2	20B	internlm2	4K	Mcore	1x8		【GTS】	【Pass】
InternLM2	20B	internlm2	32K	Mcore	1x8	2.2.0	【GTS】	【Pass】
InternLM2.5	1.8B	internlm25	32K	Mcore	1x8		【GTS】	【Pass】
	7B		32K	Mcore	1x8		【GTS】	【Pass】
	20B		32K	Mcore	2x8		【GTS】	【Test】
InternLM3	8B	internlm3	8K	Mcore	1x8		【Ascend】	【Pass】
LLaMA	7B	llama	2K	Legacy	1x8	2.0.0	【Ascend】	【Pass】
	13B		2K	Legacy	1x8	2.0.0	【Ascend】	【Pass】
	33B		2K	Legacy	4x8	2.0.0	【Ascend】	【Pass】
	65B		2K	Legacy	4x8	2.0.0	【Ascend】	【Pass】
LLaMA2	7B	llama2	4K	Mcore	1x8		【NAIE】	【Pass】
	13B		4K	Mcore	1x8		【NAIE】	【Pass】
	34B		4K	Mcore	2x8		【GTS】	【Pass】
	70B		4K	Mcore	4x8		【GTS】	【Pass】
	70B		128K	Mcore	8x8		【Ascend】	【Pass】
LLaMA3	8B	llama3	8K	Mcore	1x8		【GTS】	【Pass】
LLaMA3	70B	llama3	8K	Mcore	4x8		【GTS】	【Pass】
LLaMA3.1	8B	llama31	8K	Mcore	1x8		【GTS】	【Pass】
	8B		128K	Mcore	4x8		【GTS】	【Pass】
	50B		128K	Mcore	8x8		【Ascend】	【Pass】
	70B		8K	Mcore	4x8		【GTS】	【Pass】
	70B		128K	Mcore	24x8		【Ascend】	【Pass】
	200B		8K	Mcore	8x8		【Ascend】	【Pass】
	405B		8K	Mcore	8x8		【Ascend】	【Pass】
	405B		128K	Mcore	36x8		【Ascend】	【Pass】
LLaMA3.2	1B	llama32	8K	Mcore	1x8		【GTS】	【Pass】
LLaMA3.2	3B	llama32	8K	Mcore	1x8		【GTS】	【Pass】
LLaMA3.3	70B-Instruct	llama33	8K	Mcore	4x8		【GTS】	【Pass】
Qwen	7B	qwen	8K	Legacy	1x8	2.0.0	【GTS】	【Pass】
	14B		2K	Legacy	1x8	2.0.0	【GTS】	【Pass】
	72B		8K	Legacy	16x8	2.0.0	【GTS】	【Pass】
Qwen1.5	0.5B	qwen15	8K	Mcore	1x8	2.2.0	【GTS】	【Pass】
	1.8B		8K	Mcore	1x8		【GTS】	【Pass】
	4B		8K	Mcore	1x8		【GTS】	【Pass】
	7B		8K	Mcore	1x8		【GTS】	【Pass】
	14B		8K	Mcore	1x8		【GTS】	【Pass】
	32B		8K	Mcore	4x8		【GTS】	【Pass】
	72B		8K	Mcore	8x8		【GTS】	【Pass】
	110B		8K	Mcore	8x8		【GTS】	【Pass】
CodeQwen1.5	7B		8K	Mcore	1x8		【GTS】	【Pass】
Qwen2	0.5B	qwen2	4K	Mcore	1x8	2.2.0	【GTS】	【Pass】
	0.5B		32K	Mcore	1x8		【GTS】	【Pass】
	1.5B		4K	Mcore	1x8		【GTS】	【Pass】
	1.5B		32K	Mcore	1x8		【GTS】	【Pass】
	7B		4K	Mcore	1x8		【GTS】	【Pass】
	7B		32K	Mcore	1x8		【GTS】	【Pass】
	72B		4K	Mcore	4x8		【GTS】	【Pass】
	72B		32K	Mcore	16x8		【Ascend】	【Pass】
Qwen2.5	0.5B	qwen25	32K	Mcore	1x8		【GTS】	【Pass】
	1.5B		32K	Mcore	1x8		【GTS】	【Pass】
	3B		32K	Mcore	1x8		【GTS】	【Pass】
	7B		32K	Mcore	1x8		【Ascend】	【Pass】
	14B		32K	Mcore	2x8		【GTS】	【Pass】
	32B		32K	Mcore	4x8		【GTS】	【Pass】
	72B		32K	Mcore	16x8		【GTS】	【Pass】
Qwen3	0.6B	qwen3	4K	Mcore	1x8		【Ascend】	【Pass】
	1.7B		4K	Mcore	1x8		【Ascend】	【Pass】
	4B		4K	Mcore	1x8		【Ascend】	【Pass】
	8B		4K	Mcore	1x8		【Ascend】	【Pass】
	14B		4K	Mcore	1x8		【Ascend】	【Pass】
	32B		4K	Mcore	2x8		【Ascend】	【Pass】
	32B	qwen3	4K	FSDP2	1x16		【Ascend】	【Test】
QwQ	32B	qwq	4K	Mcore	1x8	2.2.0	【GTS】	【Test】
Qwen2.5-Math	1.5B	qwen25_math	4K	Mcore	1x8	2.2.0	【GTS】	【Pass】
	7B		4K	Mcore	1x8		【GTS】	【Pass】
	72B		4K	Mcore	4x8		【GTS】	【Test】
CodeQwen2.5	7B	qwen25_coder	8K	Mcore	1x8	2.2.0	【China Mobile Cloud】	【Test】
Yi	9B	yi	4K	Legacy	1x4	2.0.0	【OpenMind】	【Test】
Yi	34B	yi	4K	Mcore	2x8	2.2.0	【GTS】	【Pass】
Yi1.5	6B	yi15	4K	Mcore	1x8	2.2.0	【GTS】	【Pass】
	9B		4K	Mcore	1x8		【GTS】	【Pass】
	34B		4K	Mcore	2x8		【GTS】	【Test】
Mistral	7B	mistral	32K	Mcore	1x8	2.2.0	【NAIE】	【Pass】
Gemma	2B	gemma	8K	Mcore	1x8	2.2.0	【GTS】	【Pass】
Gemma	7B	gemma	8K	Mcore	1x8	2.2.0	【GTS】	【Pass】
Gemma2	9B	gemma2	8K	Mcore	1x8		【GTS】	【Pass】
Gemma2	27B	gemma2	8K	Mcore	2x8		【GTS】	【Pass】
MiniCPM	2B	minicpm	4K	Mcore	1x8	2.2.0	【NAIE】	【Pass】
MiniCPM3	4B	minicpm3	32K	Mcore	1x8	2.2.0	【GTS】	【Test】
Phi3.5	mini-instruct	phi35	4K	Mcore	1x8		【GTS】	【Test】
DeepSeek-Math	7B	deepseek_math	4K	Mcore	1x8	2.2.0	【Ascend】	【Test】
DeepSeek-R1-Distill-Qwen	1.5B	deepseek_r1_distill_qwen	4K	Mcore	1x8	2.2.0	【Ascend】	【Pass】
	7B		4K	Mcore	1x8		【Ascend】	【Pass】
	14B		4K	Mcore	1x8		【Ascend】	【Pass】
	32B		8K	Mcore	2x8		【Ascend】	【Pass】
DeepSeek-R1-Distill-LLaMA	8B	deepseek_r1_distill_llama	8K	Mcore	1x8	2.2.0	【Ascend】	【Pass】
DeepSeek-R1-Distill-LLaMA	70B	deepseek_r1_distill_llama	8K	Mcore	4x8	2.2.0	【Ascend】	【Pass】
Seed-OSS	36B	seed_oss	2K	Mcore	1x8		【Ascend】	【Test】
Magistral	24B	magistral	4K	Mcore	1x8		【Ascend】	【Test】
PLM	1.8B	plm	2K	Mcore	1x8		【Ascend】	【Test】

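Sparse Models

A sparse model uses a sparsely connected structure in which only a fraction of the neurons are connected. A typical example is the Mixture of Experts (MoE) model, which contains multiple expert networks and activates only a subset of them for each input. Compared with a dense model of the same total size, this design significantly reduces the number of activated parameters and the compute cost, which improves training efficiency and suits large datasets and complex tasks. Sparse models also have a drawback: expert load can easily become imbalanced, which makes training unstable.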
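
To make the routing idea concrete, here is a minimal pure-Python sketch of top-k expert selection and of how load imbalance can be measured. It is an illustration under assumptions, not the router used by any model in the table: the function names are hypothetical, and renormalizing the weights of the selected experts is only one common choice.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs, with weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(all_logits, k):
    """Count how many tokens each expert receives under top-k routing."""
    counts = Counter()
    for logits in all_logits:
        for index, _weight in route_top_k(logits, k):
            counts[index] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    num_experts, k = 8, 2
    # Synthetic router scores biased toward expert 0 to mimic an unbalanced router.
    tokens = [
        [rng.gauss(0.0, 1.0) + (2.0 if e == 0 else 0.0) for e in range(num_experts)]
        for _ in range(1000)
    ]
    load = expert_load(tokens, k)
    for expert in range(num_experts):
        print(f"expert {expert}: {load[expert]} tokens")
```

Only `k` of the experts run for each token, which is where the compute saving comes from. When the counts printed by the demo are very uneven, a few experts do most of the work; this is the load imbalance that training recipes usually counter with an auxiliary balancing term.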

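Model	Download link (size/variant)	Script location	Sequence length	Training backend	Cluster size	Supported version	Contributor	Certification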
Qwen3	30B-A3B	qwen3_moe	4K	Mcore	2x8		【Ascend】	【Pass】
	30B-A3B	qwen3_moe	4K	FSDP2	1x16		【Ascend】	【Test】
	235B-A22B	qwen3_moe	4K	Mcore	16x16		【Ascend】	【Pass】
	235B-A22B	qwen3_moe	4K	FSDP2	16x16		【Ascend】	【Test】
Qwen3-Next	80B-A3B	qwen3_next	16K	Mcore	4x16		【Ascend】	【Pass】
Qwen3-Next	80B-A3B	qwen3_next	16K	FSDP2	4x16		【Ascend】	【Test】
Qwen3-Coder-Next	80B-A3B	qwen3_coder_next	16K	Mcore	4x16		【Ascend】	【Test】
Qwen2	57B-A14B	qwen2_moe	4K	Mcore	8x8	2.2.0	【GTS】	【Pass】
Grok-1	40B	grok-1	8K	Mcore	4x8	2.0.0	【GTS】	【Pass】
Mixtral	8x7B	mixtral	32K	Mcore	8x8	2.2.0	【Ascend】	【Pass】
	8x22B		32K	Mcore	8x8		【NAIE】	【Pass】
	8x22B		64K	Mcore	8x8		【NAIE】	【Test】
DeepSeek-V2	236B	deepseek2	8K	Mcore	20x8	2.2.0	【Ascend】	【Pass】
DeepSeek-V2-coder	236B	deepseek2_coder	8K	Mcore	20x8	2.2.0	【Ascend】	【Test】
DeepSeek-V2-Lite	16B	deepseek2_lite	8K	Mcore	1x8		【Ascend】	【Pass】
DeepSeek-V2.5	236B	deepseek25	8K	Mcore	20x8	2.2.0	【NAIE】	【Test】
DeepSeek-V3	671B	deepseek3	4K	Mcore	64x8		【Ascend】	【Pass】
DeepSeek-V3.2	671B	deepseek3.2	4K	Mcore	32x16		【Ascend】	【Test】
MiniCPM	8x2B	minicpm	4K	Mcore	1x8	2.2.0	【NAIE】	【Test】
Ling-mini-2.0	16B	ling_v2	4K	Mcore	1x8		【Ascend】	【Test】
Ring	1T	ling_v2	32K	Mcore	32x8		【Ascend】	【Test】
Phi3.5	MoE-instruct	phi35	4K	Mcore	2x8		【GTS】	【Test】
Hunyuan	389B	hunyuanLarge	8K	Mcore	8x8		【Ascend】	【Pass】
GPT4	MoE-175B	gpt4	128K	Mcore	8x8		【Ascend】	【Pass】
GLM4.5-Air	MoE-106B	glm45-moe	4K	Mcore	8x8		【Ascend】	【Test】
GLM5	MoE-744B	glm5	4K	Mcore	32x16		【Ascend】	【Test】
Step3.5-Flash	MoE-196B	step35	4K	FSDP2	12x16		【Ascend】	【Test】
LongCat	MoE-560B	longcat	4K	Mcore	8x16		【Ascend】	【Test】
GPT-OSS	MoE-20B	gpt_oss	4K	FSDP2	1x16		【Ascend】	【Test】

Note

The vocabulary files for the GPT model differ from those of the other models and must be set up as follows:

mkdir -p vocab_file
cd vocab_file
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cd ..

# Preprocess the raw dataset into training data
python ./preprocess_data.py \
    --input ./dataset/ \
    --output-prefix ./dataset/gpt_text_sentence \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file ./vocab_file/gpt2-vocab.json \
    --merge-file ./vocab_file/gpt2-merges.txt \
    --append-eod \
    --workers 4 \
    --log-interval 1000

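# Set the following parameters in the pretraining script according to where the files are actually stored
VOCAB_FILE="./vocab_file/gpt2-vocab.json"   # vocabulary
MERGE_FILE="./vocab_file/gpt2-merges.txt"   # BPE merge table
DATA_PATH="./dataset/gpt_text_sentence"     # data path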
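
Before launching preprocessing, it can help to confirm that both downloads completed and are well-formed. The sketch below is an optional check, not part of this project: `check_gpt2_vocab` is a hypothetical helper, and it assumes the usual GPT-2 layout in which the vocabulary is a JSON object mapping tokens to ids and the merge file holds an optional `#version` header followed by one space-separated token pair per line.

```python
import json
from pathlib import Path


def check_gpt2_vocab(vocab_file, merge_file):
    """Validate GPT-2 vocabulary and merge files.

    Returns (vocab_size, merge_count); raises if a file is missing or malformed.
    """
    vocab_path, merge_path = Path(vocab_file), Path(merge_file)
    for path in (vocab_path, merge_path):
        if not path.is_file():
            raise FileNotFoundError(f"missing file: {path}")

    with vocab_path.open(encoding="utf-8") as handle:
        vocab = json.load(handle)
    if not isinstance(vocab, dict) or not vocab:
        raise ValueError(f"{vocab_path} is not a non-empty JSON object")

    merge_count = 0
    with merge_path.open(encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.rstrip("\n")
            # Skip blank lines and the optional '#version' header on line 1.
            if not line or (line_no == 1 and line.startswith("#version")):
                continue
            if len(line.split(" ")) != 2:
                raise ValueError(f"{merge_path}:{line_no}: expected two tokens")
            merge_count += 1
    return len(vocab), merge_count


if __name__ == "__main__":
    size, merges = check_gpt2_vocab(
        "./vocab_file/gpt2-vocab.json", "./vocab_file/gpt2-merges.txt"
    )
    print(f"vocabulary entries: {size}, merge rules: {merges}")
```

The paths in the demo match the `VOCAB_FILE` and `MERGE_FILE` values above; adjust them if the files are stored elsewhere.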

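State Space Models

A state space model (SSM) is a class of sequence models built on a state space representation that can model long sequences efficiently. Compared with conventional Transformer models, SSMs offer better compute and memory efficiency on long sequences.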
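
The efficiency claim follows from the recurrent form of a state space model. The sketch below shows the simplest case, a scalar linear recurrence h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t. It illustrates the general idea only; it is not the Mamba2 kernel, which uses input-dependent parameters and a parallel scan, and the name `ssm_scan` is hypothetical.

```python
def ssm_scan(xs, a, b, c, h0=0.0):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = h0
    ys = []
    for x in xs:
        # The state h is the only memory carried between steps.
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # An impulse followed by zeros shows the state decaying by a factor of `a` per step.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
```

Each step touches only the fixed-size state, so the work grows linearly with sequence length and the memory carried between steps stays constant, whereas self-attention compares every position with every earlier one.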

Model	Download link (size/variant)	Script location	Sequence length	Training backend	Cluster size	Contributor	Certification
Mamba2	2.7B	mamba2	4K	Mcore	1x8	【Ascend】	【Test】
Mamba2	8B	mamba2	4K	Mcore	1x8	【Ascend】	【Test】
Mamba2Hybrid	8B	mamba2	4K	Mcore	1x8	【Ascend】	【Test】

Note

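This open-source model does not ship a vocabulary file. The file used in internal testing, mamba2_2.7b_from_8b.model, is a custom design. Users are advised to build their own vocabulary; training results are not guaranteed.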
