Implement huggingface checkpoint loading and export #1305

lshpku · 2025-10-17T06:08:02Z

Implement correct Huggingface checkpoint loading & export

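Because this model's expert_id is local, from_pretrained previously could not load the expert weights correctly. This PR rewrites the loading logic so that expert_id is now mapped correctly. It works for both the 21B and 300B models.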
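
To illustrate what mapping a local expert_id involves, here is a minimal sketch. It is not the PR's implementation: it assumes experts are sharded contiguously across expert-parallel ranks, and the names `local_to_global`, `global_to_local`, `rename_expert_key` and the `...experts.<id>...` key layout are hypothetical.

```python
import re

# Matches the numeric index in names such as "...experts.<id>...."
# (hypothetical key layout, used only for illustration).
_EXPERT_INDEX = re.compile(r"(?<=\.experts\.)\d+(?=\.)")


def local_to_global(local_id: int, ep_rank: int, experts_per_rank: int) -> int:
    """Map a rank-local expert index to a global index.

    Assumes rank r owns the contiguous global range
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    """
    if not 0 <= local_id < experts_per_rank:
        raise ValueError(
            f"local_id {local_id} out of range for {experts_per_rank} experts per rank"
        )
    return ep_rank * experts_per_rank + local_id


def global_to_local(global_id: int, experts_per_rank: int) -> tuple:
    """Inverse mapping: return (ep_rank, local_id) for a global index."""
    return divmod(global_id, experts_per_rank)


def rename_expert_key(key: str, ep_rank: int, experts_per_rank: int) -> str:
    """Rewrite the local expert index inside a weight name to its global index."""

    def _replace(match):
        return str(local_to_global(int(match.group(0)), ep_rank, experts_per_rank))

    # Only the first expert index in the name is rewritten.
    return _EXPERT_INDEX.sub(_replace, key, count=1)
```

Under these assumptions, `rename_expert_key("layers.3.mlp.experts.1.up_proj.weight", 2, 8)` returns `"layers.3.mlp.experts.17.up_proj.weight"`, while names without an expert index are returned unchanged.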

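Introduction

We currently have two checkpoint formats. One is a training-only format that also saves the optimizer state, which makes it possible to resume training from a checkpoint. The other is the unified format, which saves only the model itself and can be used for both training and inference, but the loss does not continue from where it left off, so it is generally used only when exporting a trained model for inference.

Because the checkpoint formats differ, initial training, resuming from a checkpoint, and export each need a different configuration, as shown below:

Initial training

When starting training for the first time, set save_to_hf to false under trainer_args: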

trainer_args:
    save_to_hf: false

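(Add this parameter if it is missing, or change it if it already exists; the same applies below.)

Resuming from a checkpoint

Suppose a checkpoint was saved at step 100 and training was then stopped. To continue training from step 100, specify the following under trainer_args: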

trainer_args:
    resume_from_checkpoint: ./output/checkpoint-100
    save_to_hf: false

Final export

To export the checkpoint saved at step 500, specify the following under trainer_args:

trainer_args:
    resume_from_checkpoint: ./output/checkpoint-500
    save_to_hf: true
    unified_checkpoint: true
    max_steps: 1

(This effectively pretends to resume training from step 500 but runs no steps and updates no weights, and saves directly in the unified format.)
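
The three configurations differ only in a few keys, which the following sketch collects in one place. The helper `trainer_args_for` and the phase names are hypothetical; the keys and values are the ones listed in this description.

```python
from typing import Optional

_PHASES = ("initial", "resume", "export")


def trainer_args_for(phase: str, checkpoint_dir: Optional[str] = None) -> dict:
    """Return the trainer_args overrides for one phase (hypothetical helper)."""
    if phase not in _PHASES:
        raise ValueError(f"unknown phase {phase!r}, expected one of {_PHASES}")
    if phase == "initial":
        # First run: keep the training-only checkpoint format.
        return {"save_to_hf": False}
    if checkpoint_dir is None:
        raise ValueError(f"phase {phase!r} needs a checkpoint directory")
    if phase == "resume":
        # Continue training from a training-format checkpoint.
        return {"resume_from_checkpoint": checkpoint_dir, "save_to_hf": False}
    # Export: load the checkpoint and save it in the unified format.
    return {
        "resume_from_checkpoint": checkpoint_dir,
        "save_to_hf": True,
        "unified_checkpoint": True,
        "max_steps": 1,
    }
```

For example, `trainer_args_for("resume", "./output/checkpoint-100")` yields the two keys shown in the resume section.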

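Warning: the trainer_args: from_scratch: 0/1 parameter must stay the same throughout. If you used from_scratch: 0 for initial training, you must also use from_scratch: 0 when resuming and when exporting, and must not change it to 1 (and vice versa). Otherwise the loss after loading the weights will be very high!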
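
A quick consistency check can catch this mistake before a run is launched. The sketch below is only an illustration: it assumes each phase's yaml has already been parsed into a dict with from_scratch under trainer_args, and the function name `check_from_scratch` is hypothetical.

```python
from typing import Mapping


def check_from_scratch(configs: Mapping[str, Mapping]) -> int:
    """Return the shared from_scratch value, or raise if phases disagree.

    `configs` maps a phase name to its parsed config dict (assumed layout:
    {"trainer_args": {"from_scratch": 0 or 1, ...}}).
    """
    values = {}
    for name, cfg in configs.items():
        trainer_args = cfg.get("trainer_args", {})
        if "from_scratch" not in trainer_args:
            raise KeyError(f"{name}: from_scratch is not set under trainer_args")
        values[name] = int(trainer_args["from_scratch"])
    if not values:
        raise ValueError("no configs given")
    if len(set(values.values())) > 1:
        raise ValueError(f"from_scratch differs across phases: {values}")
    return next(iter(values.values()))
```

Calling it with the initial, resume, and export configs returns the common value, and raises `ValueError` as soon as one of them uses a different setting.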

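Running pre-training

Download the weights of the model you want to train; ERNIE-4.5-300B-A47B-Base-Paddle is used as the example here
The config.json shipped with the downloaded weights is written for inference, and some of its parameters even cause errors, so replace it with this repository's training config, model_configs/ERNIE-4p5-300B-A47B/model_config.json
In the model's yaml, change the following 2 parameters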

model_name_or_path: /path/to/your/ERNIE-4.5-300B-A47B-Base-Paddle
from_scratch: 0

To speed up testing and avoid GPU memory problems, it is also recommended to change the following parameters

use_recompute: true  # peak GPU memory 68G
use_fp8_mlp: false        # faster startup, about 5 min
use_fp8_fuse_node: false  # same as above
gradient_accumulation_steps: 20  # about 30 s per step
save_to_hf: false  # required with newer paddleformers; otherwise the saved checkpoint cannot be loaded

Then launch training with scripts/ERNIE-4p5-300B-A47B/train_96_gpus.sh
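
The config.json replacement step above can be sketched as follows. This is an illustrative helper, not part of the PR: the function name `replace_config` and the backup suffix are hypothetical, and the original inference config is kept as a backup so the swap can be undone.

```python
import shutil
from pathlib import Path


def replace_config(model_dir, training_config, backup_suffix=".inference.bak"):
    """Overwrite <model_dir>/config.json with the training config.

    The existing config.json is copied to config.json<backup_suffix> first,
    unless a backup already exists (so repeated calls keep the original).
    """
    target = Path(model_dir) / "config.json"
    backup = target.with_name(target.name + backup_suffix)
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(training_config, target)
    return target
```

A typical call would pass the downloaded model directory and model_configs/ERNIE-4p5-300B-A47B/model_config.json from this repository.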

Environment recommendations

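Use the original image. Do not update paddle or torch; only update to transformers==4.55.1 and aistudio-sdk==0.3.8
For paddleformers, the origin/release/v0.2 version is recommended. Avoid newer versions, because the checkpoints they save are incompatible; if you do not use the checkpoint conversion feature, you can ignore this
If you get a vocab_size mismatch error, check whether the model's config.json contains "vocab_size": 103424. The value in this repository may differ from the model's; the model's value takes precedence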
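
The vocab_size check can be scripted before launching a run. The sketch below is a hedged example: the helper names `load_config` and `vocab_size_problem` are hypothetical, and it only compares the two config files as plain JSON.

```python
import json
from typing import Mapping, Optional

EXPECTED_VOCAB_SIZE = 103424  # value quoted in this PR description


def load_config(path) -> dict:
    """Read a JSON config file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def vocab_size_problem(
    model_cfg: Mapping, repo_cfg: Mapping, expected: int = EXPECTED_VOCAB_SIZE
) -> Optional[str]:
    """Return a description of the mismatch, or None if everything agrees."""
    model_vocab = model_cfg.get("vocab_size")
    repo_vocab = repo_cfg.get("vocab_size")
    if model_vocab != expected:
        return f"model config has vocab_size={model_vocab}, expected {expected}"
    if repo_vocab != model_vocab:
        # The model's value takes precedence over the repository's.
        return f"training config has vocab_size={repo_vocab}; use the model's value {model_vocab}"
    return None
```

Here `model_cfg` would come from the config.json shipped with the downloaded weights and `repo_cfg` from the training config in this repository.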

Correctness check

21B: first-step loss about 2.1, global_grad_norm about 20
300B
- Aligned version: first-step loss about 2.2, global_grad_norm about 100
- Base version: first-step loss about 1.6, global_grad_norm about 30
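
These reference values can be turned into a rough sanity check on the first logged step. The sketch below is an assumption-heavy illustration: the model labels, the function name `looks_sane`, and the 25% relative tolerance are all hypothetical, since the description only gives approximate values.

```python
# Approximate first-step (loss, global_grad_norm) values from this description.
REFERENCE = {
    "21B": (2.1, 20.0),
    "300B-aligned": (2.2, 100.0),
    "300B-base": (1.6, 30.0),
}


def looks_sane(model: str, loss: float, grad_norm: float, rel_tol: float = 0.25) -> bool:
    """Return True if loss and grad norm are within rel_tol of the reference.

    rel_tol is an arbitrary illustrative threshold, not a documented one.
    """
    ref_loss, ref_norm = REFERENCE[model]
    loss_ok = abs(loss - ref_loss) <= rel_tol * ref_loss
    norm_ok = abs(grad_norm - ref_norm) <= rel_tol * ref_norm
    return loss_ok and norm_ok
```

A first-step loss far above the reference, for instance after changing from_scratch between phases, would make `looks_sane` return `False`.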

paddle-bot · 2025-10-17T06:08:08Z

Thanks for your contribution!

lshpku force-pushed the load-huggingface-ckpt branch 2 times, most recently from bf45b74 to b35dcd9 Compare October 23, 2025 08:24

lshpku force-pushed the load-huggingface-ckpt branch from b35dcd9 to c4c77d4 Compare November 19, 2025 11:25

lshpku force-pushed the load-huggingface-ckpt branch 2 times, most recently from 2c2fd43 to 09fdf27 Compare December 19, 2025 06:01

lshpku force-pushed the load-huggingface-ckpt branch 2 times, most recently from 1efcca7 to 99c5910 Compare December 19, 2025 12:11

Implement huggingface checkpoint loading and export

d41bd78

lshpku force-pushed the load-huggingface-ckpt branch from 99c5910 to d41bd78 Compare January 7, 2026 06:10

lshpku changed the title ~~Implement huggingface checkpoint loading~~ Implement huggingface checkpoint loading and export Jan 12, 2026
