Implement huggingface checkpoint loading and export #1305
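Implement correct Hugging Face checkpoint loading and export
Because this model's expert_id is local, from_pretrained previously could not load the expert weights correctly. This PR rewrites the loading logic so that expert_id is now mapped correctly. It works for both the 21B and 300B models.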
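To illustrate what "local expert_id" mapping involves, here is a minimal sketch, not the code in this PR. It assumes experts are sharded contiguously across expert-parallel ranks and that checkpoint keys look like `<prefix>.experts.<id>.<suffix>`. The key pattern and the names `ep_rank`, `experts_per_rank`, and `select_local_expert_weights` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical key layout (an assumption): "<prefix>.experts.<id>.<suffix>"
_EXPERT_KEY = re.compile(r"^(?P<prefix>.*\.experts\.)(?P<id>\d+)(?P<suffix>\..*)$")


def local_to_global_expert_id(local_id, ep_rank, experts_per_rank):
    """Map a rank-local expert index to a global one (contiguous sharding assumed)."""
    if not 0 <= local_id < experts_per_rank:
        raise ValueError(f"local_id {local_id} is out of range")
    return ep_rank * experts_per_rank + local_id


def global_to_local_expert_id(global_id, ep_rank, experts_per_rank):
    """Return the local index if this rank owns the expert, otherwise None."""
    start = ep_rank * experts_per_rank
    if start <= global_id < start + experts_per_rank:
        return global_id - start
    return None


def select_local_expert_weights(state_dict, ep_rank, experts_per_rank):
    """Keep shared weights plus the experts owned by this rank, renamed to local ids."""
    local_state = {}
    for key, value in state_dict.items():
        match = _EXPERT_KEY.match(key)
        if match is None:
            # Not an expert weight: every rank keeps it unchanged.
            local_state[key] = value
            continue
        local_id = global_to_local_expert_id(
            int(match["id"]), ep_rank, experts_per_rank
        )
        if local_id is not None:
            new_key = f"{match['prefix']}{local_id}{match['suffix']}"
            local_state[new_key] = value
    return local_state
```

Export would apply the inverse direction, using `local_to_global_expert_id` to rewrite each rank's local keys before the shards are merged into one checkpoint.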
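Introduction
Checkpoints currently come in two formats. The first is a training-only format that also saves the optimizer state, which makes it easy to resume from a checkpoint. The second is the unified format, which saves only the model itself and can be used for both training and inference. With the unified format the loss does not continue from where it left off, so it is generally used only when exporting for inference after training has finished.
Because the checkpoint formats differ, initial training, resuming from a checkpoint, and export each need a different configuration, as shown below: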
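Initial training
When starting training for the first time, set save_to_hf to false under trainer_args:
(Add this parameter if it does not exist, or modify it if it does. The same applies below.)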
Resuming from a checkpoint
Suppose you saved a checkpoint at step 100, stopped training, and now want to continue from step 100. Specify the following under trainer_args:
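Final export
Suppose you want to export the checkpoint from step 500. Specify the following under trainer_args:
(This is equivalent to pretending to resume training from step 500 without running any steps or updating any weights, and saving directly in the unified format.)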
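Warning:
The trainer_args: from_scratch: 0/1 parameter must stay the same throughout. If you used from_scratch: 0 for the initial training, you must also use from_scratch: 0 when resuming and for the final export, and must not change it to 1 (and vice versa). Otherwise the loss after loading the weights will be very high!
Run pre-training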
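Download the weights for the model you want to train. This example uses ERNIE-4.5-300B-A47B-Base-Paddle.
The config.json shipped with the downloaded weights is written for inference, and some of its parameters even cause errors, so replace it with this repository's training-specific model_configs/ERNIE-4p5-300B-A47B/model_config.json.
In the model's yaml, modify the following 2 parameters
Launch with scripts/ERNIE-4p5-300B-A47B/train_96_gpus.sh.
Environment recommendations
"vocab_size": 103424: the value in this repository may differ from the model's. The model's value takes precedence.
Correctness verification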