[WIP] Feat: add checkpoint loading mechanism by JYMiracle305 · Pull Request #146 · InfiniTensor/InfiniTrain

JYMiracle305 · 2026-04-21T05:56:22Z

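Main parameters of the checkpoint loading tool:

--checkpoint_dir directory where checkpoints are saved during training
--save_steps save a checkpoint every N steps; set to 0 to disable saving
--max_checkpoint_keep keep at most K checkpoints
--save_optimizer_state whether to save the optimizer state
--resume_from resume training from the specified checkpoint directory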
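
To make the interaction between --save_steps and --max_checkpoint_keep concrete, here is a minimal Python sketch. It is not InfiniTrain's implementation. The function names are hypothetical, and two behaviours are assumed rather than confirmed by this PR: a checkpoint is written whenever the step is a multiple of --save_steps, and the newest K checkpoints are the ones kept. Only the directory naming (checkpoint_step_000040) is taken from the commands in this description.

```python
def checkpoint_dir_name(step):
    # Naming taken from the PR's commands, e.g. checkpoint_step_000040.
    return f"checkpoint_step_{step:06d}"


def saved_steps(num_iteration, save_steps):
    # Assumption: a checkpoint is written whenever step % save_steps == 0.
    if save_steps <= 0:
        return []  # --save_steps 0 disables saving
    return [s for s in range(1, num_iteration + 1) if s % save_steps == 0]


def retained(steps, max_checkpoint_keep):
    # Assumption: the newest K checkpoints are kept, older ones are removed.
    if max_checkpoint_keep < 1:
        raise ValueError("this sketch only models K >= 1")
    kept = sorted(steps)[-max_checkpoint_keep:]
    return [checkpoint_dir_name(s) for s in kept]


steps = saved_steps(num_iteration=100, save_steps=20)
print(steps)               # [20, 40, 60, 80, 100]
print(retained(steps, 2))  # ['checkpoint_step_000080', 'checkpoint_step_000100']
```

Under these assumptions, the experiment settings (--num_iteration 100 --save_steps 20 --max_checkpoint_keep 10) would leave all five checkpoints on disk, including the checkpoint_step_000040 directory used for the resume run.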

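Checkpoint files can be produced by training from the original model parameters under /data/shared/....../llmc/gpt2 (or llama3); see REPORT.md in the repository for an example (the experiment also tested llama3, but only the GPT2 training commands are recorded). model.bin, optimizer.bin, and trainer_state.json can all be obtained from training, so they are not provided as attachments.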
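
Before passing a directory to --resume_from, it can help to check that the three files named above are present. The sketch below is a hypothetical helper, not part of this PR. It assumes that optimizer.bin is only expected when the checkpoint was written with --save_optimizer_state true, which the description does not state explicitly.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir, expect_optimizer=True):
    # File names come from the PR description; the helper itself is hypothetical.
    required = ["model.bin", "trainer_state.json"]
    if expect_optimizer:
        # Assumption: optimizer.bin exists only with --save_optimizer_state true.
        required.append("optimizer.bin")
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means that every expected file was found, for example when called on ../ckpt2/gpt2-noresume/checkpoint_step_000040/.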

Experiment

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-noresume/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-resumefrom40/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10 --resume_from ../ckpt2/gpt2-noresume/checkpoint_step_000040/ > ../ckpt2/gpt2-resumefrom40/gpt2-resume.log 2>&1

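(The same two training commands were also run with llama3.)

Running compare_loss.py on the llama3 model: because training resumes from step 40, data for steps 1~40 is missing, while the loss for the remaining 60 steps matches under both FP32 and BF16.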

  Summary: 60/100 steps matched

==================================================
Overall Summary:
  fp32:    0/1 test cases passed (threshold: 1e-05)
  bfloat16: 0/0 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================

==================================================
Overall Summary:
  fp32:    0/0 test cases passed (threshold: 1e-05)
  bfloat16: 0/1 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================
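
The output above reports 60/100 matched steps and therefore 0/1 passed test cases, even though every step present in both logs agrees. A minimal sketch of that counting logic follows. It is not the actual compare_loss.py: the function name is hypothetical, the threshold is assumed to be an absolute difference, and the loss values are synthetic.

```python
def compare_losses(baseline, resumed, threshold):
    # baseline / resumed: dict mapping step -> loss.
    # Assumption: a step matches when both logs contain it and the
    # absolute loss difference is within the threshold.
    total = len(baseline)
    matched = sum(
        1
        for step, loss in baseline.items()
        if step in resumed and abs(resumed[step] - loss) <= threshold
    )
    # A test case passes only if every baseline step matched.
    return matched, total, matched == total


# Synthetic losses for illustration only (not measured values).
baseline = {step: 1.0 / step for step in range(1, 101)}
# The resumed run starts from the step-40 checkpoint, so steps 1..40 are absent.
resumed = {step: loss for step, loss in baseline.items() if step > 40}

print(compare_losses(baseline, resumed, threshold=1e-5))  # (60, 100, False)
```

Under these assumptions the failed test case reflects the 40 missing steps rather than any divergence in the losses that were logged.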

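For GPT2, the model-saving logic was wrong: during training, lm_head and wte are not truly shared, yet the LLMC save/load path assumes they are, so after a resume lm_head can easily differ from the no-resume run. The fix is to switch the training checkpoint from the LLMC callback path to the native StateDict binary path and to explicitly rebuild the weight-tying semantics after loading (example/gpt2/main.cc). With this fix, GPT2 also passes the comparison.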
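
The failure mode described above can be illustrated with a small Python sketch. It is not the code in example/gpt2/main.cc: plain lists stand in for tensors, and the key names and the tie_weights flag are hypothetical. One pair of functions models a format that stores only wte and assumes lm_head is the same tensor. The other stores every entry, with tying re-established as an explicit step after loading.

```python
def save_assuming_tied(state):
    # Writes wte only, assuming lm_head is the very same tensor.
    return {"wte.weight": list(state["wte.weight"])}


def load_assuming_tied(blob):
    wte = list(blob["wte.weight"])
    # lm_head is reconstructed as an alias of wte.
    return {"wte.weight": wte, "lm_head.weight": wte}


def save_full(state):
    # Stores every entry independently, like a plain state-dict dump.
    return {name: list(values) for name, values in state.items()}


def load_full(blob, tie_weights=False):
    state = {name: list(values) for name, values in blob.items()}
    if tie_weights:
        # Explicitly rebuild the sharing after loading, if the model wants it.
        state["lm_head.weight"] = state["wte.weight"]
    return state


# Training state in which lm_head has drifted away from wte (not truly shared).
trained = {"wte.weight": [0.1, 0.2], "lm_head.weight": [0.1, 0.25]}

lossy = load_assuming_tied(save_assuming_tied(trained))
exact = load_full(save_full(trained))

print(lossy["lm_head.weight"] == trained["lm_head.weight"])  # False
print(exact["lm_head.weight"] == trained["lm_head.weight"])  # True
```

In this toy setup the tied-assumption round trip silently replaces lm_head with wte, which is the kind of mismatch between resume and no-resume runs that the PR describes. The full round trip preserves both entries.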

ArcaLunar and others added 5 commits April 21, 2026 11:25

feat: checkpoint save & load

69a0729

format: format files in examples and infini_train

ade0893

format: use clang-format-16 instead

feat: extract resuming to utils

39e89bf

remove redundent arguments

feat: extract similar logic in ckpt_save

b363779

format files

feat(checkpoint): reorganize checkpoint code and improve robustness

0a3deb2

JYMiracle305 force-pushed the feature/add_checkpoint branch from e8c5dd5 to 0a3deb2 on April 24, 2026 09:22

JYMiracle305 wants to merge 5 commits into master from feature/add_checkpoint
