finetune-lora: Add checkpoint saving & resuming from saved checkpoint
This PR adds checkpointing for fine-tuning:
- Add checkpoint saving every N steps with `--checkpoint-save-steps`
- Save complete training state: model weights, optimizer state, metadata
- Implement two-phase optimizer state loading to avoid memory issues
- Add `--resume-from-checkpoint` and `--auto-resume` functionality
- Store optimizer momentum/variance tensors in GGUF format
- Add checkpoint validation for rank, alpha, and target modules
- Update `README.md` with checkpointing documentation
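The save/validate flow above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the helper names (`make_checkpoint`, `validate_checkpoint`, `maybe_save`) and the dict-based state are hypothetical stand-ins for the GGUF-backed C++ code.

```python
def make_checkpoint(step, lora_weights, opt_state, rank, alpha, target_modules):
    """Bundle the complete training state into one checkpoint record:
    adapter weights, optimizer state, and the LoRA metadata needed for
    validation on resume."""
    return {
        "step": step,
        "weights": dict(lora_weights),    # LoRA adapter weights
        "optimizer": dict(opt_state),     # momentum/variance state (grad_m, grad_v)
        "meta": {"rank": rank, "alpha": alpha,
                 "target_modules": list(target_modules)},
    }

def validate_checkpoint(ckpt, rank, alpha, target_modules):
    """Refuse to resume if the checkpoint's LoRA config does not match
    the current run's rank, alpha, or target modules."""
    meta = ckpt["meta"]
    if meta["rank"] != rank:
        raise ValueError(f"rank mismatch: checkpoint={meta['rank']}, run={rank}")
    if meta["alpha"] != alpha:
        raise ValueError(f"alpha mismatch: checkpoint={meta['alpha']}, run={alpha}")
    if meta["target_modules"] != list(target_modules):
        raise ValueError("target-module mismatch")

def maybe_save(step, save_steps, make):
    """Emulate saving every N steps (the --checkpoint-save-steps behavior)."""
    return make() if save_steps > 0 and step % save_steps == 0 else None
```

On resume, validation runs before any tensor data is touched, so a mismatched adapter configuration fails fast instead of silently corrupting training.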
Optimizer state loading happens in two phases: the iteration count is loaded during initialization, while the tensor data (`grad_m`, `grad_v`) is loaded only after `ggml_opt_alloc` has created the proper tensor structures.
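The two-phase ordering can be sketched like this. It is a hypothetical Python model of the sequencing, not the ggml code itself: `Optimizer.alloc` stands in for `ggml_opt_alloc`, and the checkpoint is a plain dict rather than a GGUF file.

```python
class Optimizer:
    """Toy optimizer whose state tensors are allocated lazily,
    mirroring how ggml_opt_alloc creates tensor structures."""
    def __init__(self):
        self.iteration = 0
        self.tensors = {}   # empty until alloc() runs

    def alloc(self, names):
        # stand-in for ggml_opt_alloc: create the tensor buffers
        for n in names:
            self.tensors[n] = [0.0]

def load_phase1(opt, ckpt):
    # phase 1: cheap scalar state (iteration count) during initialization
    opt.iteration = ckpt["iteration"]

def load_phase2(opt, ckpt):
    # phase 2: tensor payloads, copied only into already-allocated buffers,
    # so the full optimizer state never needs a second staging copy
    for name, data in ckpt["tensors"].items():
        if name not in opt.tensors:
            raise KeyError(f"tensor {name} not allocated yet")
        opt.tensors[name] = list(data)
```

Splitting the load this way avoids holding duplicate optimizer tensors in memory: phase 2 writes directly into the buffers the allocator just created.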