Commit ea21768

finetune-lora: Add checkpoint saving & resuming from saved checkpoint
This PR adds checkpointing for fine-tuning:

- Add checkpoint saving every N steps with --checkpoint-save-steps
- Save complete training state: model weights, optimizer state, metadata
- Implement two-phase optimizer state loading to avoid memory issues
- Add --resume-from-checkpoint and --auto-resume functionality
- Store optimizer momentum/variance tensors in GGUF format
- Add checkpoint validation for rank, alpha, and target modules
- Update README.md with checkpointing documentation

Optimizer state loading is two-phase: the iteration count is loaded during initialization, while tensor data (grad_m, grad_v) is loaded only after ggml_opt_alloc has created the proper tensor structures.
1 parent 9bf57f1 commit ea21768
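
As a rough illustration of the two-phase optimizer-state restore described in the commit message, the sketch below reads only the iteration counter from `optimizer.gguf` before the optimizer context is set up, and copies the momentum/variance tensors into the live optimizer tensors only after `ggml_opt_alloc` has created them. This is a minimal sketch, not the commit's code: the GGUF key name (`optimizer.iteration`) and the helper names are assumptions, name-based tensor matching is assumed, and the copy assumes CPU-resident tensors.

```cpp
// Hypothetical sketch of the two-phase optimizer-state restore; names are assumed.
#include <cstdint>
#include <cstring>
#include <string>

#include "ggml.h"
#include "gguf.h"   // gguf API lives in gguf.h in current ggml; older trees expose it via ggml.h

// Phase 1: read only the iteration counter, before any optimizer tensors exist.
static int64_t load_checkpoint_iteration(const std::string & fname) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * gctx = gguf_init_from_file(fname.c_str(), params);
    if (!gctx) {
        return 0;
    }
    int64_t iter = 0;
    const int64_t kid = gguf_find_key(gctx, "optimizer.iteration"); // assumed key name
    if (kid >= 0) {
        iter = gguf_get_val_i64(gctx, kid);
    }
    gguf_free(gctx);
    return iter;
}

// Phase 2: after ggml_opt_alloc() has created the momentum/variance tensors,
// copy grad_m / grad_v data from the checkpoint into the live tensors by name.
static void load_checkpoint_moments(const std::string & fname, struct ggml_context * opt_tensors) {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params params = { /*no_alloc =*/ false, /*ctx =*/ &data_ctx };
    struct gguf_context * gctx = gguf_init_from_file(fname.c_str(), params);
    if (!gctx) {
        return;
    }
    for (struct ggml_tensor * src = ggml_get_first_tensor(data_ctx); src != nullptr;
         src = ggml_get_next_tensor(data_ctx, src)) {
        struct ggml_tensor * dst = ggml_get_tensor(opt_tensors, src->name);
        if (dst && ggml_nbytes(dst) == ggml_nbytes(src)) {
            // assumes host-resident tensors; GPU tensors would need ggml_backend_tensor_set
            memcpy(dst->data, src->data, ggml_nbytes(src));
        }
    }
    gguf_free(gctx);
    ggml_free(data_ctx);
}
```

Splitting the load this way matches the constraint stated in the commit message: the moment tensors do not exist until the optimizer context has allocated them, so only scalar metadata can be restored up front.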

File tree

10 files changed: +711 additions, −46 deletions

examples/training/README.md

Lines changed: 31 additions & 0 deletions
@@ -36,6 +36,14 @@ the base model frozen, making it memory-efficient.
 # Fine-tune existing LoRA adapter
 ./build/bin/llama-finetune-lora -m base_model.gguf -f dataset.txt --lora existing_adapter.gguf \
   --output-adapter improved_adapter.gguf -ngl 999 -c 512 -b 512 -ub 512
+
+# Training with checkpointing
+./build/bin/llama-finetune-lora -m model.gguf -f dataset.txt -ngl 999 -c 512 -b 512 -ub 512 \
+  --checkpoint-save-steps 50 --checkpoint-save-dir "./lora_checkpoints"
+
+# Resume training from checkpoint
+./build/bin/llama-finetune-lora -m model.gguf -f dataset.txt -ngl 999 -c 512 -b 512 -ub 512 \
+  --resume-from-checkpoint "./lora_checkpoints/checkpoint_step_00000150/"
 ```
 
 
@@ -53,6 +61,12 @@ the base model frozen, making it memory-efficient.
 - Default: `attn_q,attn_k,attn_v,attn_o` (attention modules)
 - `--output-adapter PATH` - Output adapter filename (default: auto-generated)
 
+#### Checkpointing
+- `--checkpoint-save-steps N` - Save checkpoint every N training steps (default: 100)
+- `--checkpoint-save-dir PATH` - Directory for checkpoints (default: `./checkpoints`)
+- `--resume-from-checkpoint PATH` - Resume training from specific checkpoint directory
+- `--auto-resume` - Automatically resume from latest checkpoint in save directory
+
 #### Standard Parameters
 - `-m MODEL` - Base model file (.gguf)
 - `-f FILE` - Training dataset
@@ -68,11 +82,28 @@ After training, you'll get a small adapter file. Use it with the original base m
 ./build/bin/llama-cli -m base_model.gguf --lora trained_adapter.gguf -ngl 999
 ```
 
+### Checkpointing
+
+The LoRA fine-tuning supports automatic checkpointing to save and resume training progress:
+
+#### Features
+- **Automatic saving**: Model and optimizer state saved every N training steps
+- **Complete state**: Includes LoRA weights, optimizer momentum, and training metadata
+- **Resume capability**: Continue training from exact step with full optimizer state
+- **Auto-resume**: Automatically find and resume from latest checkpoint
+
+#### Checkpoint Structure
+Each checkpoint directory contains:
+- `model.gguf` - LoRA adapter weights
+- `optimizer.gguf` - Optimizer state (momentum, variance, iteration)
+- `metadata.json` - Training parameters and step information
+
 ### Troubleshooting
 
 - **Out of memory**: Reduce context length (`-c 256`), lower rank, or use fewer target modules
 - **Poor quality**: Increase rank, add more target modules, or train longer
 - **Large adapter**: Reduce rank or limit target modules
+- **Checkpoint issues**: Ensure checkpoint directory contains all required files (model.gguf, optimizer.gguf, metadata.json)
 
 ### Help
 
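
For the `--auto-resume` flow and the checkpoint troubleshooting note in the README diff above, the following is a minimal sketch of how latest-checkpoint discovery might look. It assumes the `checkpoint_step_<NNNNNNNN>` directory naming shown in the README example and the three required files listed under Checkpoint Structure; the helper names are hypothetical and not taken from the commit.

```cpp
// Hypothetical sketch: pick the newest complete checkpoint under --checkpoint-save-dir.
#include <filesystem>
#include <optional>
#include <string>

namespace fs = std::filesystem;

static bool checkpoint_is_complete(const fs::path & dir) {
    // the three files the README lists for each checkpoint directory
    return fs::exists(dir / "model.gguf") &&
           fs::exists(dir / "optimizer.gguf") &&
           fs::exists(dir / "metadata.json");
}

static std::optional<fs::path> find_latest_checkpoint(const fs::path & save_dir) {
    std::optional<fs::path> best;
    std::string best_name;
    if (!fs::is_directory(save_dir)) {
        return best;
    }
    for (const auto & entry : fs::directory_iterator(save_dir)) {
        if (!entry.is_directory()) {
            continue;
        }
        const std::string name = entry.path().filename().string();
        // expects names like "checkpoint_step_00000150"
        if (name.rfind("checkpoint_step_", 0) != 0 || !checkpoint_is_complete(entry.path())) {
            continue;
        }
        // zero-padded step numbers sort lexicographically, so a string compare suffices
        if (name > best_name) {
            best_name = name;
            best      = entry.path();
        }
    }
    return best;
}
```

An incomplete directory (for example, a run interrupted mid-save) is simply skipped, which matches the troubleshooting advice to verify that all three files are present before resuming.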
