Conversation

@makaveli10 makaveli10 commented Oct 8, 2025

This PR adds checkpointing for fine-tuning:

  • Add checkpoint saving every N steps with --checkpoint-save-steps (see the example invocation after this list)
  • Save complete training state: model weights, optimizer state, metadata
  • Implement two-phase optimizer state loading to avoid memory issues
  • Add --resume-from and --auto-resume functionality
  • Store optimizer momentum/variance tensors in GGUF format
  • Add checkpoint validation for rank, alpha, and target modules
  • Update README.md with checkpointing documentation
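
For illustration, a hypothetical invocation using the flags listed above. The binary name llama-finetune, the -m option, and the file names are assumptions, not taken from this PR:

# save a checkpoint every 100 optimizer steps
llama-finetune -m base-model.gguf --checkpoint-save-steps 100

# resume from an explicit checkpoint, or pick up the most recent one automatically
llama-finetune -m base-model.gguf --resume-from checkpoint-step-500.gguf
llama-finetune -m base-model.gguf --auto-resume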

Optimizer state loading is done in two phases: the iteration count is loaded during initialization, while the tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc creates the proper tensor structures.
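
To make the two-phase split concrete, here is a minimal sketch against the public gguf/ggml/ggml-opt APIs. It is illustrative only: the checkpoint key training.iteration, the tensor names opt.grad_m.0 / opt.grad_v.0, and the getters ggml_opt_grad_m / ggml_opt_grad_v are assumptions, not necessarily the names this PR actually adds.

// Sketch only. Phase 1 runs during initialization, before the optimizer-state
// tensors exist; it touches nothing but GGUF key/value metadata.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-opt.h"
#include "gguf.h"

static int32_t checkpoint_load_iteration(const char * path) {
    struct ggml_context * meta_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ true, /*ctx =*/ &meta_ctx };
    struct gguf_context * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return 0;
    }
    int32_t iter = 0;
    const int64_t kid = gguf_find_key(gctx, "training.iteration"); // assumed key name
    if (kid >= 0) {
        iter = gguf_get_val_i32(gctx, kid);
    }
    gguf_free(gctx);
    ggml_free(meta_ctx);
    return iter;
}

// Phase 2 runs after ggml_opt_alloc() has created the optimizer-state tensors,
// so the saved grad_m / grad_v data has somewhere to go.
static void checkpoint_load_opt_state(const char * path, ggml_opt_context_t opt_ctx, struct ggml_tensor * node) {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ false, /*ctx =*/ &data_ctx };
    struct gguf_context * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return;
    }

    // assumed getters for the AdamW first/second moments; the PR's accessors may be named differently
    struct ggml_tensor * m_dst = ggml_opt_grad_m(opt_ctx, node);
    struct ggml_tensor * v_dst = ggml_opt_grad_v(opt_ctx, node);

    struct ggml_tensor * m_src = ggml_get_tensor(data_ctx, "opt.grad_m.0"); // assumed tensor name
    struct ggml_tensor * v_src = ggml_get_tensor(data_ctx, "opt.grad_v.0"); // assumed tensor name
    if (m_dst && m_src) {
        ggml_backend_tensor_set(m_dst, m_src->data, 0, ggml_nbytes(m_dst));
    }
    if (v_dst && v_src) {
        ggml_backend_tensor_set(v_dst, v_src->data, 0, ggml_nbytes(v_dst));
    }

    gguf_free(gctx);
    ggml_free(data_ctx);
}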

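On the saving side, a rough sketch of writing the iteration count plus the AdamW moment tensors into a GGUF checkpoint. Again, the key and tensor names are assumptions and this is not the PR's actual implementation:

// Sketch only: serialize optimizer state to a GGUF checkpoint file.
#include "ggml.h"
#include "ggml-backend.h"
#include "gguf.h"

static void checkpoint_save(const char * path, int32_t iter,
                            struct ggml_tensor * grad_m, struct ggml_tensor * grad_v) {
    struct gguf_context * gctx = gguf_init_empty();
    gguf_set_val_i32(gctx, "training.iteration", iter); // assumed key name

    // optimizer state lives in backend memory; copy it into host tensors so GGUF can serialize it
    struct ggml_init_params ip = {
        /*mem_size   =*/ 2*ggml_tensor_overhead() + ggml_nbytes(grad_m) + ggml_nbytes(grad_v) + 1024,
        /*mem_buffer =*/ nullptr,
        /*no_alloc   =*/ false,
    };
    struct ggml_context * host_ctx = ggml_init(ip);
    struct ggml_tensor * m = ggml_dup_tensor(host_ctx, grad_m);
    struct ggml_tensor * v = ggml_dup_tensor(host_ctx, grad_v);
    ggml_set_name(m, "opt.grad_m.0"); // assumed tensor names
    ggml_set_name(v, "opt.grad_v.0");
    ggml_backend_tensor_get(grad_m, m->data, 0, ggml_nbytes(grad_m));
    ggml_backend_tensor_get(grad_v, v->data, 0, ggml_nbytes(grad_v));

    gguf_add_tensor(gctx, m);
    gguf_add_tensor(gctx, v);
    gguf_write_to_file(gctx, path, /*only_meta =*/ false);

    gguf_free(gctx);
    ggml_free(host_ctx);
}
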
@github-actions github-actions bot added the Vulkan label Oct 9, 2025
…sion using .string().c_str().

Signed-off-by: Marcus Edel <[email protected]>
@gianni-cor gianni-cor self-requested a review October 10, 2025 07:26
params.escape = false;
parse_finetune_args(argc, argv, ft_params);

if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PERPLEXITY)) {

Can we make all parameters from LLAMA_EXAMPLE_PERPLEXITY also available in LLAMA_EXAMPLE_MAIN? We use the latter in the addon to access all tunable parameters.

@makaveli10 makaveli10 closed this Oct 16, 2025
@makaveli10 makaveli10 reopened this Oct 16, 2025
@makaveli10 makaveli10 changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:13
// get the gradient accumulator for a node from the forward graph
GGML_API struct ggml_tensor * ggml_opt_grad_acc(ggml_opt_context_t opt_ctx, struct ggml_tensor * node);

// get optimizer state tensors (momentum and variance for AdamW)

Can we also add SGD support, so we could choose between the two on the command line? SGD has fewer parameters, and it looks like it is already supported in llama.cpp.

Author (@makaveli10)

@olyasir We would recommend merging this one and adding SGD support in a later PR. If you insist, we can try supporting it in this PR as well.


No problem, it can be in a separate PR.

@olyasir olyasir merged commit a5810ed into tetherto:temp-latest-finetuning Oct 22, 2025
84 of 94 checks passed
