Conversation

@makaveli10 makaveli10 commented Oct 8, 2025

This PR adds checkpointing for fine-tuning:

  • Add checkpoint saving every N steps with --checkpoint-save-steps (see the example invocation after this list)
  • Save complete training state: model weights, optimizer state, metadata
  • Implement two-phase optimizer state loading to avoid memory issues
  • Add --resume-from and --auto-resume functionality
  • Store optimizer momentum/variance tensors in GGUF format
  • Add checkpoint validation for rank, alpha, and target modules
  • Update README.md with checkpointing documentation
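
For illustration, a hypothetical invocation using the flags listed above. The binary name llama-finetune, the -m option, and the file names are assumptions, not taken from this PR:

# save a checkpoint every 100 optimizer steps
llama-finetune -m base-model.gguf --checkpoint-save-steps 100

# resume from an explicit checkpoint, or pick up the most recent one automatically
llama-finetune -m base-model.gguf --resume-from checkpoint-step-500.gguf
llama-finetune -m base-model.gguf --auto-resume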

Optimizer state loading is done in two phases: the iteration count is loaded during initialization, while the tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc creates the proper tensor structures.
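
To make the two-phase split concrete, here is a minimal sketch against the public gguf/ggml/ggml-opt APIs. It is illustrative only: the checkpoint key training.iteration, the tensor names opt.grad_m.0 / opt.grad_v.0, and the getters ggml_opt_grad_m / ggml_opt_grad_v are assumptions, not necessarily the names this PR actually adds.

// Sketch only. Phase 1 runs during initialization, before the optimizer-state
// tensors exist; it touches nothing but GGUF key/value metadata.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-opt.h"
#include "gguf.h"

static int32_t checkpoint_load_iteration(const char * path) {
    struct ggml_context * meta_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ true, /*ctx =*/ &meta_ctx };
    struct gguf_context * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return 0;
    }
    int32_t iter = 0;
    const int64_t kid = gguf_find_key(gctx, "training.iteration"); // assumed key name
    if (kid >= 0) {
        iter = gguf_get_val_i32(gctx, kid);
    }
    gguf_free(gctx);
    ggml_free(meta_ctx);
    return iter;
}

// Phase 2 runs after ggml_opt_alloc() has created the optimizer-state tensors,
// so the saved grad_m / grad_v data has somewhere to go.
static void checkpoint_load_opt_state(const char * path, ggml_opt_context_t opt_ctx, struct ggml_tensor * node) {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ false, /*ctx =*/ &data_ctx };
    struct gguf_context * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return;
    }

    // assumed getters for the AdamW first/second moments; the PR's accessors may be named differently
    struct ggml_tensor * m_dst = ggml_opt_grad_m(opt_ctx, node);
    struct ggml_tensor * v_dst = ggml_opt_grad_v(opt_ctx, node);

    struct ggml_tensor * m_src = ggml_get_tensor(data_ctx, "opt.grad_m.0"); // assumed tensor name
    struct ggml_tensor * v_src = ggml_get_tensor(data_ctx, "opt.grad_v.0"); // assumed tensor name
    if (m_dst && m_src) {
        ggml_backend_tensor_set(m_dst, m_src->data, 0, ggml_nbytes(m_dst));
    }
    if (v_dst && v_src) {
        ggml_backend_tensor_set(v_dst, v_src->data, 0, ggml_nbytes(v_dst));
    }

    gguf_free(gctx);
    ggml_free(data_ctx);
}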

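On the saving side, a rough sketch of writing the iteration count plus the AdamW moment tensors into a GGUF checkpoint. Again, the key and tensor names are assumptions and this is not the PR's actual implementation:

// Sketch only: serialize optimizer state to a GGUF checkpoint file.
#include "ggml.h"
#include "ggml-backend.h"
#include "gguf.h"

static void checkpoint_save(const char * path, int32_t iter,
                            struct ggml_tensor * grad_m, struct ggml_tensor * grad_v) {
    struct gguf_context * gctx = gguf_init_empty();
    gguf_set_val_i32(gctx, "training.iteration", iter); // assumed key name

    // optimizer state lives in backend memory; copy it into host tensors so GGUF can serialize it
    struct ggml_init_params ip = {
        /*mem_size   =*/ 2*ggml_tensor_overhead() + ggml_nbytes(grad_m) + ggml_nbytes(grad_v) + 1024,
        /*mem_buffer =*/ nullptr,
        /*no_alloc   =*/ false,
    };
    struct ggml_context * host_ctx = ggml_init(ip);
    struct ggml_tensor * m = ggml_dup_tensor(host_ctx, grad_m);
    struct ggml_tensor * v = ggml_dup_tensor(host_ctx, grad_v);
    ggml_set_name(m, "opt.grad_m.0"); // assumed tensor names
    ggml_set_name(v, "opt.grad_v.0");
    ggml_backend_tensor_get(grad_m, m->data, 0, ggml_nbytes(grad_m));
    ggml_backend_tensor_get(grad_v, v->data, 0, ggml_nbytes(grad_v));

    gguf_add_tensor(gctx, m);
    gguf_add_tensor(gctx, v);
    gguf_write_to_file(gctx, path, /*only_meta =*/ false);

    gguf_free(gctx);
    ggml_free(host_ctx);
}
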
@github-actions github-actions bot added the Vulkan label Oct 9, 2025
…sion using .string().c_str().

Signed-off-by: Marcus Edel <[email protected]>
@gianni-cor gianni-cor self-requested a review October 10, 2025 07:26
params.escape = false;
parse_finetune_args(argc, argv, ft_params);

if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PERPLEXITY)) {

Can we make all parameters from LLAMA_EXAMPLE_PERPLEXITY also available in LLAMA_EXAMPLE_MAIN? We use the latter in the addon to access all tunable parameters.

@makaveli10 makaveli10 closed this Oct 16, 2025
@makaveli10 makaveli10 reopened this Oct 16, 2025
@makaveli10 makaveli10 changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:13
// get the gradient accumulator for a node from the forward graph
GGML_API struct ggml_tensor * ggml_opt_grad_acc(ggml_opt_context_t opt_ctx, struct ggml_tensor * node);

// get optimizer state tensors (momentum and variance for AdamW)

Can we also add SGD support, so we could choose between the two on the command line? SGD has fewer parameters, and it looks like it is already supported in llama.cpp.

Author (@makaveli10)

@olyasir We would recommend merging this one and adding SGD support in a later PR. If you insist, we can try supporting it in this PR as well.


No problem, it can be in a separate PR.

@olyasir olyasir merged commit a5810ed into tetherto:temp-latest-finetuning Oct 22, 2025
84 of 94 checks passed
