Conversation

@makaveli10 makaveli10 commented Oct 8, 2025

This PR adds checkpointing for fine-tuning:

  • Add checkpoint saving every N steps with --checkpoint-save-steps
  • Save complete training state: model weights, optimizer state, metadata
  • Implement two-phase optimizer state loading to avoid memory issues
  • Add --resume-from and --auto-resume functionality
  • Store optimizer momentum/variance tensors in GGUF format
  • Add checkpoint validation for rank, alpha, and target modules
  • Update README.md with checkpointing documentation

Optimizer state loading happens in two phases: the iteration count is loaded during initialization, while the tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc creates the proper tensor structures.
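
A minimal sketch of that two-phase flow, assuming the checkpoint is a GGUF file holding an iteration-count key plus per-tensor optimizer moments; the key name "optimizer.iteration", the tensor naming, and the set_opt_state_tensor() placeholder are assumptions for illustration, not the PR's actual implementation:

#include "ggml.h"
#include "gguf.h"

// Phase 1: during initialization, read only the scalar metadata (no tensor data needed yet).
static int32_t load_checkpoint_iteration(const char * path) {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ true, /*ctx =*/ &data_ctx };
    struct gguf_context  * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return 0;
    }
    const int64_t kid  = gguf_find_key(gctx, "optimizer.iteration"); // assumed key name
    const int32_t iter = kid >= 0 ? gguf_get_val_i32(gctx, kid) : 0;
    ggml_free(data_ctx);
    gguf_free(gctx);
    return iter;
}

// Phase 2: after ggml_opt_alloc has created the optimizer tensor structures,
// copy grad_m / grad_v data from the checkpoint into them.
static void load_checkpoint_opt_tensors(const char * path) {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params ip = { /*no_alloc =*/ false, /*ctx =*/ &data_ctx };
    struct gguf_context  * gctx = gguf_init_from_file(path, ip);
    if (!gctx) {
        return;
    }
    for (int64_t i = 0; i < gguf_get_n_tensors(gctx); ++i) {
        const char * name = gguf_get_tensor_name(gctx, i);
        struct ggml_tensor * src = ggml_get_tensor(data_ctx, name);
        // set_opt_state_tensor(name, src); // placeholder: copy src into the
        // allocated grad_m/grad_v tensor with the same name
        (void) src;
    }
    ggml_free(data_ctx);
    gguf_free(gctx);
}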

@github-actions github-actions bot added the Vulkan label Oct 9, 2025
…sion using .string().c_str().

Signed-off-by: Marcus Edel <[email protected]>
@gianni-cor gianni-cor self-requested a review October 10, 2025 07:26
params.escape = false;
parse_finetune_args(argc, argv, ft_params);

if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PERPLEXITY)) {

Can we make all parameters from LLAMA_EXAMPLE_PERPLEXITY also available in LLAMA_EXAMPLE_MAIN? We use the latter in the addon to access all tunable parameters.
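
If that change is made, the parse call in the snippet above would presumably just switch the example enum; a minimal sketch (the error handling is illustrative, not taken from the PR):

if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN)) {
    return 1;
}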

@makaveli10 makaveli10 closed this Oct 16, 2025
@makaveli10 makaveli10 reopened this Oct 16, 2025
@makaveli10 makaveli10 changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:13
// get the gradient accumulator for a node from the forward graph
GGML_API struct ggml_tensor * ggml_opt_grad_acc(ggml_opt_context_t opt_ctx, struct ggml_tensor * node);

// get optimizer state tensors (momentum and variance for AdamW)

Can we also add SGD support, so we could choose between the two on the command line? SGD has fewer parameters, and it looks like it is already supported in llama.cpp.
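
A minimal sketch of how that choice could be wired up, assuming a hypothetical --optimizer argument; the enum, struct, and parser below are illustrative placeholders, not llama.cpp's actual API:

#include <cstdio>
#include <string>

enum class finetune_optimizer { ADAMW, SGD };

struct finetune_opt_params {
    finetune_optimizer optimizer = finetune_optimizer::ADAMW; // keep AdamW as the default
};

// Map the value passed to a hypothetical --optimizer flag onto the enum.
static bool parse_optimizer_arg(const std::string & value, finetune_opt_params & params) {
    if (value == "adamw") { params.optimizer = finetune_optimizer::ADAMW; return true; }
    if (value == "sgd")   { params.optimizer = finetune_optimizer::SGD;   return true; }
    fprintf(stderr, "unknown optimizer '%s' (expected 'adamw' or 'sgd')\n", value.c_str());
    return false;
}

Keeping AdamW as the default would leave existing fine-tuning workflows unaffected while exposing the lighter-weight SGD option.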
