14 changes: 14 additions & 0 deletions README.md
@@ -516,6 +516,20 @@ To learn more about model quantization, [read this documentation](tools/quantize

</details>

#### LoRA Fine-Tuning

llama.cpp includes native LoRA (Low-Rank Adaptation) fine-tuning support across the CPU, Vulkan, Metal, and CUDA backends.

LoRA fine-tuning updates only a small set of low-rank matrices while keeping the base model frozen. This makes training possible on devices with very limited memory, including phones and integrated GPUs. Key capabilities include:

- Train LoRA adapters on any GPU (NVIDIA, AMD, Intel, Apple, Mali, Adreno)
- Full support for FP32/FP16/Q8/Q4 training paths
- Instruction-tuning via assistant-only masked loss
- Checkpointing + resumable training
- Merge LoRA adapters back into a standalone `.gguf` file
- Compatible with Qwen3, Gemma, LLaMA, TinyLlama, and other GGUF models
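
The adapter-merge step in the list above amounts to folding the trained low-rank product back into the frozen base weight. A minimal numpy sketch of that math (the hidden size, rank, and `alpha` scaling below are illustrative assumptions, not llama.cpp defaults):

```python
import numpy as np

# Hypothetical shapes for one projection matrix in a ~1B model.
d, r = 2048, 16                 # hidden size, LoRA rank (r << d)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32)  # trainable low-rank factor
B = np.zeros((d, r), dtype=np.float32)              # trainable, zero-initialized

# Merging folds the scaled low-rank product into the base weight,
# yielding a single standalone matrix with no extra inference cost.
alpha = 32.0
W_merged = W + (alpha / r) * (B @ A)

# Trainable parameters: 2*d*r instead of d*d for the full matrix.
full = W.size
lora = A.size + B.size
print(lora / full)  # 0.015625 -> ~1.6% of the full matrix
```

Because `B` starts at zero, the merged weight initially equals the base weight, so training starts from the unmodified model.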

The [Finetuning Guide](examples/training/README.md) has more details.

## Contributing

8 changes: 8 additions & 0 deletions examples/training/README.md
@@ -1,12 +1,20 @@

# llama.cpp/examples/training

## What is LoRA Fine-Tuning?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large models. Instead of updating all model weights, LoRA injects a pair of small, trainable low-rank matrices (A and B) into selected layers of the model. During training, only these matrices are updated, while the original model weights remain frozen. After training, the LoRA adapters can be merged with the base model for inference or kept separate for modularity.
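
The injected A/B pair described above can be sketched as a second, low-rank path added alongside the frozen weight. This is an illustrative numpy model of the math only, not the llama.cpp/GGML implementation; the dimensions and `alpha` scaling are assumptions, and zero-initializing `B` follows the standard LoRA convention:

```python
import numpy as np

# Toy dimensions; in practice r << d.
d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(1)

W = rng.standard_normal((d_in, d_out)).astype(np.float32)     # frozen
A = (rng.standard_normal((d_in, r)) * 0.01).astype(np.float32)  # trainable
B = np.zeros((r, d_out), dtype=np.float32)                      # trainable, zero-init
alpha = 8.0

def lora_forward(x):
    # Frozen base path plus scaled low-rank update path; during
    # training, gradients flow only into A and B.
    return x @ W + (x @ A @ B) * (alpha / r)

x = rng.standard_normal((2, d_in)).astype(np.float32)
# With B initialized to zero, the adapted layer reproduces the base
# model exactly at the start of training.
assert np.allclose(lora_forward(x), x @ W)
```

After training, `(alpha / r) * (A @ B)` can either be added into `W` (merged inference) or kept as a separate adapter.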

## finetune
This directory contains examples related to language model training using llama.cpp/GGML.
So far, finetuning is technically functional (for FP32 models and limited hardware setups), but the code is very much a work in progress.
Finetuning of Stories 260K and LLaMA 3.2 1B appears to work within 24 GB of memory.
**For CPU training, compile llama.cpp without any additional backends such as CUDA.**
**For CUDA training, use the maximum number of GPU layers.**

---


Proof of concept:

``` sh
...
```