diff --git a/README.md b/README.md
index 17f59e988e3..0198a67bb2e 100644
--- a/README.md
+++ b/README.md
@@ -516,6 +516,20 @@ To learn more about model quantization, [read this documentation](tools/quantize
 
+#### LoRA Fine-Tuning
+
+llama.cpp includes native LoRA (Low-Rank Adaptation) fine-tuning across the CPU, Vulkan, Metal, and CUDA backends.
+
+LoRA fine-tuning updates only a small set of low-rank matrices while keeping the base model frozen. This makes training possible on devices with very limited memory, including phones and integrated GPUs. Key capabilities include:
+
+- Train LoRA adapters on any GPU (NVIDIA, AMD, Intel, Apple, Mali, Adreno)
+- Full support for FP32/FP16/Q8/Q4 training paths
+- Instruction tuning via assistant-only masked loss
+- Checkpointing and resumable training
+- Merge LoRA adapters back into a standalone `.gguf` file
+- Compatible with Qwen3, Gemma, LLaMA, TinyLlama, and other GGUF models
+
+The [Finetuning Guide](examples/training/README.md) has more details.
 
 ## Contributing
 
diff --git a/examples/training/README.md b/examples/training/README.md
index ed255a0e1af..d24cd9077e5 100644
--- a/examples/training/README.md
+++ b/examples/training/README.md
@@ -1,5 +1,10 @@
 # llama.cpp/examples/training
 
+## What is LoRA Fine-Tuning?
+
+LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large models. Instead of updating all model weights, LoRA injects a pair of small, trainable low-rank matrices (A and B) into selected layers of the model. During training, only these matrices are updated, while the original model weights remain frozen. After training, the LoRA adapters can be merged with the base model for inference or kept separate for modularity.
+
 ## finetune
 
 This directory contains examples related to language model training using llama.cpp/GGML. So far finetuning is technically functional (for FP32 models and limited hardware setups) but the code is very much WIP.
@@ -7,6 +12,9 @@ Finetuning of Stories 260K and LLaMA 3.2 1b seems to work with 24 GB of memory.
 **For CPU training, compile llama.cpp without any additional backends such as CUDA.**
 **For CUDA training, use the maximum number of GPU layers.**
+
+---
+
 Proof of concept:
 
 ``` sh
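
As a companion to the "What is LoRA Fine-Tuning?" explanation added above, here is a minimal NumPy sketch of the idea behind a LoRA-adapted linear layer. It is illustrative only and not llama.cpp code: the layer shape, rank `r`, `alpha` scaling, and random initial values are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical shapes and hyperparameters for one linear layer (not taken from llama.cpp).
d_out, d_in, r, alpha = 256, 256, 8, 16
scale = alpha / r                         # standard LoRA scaling factor

W = 0.02 * np.random.randn(d_out, d_in)   # base weight: stays frozen during training
A = 0.01 * np.random.randn(r, d_in)       # trainable low-rank factor A (r x d_in)
B = 0.01 * np.random.randn(d_out, r)      # trainable low-rank factor B (d_out x r);
                                          # in real LoRA, B starts at zero and is learned

def forward(x):
    # The adapter adds a low-rank correction on top of the frozen base projection;
    # only A and B would receive gradient updates during training.
    return W @ x + scale * (B @ (A @ x))

# Merging folds the adapter into a single standalone weight matrix, which is
# conceptually what exporting a merged .gguf amounts to.
W_merged = W + scale * (B @ A)

x = np.random.randn(d_in)
assert np.allclose(forward(x), W_merged @ x)
```

In this example, with `r = 8` the adapter holds 2 * 256 * 8 = 4,096 trainable values versus 65,536 in the frozen base matrix, which is why LoRA fine-tuning fits on memory-constrained devices.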