
Commit 0fd18de

lora: Update readme; add architecture overview
1 parent 9271a97 commit 0fd18de

File tree

- README.md
- examples/training/README.md

2 files changed: 79 additions, 4 deletions

README.md

Lines changed: 14 additions & 0 deletions
@@ -516,6 +516,20 @@ To learn more about model quantization, [read this documentation](tools/quantize
</details>

## LoRA Fine-Tuning

llama.cpp includes native [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) fine-tuning across the CPU, Vulkan, Metal, and CUDA backends.

LoRA fine-tuning represents each weight update as the product of two much smaller matrices obtained through low-rank decomposition, while the base model stays frozen. Only these small matrices are trained, which keeps the number of trainable parameters low. This makes training possible on devices with very limited memory, including phones and integrated GPUs (see the parameter-count sketch after the list below). Key capabilities include:
- Train LoRA adapters on any GPU (NVIDIA, AMD, Intel, Apple, Mali, Adreno)
- Full support for FP32/FP16/Q8/Q4 training paths
- Instruction tuning via assistant-only masked loss
- Checkpointing and resumable training
- Merging of LoRA adapters back into the base model to produce a standalone `model.gguf`
- Compatible with Qwen3, Gemma, LLaMA, TinyLlama, and other GGUF models
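As a rough illustration of the memory savings (the sizes below are made-up example values, not from llama.cpp): for a `d×k` weight adapted at rank `r`, only `r·(d+k)` parameters are trained instead of `d·k`.

```cpp
#include <cstdio>

// Made-up example sizes: a 4096x4096 projection adapted at rank 16.
int main() {
    const long long d = 4096, k = 4096, r = 16;
    const long long full = d * k;        // parameters if W itself were trained
    const long long lora = r * (d + k);  // parameters in the A/B pair
    std::printf("full: %lld  lora: %lld  (%.2f%% of full)\n",
                full, lora, 100.0 * (double) lora / (double) full);
}
```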

The [Finetuning Guide](examples/training/README.md) has more details.
## Contributing

examples/training/README.md

Lines changed: 65 additions & 4 deletions
@@ -69,8 +69,6 @@ the base model frozen, making it memory-efficient.
- Available: `attn_q`, `attn_k`, `attn_v`, `attn_o`, `ffn_gate`, `ffn_up`, `ffn_down`, `embed`, `output`, `all`
- Default: `attn_q,attn_k,attn_v,attn_o` (attention modules)
- `--output-adapter PATH` - Output adapter filename (default: auto-generated)

Removed here (both flags move to the training-flags list below):

- `--assistant-loss-only` - Trains only on assistant tokens
- `--chat-template` - Jinja chat template for chat ML formatting to train on assistant tokens only

#### Checkpointing

- `--checkpoint-save-steps N` - Save checkpoint every N training steps (default: 100)
@@ -83,6 +81,14 @@ the base model frozen, making it memory-efficient.
- `-f FILE` - Training dataset
- `-ngl N` - GPU layers (use 999 for full GPU training)
- `-c N` - Context length (512 recommended for mobile)
- `--assistant-loss-only` - Train only on assistant tokens
- `--chat-template` - Jinja chat template used to format the dataset as chat markup so that only assistant tokens are trained on
- `--learning-rate` - AdamW learning rate (default: 1e-5)
- `--weight-decay` - AdamW weight decay (default: 1e-2)
- `--lr-scheduler` - Learning rate scheduler: constant, cosine, linear (default: constant)
- `--lr-min` - Minimum LR for cosine/linear schedulers (default: 0)
- `--warmup-ratio` - Fraction of total steps for LR warmup (default: 0.0)
- `--warmup-steps` - Explicit warmup steps (overrides warmup-ratio)
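These flags compose in the usual warmup-then-decay way. The sketch below shows the conventional semantics; the function and variable names are illustrative, not the llama.cpp implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Illustrative semantics of the scheduler flags above: linear warmup to the
// base learning rate, then constant, cosine, or linear decay toward lr_min.
double lr_at_step(long step, long total_steps, long warmup_steps,
                  double lr, double lr_min, const std::string & scheduler) {
    if (warmup_steps > 0 && step < warmup_steps) {
        return lr * double(step) / double(warmup_steps); // warmup phase
    }
    // progress through the post-warmup portion of training, in [0, 1]
    const double t = double(step - warmup_steps) /
                     double(std::max(1L, total_steps - warmup_steps));
    const double pi = 3.14159265358979323846;
    if (scheduler == "cosine") {
        return lr_min + 0.5 * (lr - lr_min) * (1.0 + std::cos(pi * t));
    }
    if (scheduler == "linear") {
        return lr + (lr_min - lr) * t;
    }
    return lr; // "constant"
}
```
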
### Using Trained Adapters
@@ -96,8 +102,6 @@ After training, you'll get a small adapter file. Use it with the original base m
### Checkpointing

LoRA fine-tuning supports automatic checkpointing to save and resume training progress:

Removed here: the `#### Features` subheading.
- **Automatic saving**: Model and optimizer state saved every N training steps
- **Complete state**: Includes LoRA weights, optimizer momentum, and training metadata
- **Resume capability**: Continue training from exact step with full optimizer state
@@ -109,6 +113,63 @@ Each checkpoint directory contains:
- `optimizer.gguf` - Optimizer state (momentum, variance, iteration)
- `metadata.json` - Training parameters and step information
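All of these files are needed to resume. A small sanity check one might write before resuming; the helper below is hypothetical, not part of llama.cpp:

```cpp
#include <filesystem>

// Hypothetical helper (not part of llama.cpp): confirm a checkpoint
// directory holds everything listed above before resuming from it.
bool checkpoint_is_complete(const std::filesystem::path & dir) {
    namespace fs = std::filesystem;
    return fs::exists(dir / "model.gguf")       // LoRA adapter weights
        && fs::exists(dir / "optimizer.gguf")   // momentum, variance, iteration
        && fs::exists(dir / "metadata.json");   // training parameters and step
}
```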

### Architecture Overview

This section explains how LoRA fine-tuning is implemented in llama.cpp:
**LoRA Adapter Management (`src/llama-lora-training.cpp`):**
This file manages the complete lifecycle of LoRA adapters:
1. **Adapter Creation (`llama_lora_create_adapter()`):**
   - Iterates through all model tensors to find target modules
   - Creates a low-rank matrix pair (A, B) for each selected module
   - Tensor naming: `blk.{layer}.{module}.lora_a` and `blk.{layer}.{module}.lora_b`
   - Dimensions: `B ∈ R^(d×r)` and `A ∈ R^(r×k)`, where `r` is the rank, so that `ΔW = BA ∈ R^(d×k)` matches the adapted weight
2. **Weight Initialization (`llama_lora_init_tensor_weights()`):**
   - Matrix A: Initialized from a Gaussian distribution N(0, init_std)
   - Matrix B: Initialized to zeros
   - This ensures ΔW = BA starts at zero (no adaptation initially); a sketch of this scheme follows the list
   - Supports both CPU and GPU tensors via `ggml_backend_tensor_set()`
3. **Buffer Allocation (`llama_lora_allocate_buffers()`):**
   - Auto-detects the backend from the base model (CPU/CUDA/Vulkan)
   - Allocates LoRA tensors on the same device as the model layers
   - Uses `ggml_backend_alloc_ctx_tensors_from_buft()` for optimal placement
4. **Module Selection:**
   - Scans tensor names for the attention patterns `attn_q`, `attn_k`, `attn_v`, `attn_output`
   - FFN modules: `ffn_gate`, `ffn_up`, `ffn_down`
   - Controlled by the `target_modules` bitmask (lines 194-211)
5. **Optimizer Integration (`llama_opt_param_filter_lora()`):**
   - Filter function for `ggml-opt` to identify trainable parameters
   - Returns `true` only for tensors with a `.lora_a` or `.lora_b` suffix
   - Ensures base model weights are excluded from gradient computation
6. **Checkpointing (`llama_lora_save_checkpoint()`):**
   - Creates the checkpoint directory structure
   - Saves `model.gguf` (LoRA weights via `llama_lora_save_adapter()`)
   - Saves `optimizer.gguf` (optimizer state via `ctx->opt_save_state()`)
   - Both files are required for resuming training
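For steps 1 and 2, here is a condensed sketch of what adapter creation and initialization can look like with the public ggml API. The helper name, `rank` argument, and `init_std` value are assumptions for the example; the real logic lives in `llama_lora_create_adapter()` and `llama_lora_init_tensor_weights()`:

```cpp
#include <random>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Illustrative sketch of adapter creation + init for one target weight `w`.
static void create_lora_pair(struct ggml_context * ctx, const struct ggml_tensor * w,
                             int64_t rank, const char * base_name) {
    // lora_a maps the input dimension down to the rank, lora_b maps it back
    // up, so the update B*A has the same shape as the frozen weight w.
    struct ggml_tensor * lora_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, w->ne[0], rank);
    struct ggml_tensor * lora_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, rank, w->ne[1]);

    ggml_format_name(lora_a, "%s.lora_a", base_name); // e.g. blk.0.attn_q.lora_a
    ggml_format_name(lora_b, "%s.lora_b", base_name);

    // --- after the tensors have been allocated to a backend buffer (step 3) ---

    // A: Gaussian N(0, init_std); B: zeros, so ΔW = BA starts at zero.
    std::mt19937 rng(1234);
    std::normal_distribution<float> dist(0.0f, /*init_std=*/0.02f);
    std::vector<float> a_data(ggml_nelements(lora_a));
    for (float & v : a_data) { v = dist(rng); }
    std::vector<float> b_data(ggml_nelements(lora_b), 0.0f);

    // works for both CPU and GPU tensors
    ggml_backend_tensor_set(lora_a, a_data.data(), 0, ggml_nbytes(lora_a));
    ggml_backend_tensor_set(lora_b, b_data.data(), 0, ggml_nbytes(lora_b));
}
```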
**Forward Pass (`ggml-opt.cpp:ggml_opt_forward()`):**
1. The input batch flows through the base model with the LoRA injections applied
2. Only the LoRA parameters are marked trainable, via `llama_opt_param_filter_lora`, so the loss is differentiated only with respect to them (a sketch of such a filter follows this list)
3. For instruction tuning with `--assistant-loss-only`, loss masking is applied to system/user tokens
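A minimal sketch of such a parameter filter, following the tensor-naming convention above; the body is illustrative, not the exact `llama_opt_param_filter_lora()` source:

```cpp
#include <string>

#include "ggml.h"

// Illustrative version of the filter described above: only tensors whose
// names end in .lora_a / .lora_b are reported as trainable; everything
// else (the base model) stays frozen.
static bool param_filter_lora(const struct ggml_tensor * tensor, void * /*userdata*/) {
    const std::string name = ggml_get_name(tensor);
    const auto ends_with = [&](const std::string & suffix) {
        return name.size() >= suffix.size() &&
               name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    return ends_with(".lora_a") || ends_with(".lora_b");
}
```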
**Backward Pass (`ggml-opt.cpp:ggml_opt_backward()`):**
1. Gradients are computed only for the LoRA adapters (matrices A and B)
2. Base model weights are excluded from gradient computation
3. Memory efficient: only gradients for the low-rank matrices are stored
**Optimizer State (`ggml-opt.cpp`):**
- Uses the AdamW optimizer by default
- Maintains a first moment (momentum) and second moment (variance) for each LoRA parameter
- State tensors: `opt_state_m` and `opt_state_v` for each adapter matrix
- The checkpoint format includes the full optimizer state for seamless resumption
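For reference, `opt_state_m` and `opt_state_v` correspond to the two moments of the standard AdamW update. A minimal scalar sketch; the defaults below are the usual AdamW constants plus the weight decay from the flag list, not values read from the source:

```cpp
#include <cmath>

// Minimal scalar AdamW step: m and v are the per-parameter state that the
// opt_state_m / opt_state_v tensors hold for each adapter matrix.
void adamw_step(float & w, float grad, float & m, float & v, int t,
                float lr, float beta1 = 0.9f, float beta2 = 0.999f,
                float eps = 1e-8f, float wd = 1e-2f) {
    m = beta1 * m + (1.0f - beta1) * grad;        // first moment (momentum)
    v = beta2 * v + (1.0f - beta2) * grad * grad; // second moment (variance)
    const float m_hat = m / (1.0f - std::pow(beta1, (float) t)); // bias correction
    const float v_hat = v / (1.0f - std::pow(beta2, (float) t));
    w -= lr * (m_hat / (std::sqrt(v_hat) + eps) + wd * w); // decoupled weight decay
}
```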
### Troubleshooting

- **Out of memory**: Reduce context length (`-c 256`), lower rank, or use fewer target modules
