This file manages the complete lifecycle of LoRA adapters:
- State tensors: `opt_state_m` and `opt_state_v` for each adapter matrix (see the sketch below)
- Checkpoint format includes full optimizer state for seamless resumption

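For orientation, those two tensors are the first and second Adam moments. Below is a minimal sketch of how they enter an update step, written as plain C++ with illustrative names rather than the actual ggml optimizer code:

```cpp
#include <cmath>
#include <cstddef>

// Adam-style update for one LoRA matrix, flattened to a float array.
// opt_state_m / opt_state_v mirror the checkpointed first/second moments;
// the step count t must also be restored for exact bias correction.
void adam_step(float * w, const float * grad,
               float * opt_state_m, float * opt_state_v,
               size_t n, int t,
               float lr = 1e-4f, float beta1 = 0.9f,
               float beta2 = 0.999f, float eps = 1e-8f) {
    for (size_t i = 0; i < n; ++i) {
        opt_state_m[i] = beta1 * opt_state_m[i] + (1.0f - beta1) * grad[i];
        opt_state_v[i] = beta2 * opt_state_v[i] + (1.0f - beta2) * grad[i] * grad[i];
        const float m_hat = opt_state_m[i] / (1.0f - std::pow(beta1, (float) t));
        const float v_hat = opt_state_v[i] / (1.0f - std::pow(beta2, (float) t));
        w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```
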
### Adding Support for a New Model Architecture in llama.cpp

Supporting a new Transformer model in llama.cpp is not automatic: it requires
implementing missing operators, adding tokenizer and prompt-format support, and
validating training/inference parity against a reference framework. The process is
repeatable, and model families such as Gemma, Qwen, Mistral, Grok, and Phi can be
onboarded by following a consistent workflow.

Below is a generalized end-to-end porting procedure, derived from our work enabling
Gemma/Qwen inference and LoRA finetuning across the CPU, Vulkan, Metal, and CUDA backends.

#### Step 1 — Analyze the Architecture

| Component           | Example Differences Across Models                     |
| ------------------- | ----------------------------------------------------- |
| Attention structure | Grouped-QKV, Multi-query, Sliding-window, Flash-attn  |
| FFN type            | SiLU-MLP (LLaMA), GEGLU (Gemma), SWIGLU/MoE (Mixtral) |
| Positional encoding | RoPE, XPos, ALiBi                                     |
| Tokenizer format    | BPE, SentencePiece, Unigram                           |
| Chat/Prompt style   | ChatML, Gemini-style blocks, Turn-format              |

If the model deviates from LLaMA in any FFN or attention component, new operators will be required.

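One way to make this analysis concrete is to record the deviations in a small checklist and derive the likely missing operators from it. The struct below is purely illustrative; these are not llama.cpp types:

```cpp
#include <string>
#include <vector>

// Illustrative architecture checklist; field values are examples only.
struct ArchSpec {
    std::string ffn_act      = "silu";  // "silu", "geglu", "swiglu", ...
    std::string pos_encoding = "rope";  // "rope", "xpos", "alibi"
    bool sliding_window      = false;
};

// Flag operator families that likely need new forward (and, for training,
// backward) kernels relative to a LLaMA-style baseline.
std::vector<std::string> missing_ops(const ArchSpec & arch) {
    std::vector<std::string> ops;
    if (arch.ffn_act != "silu")      ops.push_back("FFN activation (fwd + bwd)");
    if (arch.pos_encoding != "rope") ops.push_back("positional encoding op");
    if (arch.sliding_window)         ops.push_back("windowed attention mask");
    return ops;
}
```
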
#### Step 2 — Implement Missing Operators

Almost every new model brings at least one of these requirements:

| Op Type                  | Purpose                         | Required for           |
| ------------------------ | ------------------------------- | ---------------------- |
| Feed-forward activations | GEGLU, SWIGLU, MoE dispatch     | inference + training   |
| Loss + grad kernels      | CROSS_ENTROPY_BACK, MASKED_LOSS | LoRA/SFT training      |
| Matrix update ops        | OUT_PROD, OUT_PROD_BACK         | LoRA parameter updates |

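As an example of what such an operator looks like in isolation, here is a plain-C++ reference of a GEGLU forward pass. It is a sketch only: the split into gate/up halves and the function names are assumptions, not the ggml operator API.

```cpp
#include <cmath>
#include <cstddef>

// tanh-approximation GELU, commonly used inside GEGLU feed-forward blocks
static float gelu(float x) {
    const float c = 0.7978845608028654f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}

// GEGLU forward: the FFN up-projection is split into a gate half and an up half,
// and the output is gelu(gate) * up. The two-pointer layout is an assumption made
// for this sketch; a matching backward kernel is also needed before training works.
void geglu_forward(const float * gate, const float * up, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = gelu(gate[i]) * up[i];
    }
}
```
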
#### Step 3 — Backend Kernel Extension

Each operator only needs to exist in one backend for the graph to run, but practical
training support requires CPU, Vulkan, and CUDA implementations, with Metal optional.

| Backend | Required For                                               |
| ------- | ---------------------------------------------------------- |
| CPU     | reference correctness + unit tests                         |
| Vulkan  | cross-platform inference + LoRA (Adreno, Mali, AMD, Intel) |
| CUDA    | high-throughput training                                   |
| Metal   | iOS / Apple Silicon finetuning                             |

Mobile Vulkan needs special attention (see the splitting sketch after this list):

- operators must be tiled to fit within GPU SSBO memory windows
- OUT_PROD and MUL_MAT need dynamic splitting
- quantized INT4/INT8 variants reduce VRAM footprint dramatically

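A sketch of the row-splitting idea. The budget constant and the dispatch callback are placeholders; a real backend would size chunks from the device's reported limits (e.g. `maxStorageBufferRange`):

```cpp
#include <algorithm>
#include <cstddef>

// Choose how many rows fit in one dispatch under a conservative allocation budget.
// The 128 MiB default is illustrative; a real backend queries the Vulkan device limits.
static size_t rows_per_chunk(size_t n_rows, size_t row_bytes,
                             size_t max_alloc = (size_t) 128 * 1024 * 1024) {
    const size_t rows = std::max<size_t>(1, max_alloc / row_bytes);
    return std::min(rows, n_rows);
}

// Dispatch a large OUT_PROD / MUL_MAT-style operation in row chunks.
// enqueue_chunk is a placeholder for the backend-specific dispatch call.
void dispatch_in_chunks(size_t n_rows, size_t row_bytes,
                        void (*enqueue_chunk)(size_t row0, size_t n)) {
    const size_t step = rows_per_chunk(n_rows, row_bytes);
    for (size_t row0 = 0; row0 < n_rows; row0 += step) {
        enqueue_chunk(row0, std::min(step, n_rows - row0));
    }
}
```
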
#### Step 4 — Add Tokenizer, Prompt Format, Chat Template

Even if inference works, instruction finetuning will fail without chat formatting.

You must implement:

```
tokenizer.json / spm.model: convert to tokenizer.gguf
default chat.jinja: system/user/assistant roles
assistant-only masking: loss applies only to assistant tokens
```

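A sketch of the assistant-only masking logic: given token spans tagged with the chat role they came from, only assistant tokens receive loss. The span struct and role strings are illustrative, not llama.cpp types.

```cpp
#include <string>
#include <vector>

// Illustrative: one tokenized chat segment plus the role that produced it.
struct TaggedSpan {
    std::string      role;   // "system", "user", or "assistant"
    std::vector<int> tokens; // token ids for this segment, template markup included
};

// Build a per-token loss mask over the concatenated sequence: only tokens that
// belong to assistant turns contribute to the training loss.
std::vector<bool> assistant_loss_mask(const std::vector<TaggedSpan> & spans) {
    std::vector<bool> mask;
    for (const auto & span : spans) {
        const bool train_on = (span.role == "assistant");
        mask.insert(mask.end(), span.tokens.size(), train_on);
    }
    return mask;
}
```
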
Then train:

```bash
./llama-finetune-lora -m newmodel.gguf -f data.jsonl \
  --assistant-loss-only --chat-template template.jinja \
  --lora-rank 16 -ngl 999
```

#### Step 5 — Validation Workflow

Before claiming model support:

| Test                      | Pass Criteria                                   |
| ------------------------- | ----------------------------------------------- |
| Generate text             | No NaNs, stable token distribution              |
| Forward-only parity       | Output ≈ PyTorch within float tolerance         |
| 50–200 step LoRA run      | Loss decreases consistently                     |
| Merge-adapter → inference | Identical behavior to runtime adapter injection |
| Finetune resumption       | Checkpoint restore reproducible                 |

If inference works but training diverges, the cause is usually a missing backward operator.

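For the forward-only parity check, one simple approach is to dump logits from both runtimes for the same prompt and compare them element-wise. A sketch of the comparison (the tolerance is only a starting point and depends on precision and quantization):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compare llama.cpp logits against reference logits (e.g. exported from PyTorch)
// for the same prompt. Returns true if the max absolute difference is within tol.
bool logits_match(const std::vector<float> & ours, const std::vector<float> & ref,
                  float tol = 1e-2f) {
    if (ours.size() != ref.size()) return false;
    float max_diff = 0.0f;
    for (size_t i = 0; i < ours.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(ours[i] - ref[i]));
    }
    std::printf("max |logit diff| = %g\n", max_diff);
    return max_diff <= tol;
}
```
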
#### Example — Adding Support for Gemma (Inference + LoRA)

Gemma is a non-LLaMA architecture that uses GEGLU feed-forward layers, which means
new forward and backward operators must be implemented before LoRA finetuning becomes
functional. The reference implementation for this exists in the
[Gemma integration PR](https://github.com/tetherto/qvac-ext-lib-llama.cpp/pull/63).

This PR demonstrates a complete integration path covering inference, instruction
fine-tuning, and adapter merging, making it an ideal template when porting additional
architectures.
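The adapter-merge step referenced here and in Step 5 reduces to folding the low-rank product into the base weight, W' = W + (alpha / r) * B * A. A plain row-major sketch of that math (not the ggml implementation; dequantization is ignored):

```cpp
#include <cstddef>
#include <vector>

// Merge a LoRA adapter into a base weight matrix: W' = W + (alpha / r) * B * A.
// W is d_out x d_in, B is d_out x r, A is r x d_in; all row-major (an assumed layout,
// with base weights assumed to already be dequantized to float).
void merge_lora(std::vector<float> & W,
                const std::vector<float> & B, const std::vector<float> & A,
                size_t d_out, size_t d_in, size_t r, float alpha) {
    const float scale = alpha / (float) r;
    for (size_t i = 0; i < d_out; ++i) {
        for (size_t j = 0; j < d_in; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < r; ++k) {
                acc += B[i * r + k] * A[k * d_in + j];
            }
            W[i * d_in + j] += scale * acc;
        }
    }
}
```
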

### Troubleshooting
