Commit a76d2d9

zoq authored and gianni-cor committed
Add guide about how to support a new model.
Signed-off-by: Marcus Edel <[email protected]>
1 parent 5df8e41

examples/training/README.md

Lines changed: 94 additions & 0 deletions
@@ -169,6 +169,100 @@ This file manages the complete lifecycle of LoRA adapters:
- State tensors: `opt_state_m` and `opt_state_v` for each adapter matrix
- Checkpoint format includes full optimizer state for seamless resumption

### Adding Support for a New Model Architecture in llama.cpp

Supporting a new Transformer model in llama.cpp is not automatic — it requires
implementing missing operators, adding tokenizer + prompt formatting, and
validating training/inference parity with a reference framework. The process is
repeatable, and model families such as Gemma, Qwen, Mistral, Grok, and Phi can be
onboarded by following a consistent workflow.

Below is a generalized end-to-end porting procedure, derived from our work enabling
Gemma/Qwen inference and LoRA finetuning across CPU, Vulkan, Metal, and CUDA backends.

#### Step 1 — Analyze the Architecture

| Component           | Example Differences Across Models                     |
| ------------------- | ----------------------------------------------------- |
| Attention structure | Grouped-QKV, Multi-query, Sliding-window, Flash-attn  |
| FFN type            | SiLU-MLP (LLaMA), GEGLU (Gemma), SWIGLU/MoE (Mixtral) |
| Positional encoding | RoPE, XPos, ALiBi                                     |
| Tokenizer format    | BPE, SentencePiece, Unigram                           |
| Chat/Prompt style   | ChatML, Gemini-style blocks, Turn-format              |

If the model deviates from LLaMA in any FFN or attention component, new ops will be required.

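The FFN row is a typical example of such a deviation: LLaMA-style blocks use a SiLU-gated MLP, while Gemma uses a GELU-gated (GEGLU) one, so a GELU-gated path (forward and, for training, backward) must exist. The sketch below is a plain-C++ reference of the two forward paths, not llama.cpp code; the weight names and shapes are illustrative only.

```cpp
#include <cmath>
#include <vector>

// Reference (CPU, single token) forward pass for two FFN variants.
// W_gate, W_up: [d_ff x d_model], W_down: [d_model x d_ff], row-major.
// Names and layout are illustrative, not the llama.cpp tensor layout.

static std::vector<float> matvec(const std::vector<float>& W,
                                 const std::vector<float>& x,
                                 int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            y[r] += W[r * cols + c] * x[c];
    return y;
}

static float silu(float v) { return v / (1.0f + std::exp(-v)); }
static float gelu(float v) {   // tanh approximation of GELU
    return 0.5f * v * (1.0f + std::tanh(0.7978845608f * (v + 0.044715f * v * v * v)));
}

// LLaMA-style FFN: down( silu(gate(x)) * up(x) )
std::vector<float> ffn_silu_mlp(const std::vector<float>& Wg, const std::vector<float>& Wu,
                                const std::vector<float>& Wd, const std::vector<float>& x,
                                int d_model, int d_ff) {
    auto g = matvec(Wg, x, d_ff, d_model);
    auto u = matvec(Wu, x, d_ff, d_model);
    for (int i = 0; i < d_ff; ++i) g[i] = silu(g[i]) * u[i];
    return matvec(Wd, g, d_model, d_ff);
}

// Gemma-style GEGLU FFN: identical structure, but the gate uses GELU,
// which is exactly the kind of difference that forces a new operator.
std::vector<float> ffn_geglu(const std::vector<float>& Wg, const std::vector<float>& Wu,
                             const std::vector<float>& Wd, const std::vector<float>& x,
                             int d_model, int d_ff) {
    auto g = matvec(Wg, x, d_ff, d_model);
    auto u = matvec(Wu, x, d_ff, d_model);
    for (int i = 0; i < d_ff; ++i) g[i] = gelu(g[i]) * u[i];
    return matvec(Wd, g, d_model, d_ff);
}
```
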
#### Step 2 — Implement Missing Operators

Almost every new model brings at least one of these requirements:

| Op Type                  | Purpose                         | Required for           |
| ------------------------ | ------------------------------- | ---------------------- |
| Feed-forward activations | GEGLU, SWIGLU, MoE dispatch     | inference + training   |
| Loss + grad kernels      | CROSS_ENTROPY_BACK, MASKED_LOSS | LoRA/SFT training      |
| Matrix update ops        | OUT_PROD, OUT_PROD_BACK         | LoRA parameter updates |

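As a concrete illustration of the last row, an OUT_PROD-style kernel accumulates an outer product, which is exactly what LoRA gradient updates need (dW += g * x^T summed over tokens). The snippet below is a minimal CPU reference sketch of that computation; it mirrors the math, not the actual ggml kernel or its memory layout.

```cpp
#include <vector>

// Minimal CPU reference for an OUT_PROD-style op: accumulate the outer
// product of two vectors into a [rows x cols] matrix, summed over a batch.
// Sketch of the math only, not the ggml kernel or its tensor layout.
void out_prod_accumulate(std::vector<float>& dW,       // rows * cols, row-major
                         const std::vector<float>& g,  // batch * rows
                         const std::vector<float>& x,  // batch * cols
                         int batch, int rows, int cols) {
    for (int b = 0; b < batch; ++b)
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                dW[r * cols + c] += g[b * rows + r] * x[b * cols + c];
}
```

A CPU reference like this usually pays for itself quickly: once it matches a PyTorch einsum on random inputs, the GPU kernels can be validated against it.
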
#### Step 3 — Backend Kernel Extension

Each operator must exist in at least one backend to work, but the training path must be
supported on CPU, Vulkan, and CUDA, and optionally on Metal.

| Backend | Required For                                               |
| ------- | ---------------------------------------------------------- |
| CPU     | reference correctness + unit tests                         |
| Vulkan  | cross-platform inference + LoRA (Adreno, Mali, AMD, Intel) |
| CUDA    | high throughput training                                   |
| Metal   | iOS / Apple Silicon finetuning                             |

Special attention is needed for mobile Vulkan:

- operators must tile to fit GPU SSBO memory windows
- OUT_PROD + MUL_MAT need dynamic splitting (see the sketch after this list)
- quantized INT4/INT8 variants reduce VRAM footprint dramatically

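The dynamic-splitting point can be made concrete with a small helper: given a per-binding size limit (for example the device's `maxStorageBufferRange`), split a matrix into row chunks so each dispatch stays inside the window. This is a hedged sketch of the tiling arithmetic only; the real Vulkan backend has its own splitting and alignment rules.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Split an n_rows x n_cols fp32 matrix into row chunks so that each chunk,
// bound as a single SSBO, stays under max_binding_bytes. Illustrative
// arithmetic only; the actual backend applies its own tiling rules.
struct RowChunk { int64_t row0; int64_t n_rows; };

std::vector<RowChunk> split_rows_for_ssbo(int64_t n_rows, int64_t n_cols,
                                          int64_t max_binding_bytes) {
    const int64_t row_bytes       = n_cols * (int64_t) sizeof(float);
    const int64_t rows_per_chunk  = std::max<int64_t>(1, max_binding_bytes / row_bytes);
    std::vector<RowChunk> chunks;
    for (int64_t r = 0; r < n_rows; r += rows_per_chunk) {
        chunks.push_back({r, std::min(rows_per_chunk, n_rows - r)});
    }
    return chunks;
}

// Example: a 32000 x 4096 fp32 matrix with a 128 MiB binding limit splits
// into four dispatches instead of one oversized binding.
```
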
#### Step 4 — Add Tokenizer, Prompt Format, Chat Template

Even if inference works, instruction finetuning will fail without chat formatting.

You must implement:

```
tokenizer.json / spm.model: convert to tokenizer.gguf
default chat.jinja: system/user/assistant roles
assistant-only masking: loss applies only to assistant tokens
```

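Assistant-only masking is the piece most often missed. The sketch below shows the idea in plain C++, assuming a hypothetical per-token role tag produced while rendering the chat template; the real implementation keys off the template's role boundaries in the tokenized sample.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-token annotation produced while rendering the chat template.
enum class Role : uint8_t { System, User, Assistant };

// Build per-token labels for assistant-only SFT: positions belonging to an
// assistant turn keep their target token, everything else is masked out
// (-100 is the conventional "ignore" label in most training stacks).
std::vector<int32_t> build_labels(const std::vector<int32_t>& tokens,
                                  const std::vector<Role>&    roles) {
    std::vector<int32_t> labels(tokens.size(), -100);  // -100 = ignored by the loss
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (roles[i] == Role::Assistant) {
            labels[i] = tokens[i];                      // loss applies only here
        }
    }
    return labels;
}
```
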
Then train:

```bash
./llama-finetune-lora -m newmodel.gguf -f data.jsonl \
    --assistant-loss-only --chat-template template.jinja \
    --lora-rank 16 -ngl 999
```

#### Step 5 — Validation Workflow

Before claiming model support:

| Test                      | Pass Criteria                                    |
| ------------------------- | ------------------------------------------------ |
| Generate text             | No NaNs, stable token distribution               |
| Forward-only parity       | Output ≈ PyTorch within float tolerance          |
| 50–200 step LoRA run      | Loss decreases consistently                      |
| Merge-adapter → inference | Identical behavior to runtime adapter injection  |
| Finetune resumption       | Checkpoint restore is reproducible               |

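The forward-only parity row is the easiest to automate: dump logits from the reference framework for a few fixed prompts, then compare them against llama.cpp's logits within a float tolerance. A minimal comparison helper, assuming both logit vectors have already been exported to flat float arrays (the tolerance values are assumptions; fp16 or quantized paths typically need looser bounds than fp32):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare two logit vectors (e.g. llama.cpp vs a PyTorch reference dumped to
// disk) and report the worst absolute difference.
bool logits_match(const std::vector<float>& ours,
                  const std::vector<float>& reference,
                  float abs_tol = 5e-3f) {
    if (ours.size() != reference.size()) return false;
    float worst = 0.0f;
    for (size_t i = 0; i < ours.size(); ++i) {
        worst = std::max(worst, std::fabs(ours[i] - reference[i]));
    }
    std::printf("max |delta| = %.6f (tol %.6f)\n", worst, abs_tol);
    return worst <= abs_tol;
}
```
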
If inference works but training diverges, the cause is most often a missing backward op.

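A quick way to pin that down is a finite-difference check on the suspect op: perturb one input element, recompute a scalar loss, and compare the numerical slope against what the backward kernel reports. The helper below is a generic sketch around any f(x) -> loss you can evaluate twice; it is not tied to ggml's graph API, and the eps/tol values are assumptions.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Finite-difference gradient check: compares the analytic gradient produced by
// a backward kernel against a central-difference estimate of a scalar loss.
bool grad_check(const std::function<float(const std::vector<float>&)>& f,
                std::vector<float> x,
                const std::vector<float>& analytic_grad,
                float eps = 1e-3f, float tol = 1e-2f) {
    for (size_t i = 0; i < x.size(); ++i) {
        const float saved = x[i];
        x[i] = saved + eps; const float up   = f(x);
        x[i] = saved - eps; const float down = f(x);
        x[i] = saved;
        const float numeric = (up - down) / (2.0f * eps);
        if (std::fabs(numeric - analytic_grad[i]) > tol * (1.0f + std::fabs(numeric))) {
            return false;   // mismatch: the backward path for this element is wrong
        }
    }
    return true;
}
```
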
#### Example — Adding Support for Gemma (Inference + LoRA)

Gemma is a non-LLaMA architecture requiring GEGLU feed-forward layers, which means
new forward + backward operators must be implemented before LoRA finetuning becomes
functional. The reference implementation for this exists in the [Gemma integration
PR](https://github.com/tetherto/qvac-ext-lib-llama.cpp/pull/63).

This PR demonstrates a complete integration path: inference, instruction finetuning,
and adapter merging, making it an ideal template when porting additional architectures.

### Troubleshooting
