Commit 6ac427d

llama-router: document KISS philosophy, optimization patterns, and system architecture

1 parent 18df088 commit 6ac427d

1 file changed: tools/router/README.md (+217 -22 lines)

llama-router builds and runs on both Linux and Windows. The router automatically detects the llama-server binary location relative to its own executable, falling back to PATH resolution if not found.

## Design Philosophy

llama-router follows KISS (Keep It Simple, Stupid) principles:

- **Minimal configuration**: Works out of the box with HF cache scanning
- **Explicit persistence**: Config changes are written explicitly via admin endpoints, never hidden in business logic
- **Separation of concerns**: Core routing logic (`RouterApp`) has zero I/O; persistence is handled by the admin layer
- **Simple endpoint matching**: Prefix-based matching, no complex regex
- **Transparent proxy**: Headers and streaming forwarded as-is
- **On-demand only**: No models start at boot; everything spawns when first requested

### The auto + default_spawn Workflow

Models discovered from the Hugging Face cache are marked as `auto` and inherit the `default_spawn` configuration. This creates a powerful optimization pattern:

1. **Tune `default_spawn` once** with your preferred parameters (GPU layers, KV cache quantization, context size, etc.)
2. **All `auto` models automatically use these settings** - no per-model configuration needed
3. **Change `default_spawn` and reload** - all `auto` models are instantly updated
4. **Customize individual models** by switching them to the `manual` state first to prevent rescan overwrites

This ensures consistent, optimized behavior across your entire model collection while allowing per-model overrides when needed. **Always set models to `manual` before customizing their spawn parameters** - otherwise your changes will be lost on the next rescan.

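A minimal sketch of that reload cycle, using the default config path and port from this document (`$EDITOR` stands in for your editor of choice):

```sh
# 1. Edit default_spawn in the router config
$EDITOR ~/.config/llama.cpp/router-config.json

# 2. Hot-reload the configuration; all `auto` models pick up
#    the new settings the next time they are spawned
curl http://localhost:8082/admin/reload
```
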
## Multi-Engine Support

llama-router is engine-agnostic. Any OpenAI-compatible inference backend can be orchestrated by configuring the appropriate spawn command and endpoints. The router simply:

1. Spawns the command specified in `spawn.command`
2. Polls `health_endpoint` until it returns HTTP 200 (customizable per backend)
3. Proxies requests matching `proxy_endpoints` to the running instance

This design allows you to mix llama.cpp, vLLM, Ollama, Text Generation Inference, or any custom backend in a single router configuration. Set models to the `manual` state when using non-llama.cpp backends to prevent automatic cache rescans from removing them.

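For illustration only, such an entry might look like the sketch below; `my-openai-server` and its `--quiet` flag are hypothetical stand-ins, and note that the router still appends `--model`, `--port`, and `--host` to spawn commands (see the spawn notes later in this document), so the backend must accept those flags:

```json
{
  "name": "my-custom-model",
  "path": "/models/my-custom-model",
  "state": "manual",
  "spawn": {
    "command": ["my-openai-server", "--quiet"],
    "proxy_endpoints": ["/v1/"],
    "health_endpoint": "/health"
  }
}
```
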
### Future: WebUI Administration (TODO)

The admin API endpoints (`/admin/reload`, `/admin/rescan`) are designed to support hot configuration and model management. A future WebUI will enable:

- **Live model downloads** from Hugging Face directly through the interface
- **Hot reconfiguration** of `default_spawn` and per-model settings without restart
- **Real-time monitoring** of running instances and resource usage
- **Interactive model management** (add, remove, customize spawn parameters)

This aligns with the project philosophy: **everything configurable at runtime, zero downtime required**. The current CLI and JSON-based workflow is production-ready; the WebUI will provide a more accessible interface to the same underlying admin API.

---

## Quick Start

Simply launch the router:

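A minimal sketch, assuming the binary is invoked as `llama-router` (the `--config` flag is covered under Configuration below):

```sh
# Launch with defaults; the config is created on first run
llama-router

# Or point at an explicit configuration file
llama-router --config /path/to/router-config.json
```
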
On first run, it will:

1. Create a default configuration at `~/.config/llama.cpp/router-config.json`
2. Scan the Hugging Face cache (`~/.cache/llama.cpp/`) for GGUF models
3. Add discovered models in the `auto` state, inheriting the `default_spawn` configuration
4. Start listening on `127.0.0.1:8082`

On every subsequent startup:

- An automatic rescan updates the model list (adds new models, removes deleted cache files); the same rescan can be triggered at runtime, as shown below
- All `auto` models inherit the current `default_spawn` settings
- `manual` models preserve their custom configurations

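The runtime equivalent uses the admin API documented later in this README:

```sh
# Trigger the same cache rescan without restarting the router
curl http://localhost:8082/admin/rescan
```
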
---

The router tracks each model's origin through a `state` field, which controls behavior during rescans:

### Important: On-Demand Spawning

**All models spawn only when first requested via the API.** The router never starts backends at boot. The `auto`/`manual` state controls only rescan behavior:

- `auto`: Managed by the cache scanner, inherits `default_spawn`
- `manual`: Protected from rescans, can have a custom `spawn` configuration

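For example, the first chat request to a model spawns its backend, waits for the health check, and then proxies the request (model name taken from the configuration example below):

```sh
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-8B-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```
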
### `auto` State

Models discovered automatically from the Hugging Face cache are marked as `"state": "auto"`. These models:

- Are added when first discovered in the cache
- Are **removed automatically** if the cached file disappears (e.g., cache cleanup)
- Are re-added if the file reappears
- **Inherit the `default_spawn` configuration** - change `default_spawn` to optimize all `auto` models at once

This enables seamless synchronization with `huggingface-cli` downloads and cache management.

### `manual` State

Models added via `--import-dir` or edited by hand in the config are marked as `"state": "manual"`. These models:

- Are **never automatically removed**, even if the file path becomes invalid
- Must be manually deleted from the configuration
- Survive rescans and configuration reloads
- **Can have custom `spawn` configurations** that override `default_spawn`

**Use cases for manual state:**

- Models on network storage that may be temporarily unavailable
- Fine-tuned models in development directories
- Models you want to persist regardless of file system changes

---

## Configuration

Override the default config location with `--config`. A `models` entry looks like this:

```json
"models": [
  {
    "name": "Qwen3-8B-Q4_K_M.gguf",
    "path": "/home/user/.cache/llama.cpp/bartowski_Qwen3-8B-GGUF_Qwen3-8B-Q4_K_M.gguf",
    "state": "auto",
    "group": ""
  }
]
```

The router automatically appends these arguments:

- `--model <path>` - The configured model path
- `--port <port>` - Dynamically assigned port
- `--host 127.0.0.1` - Localhost binding for security

### Optimizing for Your Hardware

The `default_spawn` is where you tune performance for your specific hardware. **All `auto` models inherit these settings**, so you can optimize once for your entire collection:

```json
{
  "default_spawn": {
    "command": [
      "llama-server",
      "-ngl", "999",
      "-ctk", "q8_0",
      "-ctv", "q8_0",
      "-fa", "on",
      "--mlock",
      "-np", "4",
      "-kvu",
      "--jinja"
    ],
    "proxy_endpoints": ["/v1/", "/health", "/slots", "/props"],
    "health_endpoint": "/health"
  }
}
```

**Common optimizations:**

- `-ngl 999`: Offload all layers to GPU
- `-ctk q8_0 -ctv q8_0`: Quantize KV cache to Q8 for lower VRAM usage
- `-fa on`: Enable Flash Attention
- `--mlock`: Lock model in RAM to prevent swapping
- `-np 4`: Process 4 prompts in parallel
- `-kvu`: Use single unified KV buffer for all sequences (also `--kv-unified`)
- `--jinja`: Enable Jinja template support

**Note:** The router automatically appends `--model`, `--port`, and `--host` - do not include these in your command.

Change `default_spawn`, reload the router, and all `auto` models instantly use the new configuration.

### Per-Model Spawn Override

Individual models can override the default spawn configuration.

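For illustration, an override entry might look like the following sketch; the model name, flags, and values are placeholders that follow the same schema as `default_spawn` above:

```json
{
  "name": "llama-70b-q4",
  "path": "/path/to/llama-70b-q4.gguf",
  "state": "manual",
  "spawn": {
    "command": ["llama-server", "-ngl", "40", "-c", "8192"],
    "proxy_endpoints": ["/v1/", "/health"],
    "health_endpoint": "/health"
  }
}
```

Keeping such models in the `manual` state prevents a rescan from regenerating their spawn configuration, per the state rules above.
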
### Model Groups

**Default Behavior: Single Model at a Time**

llama-router is designed for resource-constrained environments (small GPUs, consumer hardware). By default, **only ONE model runs at a time** - when you request a different model, the current one is stopped first. This ensures reliable operation on systems with limited VRAM.

To allow multiple models to run simultaneously, assign the **same group** to models that can coexist:

```json
{
  "models": [
    {
      "name": "qwen3-8b-q4",
      "path": "/path/to/qwen3-8b-q4.gguf",
      "group": "small-models"
    },
    {
      "name": "qwen3-8b-q8",
      "path": "/path/to/qwen3-8b-q8.gguf",
      "group": "small-models"
    },
    {
      "name": "llama-70b-q4",
      "path": "/path/to/llama-70b-q4.gguf",
      "group": "large-model"
    }
  ]
}
```

Behavior:

- Requesting `qwen3-8b-q4` while `qwen3-8b-q8` is running: **no restart** (same group)
- Requesting `llama-70b-q4` while `qwen3-8b-q4` is running: **stops qwen3, starts llama** (different group)

**Omitting the `group` field creates an exclusive singleton per model** - each model stops all others before starting.

---

## Architecture

### File Structure & Separation of Concerns

| Component | Files | Responsibility |
|-----------|-------|----------------|
| **Core** | `router-app.cpp/h` | Model lifecycle, spawn orchestration, group logic (zero I/O) |
| **HTTP Endpoints** | `router-endpoints.cpp/h` | Public API routes (`/v1/models`, `/v1/chat/completions`) |
| **Admin** | `router-admin.cpp/h` | Admin routes with explicit config persistence |
| **Proxy** | `router-proxy.cpp/h` | HTTP forwarding, SSE streaming, header management |
| **Process** | `router-process.cpp/h` | Cross-platform subprocess spawning, I/O capture |
| **Config** | `router-config.cpp/h` | JSON load/write, rescan logic, `RescanResult` |
| **Scanner** | `router-scanner.cpp/h` | HF cache discovery, `--import-dir`, mmproj detection |
| **Main** | `router.cpp` | CLI parsing, server setup, signal handlers |
| **Utils** | `logging.cpp/h`, `router-constants.h` | Shared logging and constants |

**Design principles enforced:**

- `router-app`: Pure business logic, no filesystem I/O
- `router-admin`: Owns config persistence, explicit writes only
- `router-proxy`: Streaming & forwarding, value-captured lambdas to avoid use-after-free
- `router-process`: Platform abstraction, child processes never call parent logging functions

---

### System Architecture

(The ASCII diagram of the llama-router component layout is truncated in this diff view.)

---

## Technical Notes

### Cross-Platform Process Management

The router handles subprocess spawning differently per platform:

**Linux/macOS:** Uses `fork()` + `execvp()` with careful attention to post-fork behavior. Child processes **must not** call logging functions that access parent singletons - they write directly to `STDERR_FILENO` instead to avoid use-after-fork crashes.

**Windows:** Uses `CreateProcess()` with separate process information structures and handle management.

### SSE Streaming Implementation

Server-Sent Events streaming required careful lifetime management to avoid use-after-free bugs:

1. **Capture by value**: Lambda captures must copy request data (headers, path, body), not reference stack variables that become invalid after the handler returns
2. **Explicit termination**: Call `sink.done()` followed by `return false` to signal httplib to close the connection properly - without this, streams deliver tokens correctly but never terminate

### PATH Binary Resolution

Spawn commands support both absolute/relative paths and PATH-based binaries:

- **Paths with separators**: `/usr/bin/llama-server`, `./llama-server`, `C:\llama\server.exe` - existence validated before spawn
- **PATH binaries**: `python`, `vllm`, `ollama`, `llama-server` - no validation, relies on shell PATH resolution

The router only validates file existence for commands containing `/` or `\\` path separators, allowing seamless use of system-installed binaries.

### Model-Scoped Route Stripping

Routes like `/<model>/health` are router-side aliases for convenience. Before proxying to the backend, the router strips the model prefix:

- User request: `GET /Qwen3-8B-Q4_K_M.gguf/health`
- Forwarded to backend: `GET /health`

Backends remain unaware of model-scoped routing - they expose standard endpoints like `/health`, `/v1/chat/completions`, etc.

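For example, using the model name from the configuration example above:

```sh
# Router-side alias; the model prefix is stripped before proxying
curl http://localhost:8082/Qwen3-8B-Q4_K_M.gguf/health
# The backend itself receives: GET /health
```
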
### HTTP Header Management

The router strips `Content-Length` and `Transfer-Encoding` headers before forwarding requests. This is standard reverse-proxy behavior to handle chunked requests/responses properly and avoid conflicts when the proxy re-chunks data.

All other headers are forwarded transparently to preserve client context (authentication, user-agent, etc.).

### Health Endpoint Purpose

The `health_endpoint` configuration field serves **spawn readiness polling only** - the router uses it to detect when a backend has finished loading and is ready to serve requests.

This is separate from user-facing health routes. Clients can still call `/<model>/health` or `/health` for their own monitoring needs. The backend must expose standard endpoints regardless of what `health_endpoint` is configured for polling.

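Because polling is configurable per backend, a backend that lacks `/health` could be polled on a different route. A hypothetical sketch, where both `my-openai-server` and the choice of `/v1/models` as the readiness probe are assumptions:

```json
"spawn": {
  "command": ["my-openai-server"],
  "proxy_endpoints": ["/v1/"],
  "health_endpoint": "/v1/models"
}
```
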
### Multimodal Projector Priority

When importing collections with `--import-dir`, mmproj files are automatically detected with this search priority:

1. `*-bf16.gguf` (selected first)
2. `*-f16.gguf` (selected if BF16 not found)
3. `*-f32.gguf` (selected if neither BF16 nor F16 found)

All quantization variants of a model (Q4_K_M, Q5_K_M, Q6_K, etc.) found in the same directory share the same mmproj file.

**For manual models:** mmproj auto-detection applies only during initial import. You can edit `spawn.command` to remove `--mmproj` if unwanted - your changes persist across restarts. Only `auto` models get their spawn configuration regenerated on rescan.

### Manifest Robustness

The HF cache scanner gracefully handles missing or corrupted manifest files:

- If `~/.cache/llama.cpp/` doesn't exist, the scanner returns an empty mapping
- If individual manifest files are missing, they're silently skipped
- Models without manifest entries load successfully, just without mmproj auto-detection

**Cache structure example:**

```
~/.cache/llama.cpp/
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf.etag
├── manifest=bartowski=Qwen2.5-1.5B-Instruct-GGUF=latest.json
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf.etag
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf.etag
└── manifest=unsloth=Qwen3-VL-4B-Instruct-GGUF=Q6_K.json
```

Manifest files (`manifest=vendor=repo=quant.json`) use `=` separators and contain the metadata for mmproj auto-detection; model files use underscore separators (`vendor_repo_filename.gguf`).

This ensures the router remains operational even with incomplete cache metadata.

---

## Signals and Shutdown

The router handles graceful shutdown on:
