The router automatically detects the llama-server binary location relative to its own executable, falling back to PATH resolution if not found.
## Design Philosophy

llama-router follows KISS (Keep It Simple, Stupid) principles:

- **Minimal configuration**: Works out of the box with HF cache scanning
- **Explicit persistence**: Config changes are written explicitly via admin endpoints, never hidden in business logic
- **Separation of concerns**: Core routing logic (`RouterApp`) has zero I/O; persistence is handled by the admin layer
- **Simple endpoint matching**: Prefix-based matching, no complex regex
- **Transparent proxy**: Headers and streaming forwarded as-is
- **On-demand only**: No models start at boot; everything spawns when first requested
### The auto + default_spawn Workflow

Models discovered from the HuggingFace cache are marked as `auto` and inherit the `default_spawn` configuration. This creates a powerful optimization pattern:

1. **Tune `default_spawn` once** with your preferred parameters (GPU layers, KV cache quantization, context size, etc.)
2. **All `auto` models automatically use these settings** - no per-model configuration needed
3. **Change `default_spawn` and reload** - all `auto` models instantly updated
4. **Customize individual models** by switching to `manual` state first to prevent rescan overwrites

This ensures consistent, optimized behavior across your entire model collection while allowing per-model overrides when needed. **Always set models to `manual` before customizing their spawn parameters** - otherwise your changes will be lost on the next rescan.
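As an illustrative sketch (the field layout here is an assumption and may differ from your generated `router-config.json`), a tuned `default_spawn` together with an `auto` model discovered from the cache could look like:

```json
{
  "default_spawn": {
    "command": "llama-server -ngl 99 -c 16384 --jinja"
  },
  "models": [
    {
      "name": "Qwen3-8B-Q4_K_M.gguf",
      "path": "~/.cache/llama.cpp/Qwen3-8B-Q4_K_M.gguf",
      "state": "auto"
    }
  ]
}
```

Editing the `command` above and reloading applies the new flags to this model (and every other `auto` model) on its next spawn.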
## Multi-Engine Support

llama-router is engine-agnostic. Any OpenAI-compatible inference backend can be orchestrated by configuring the appropriate spawn command and endpoints. The router simply:

1. Spawns the command specified in `spawn.command`
2. Polls `health_endpoint` until it returns HTTP 200 (customizable per backend)
3. Proxies requests matching `proxy_endpoints` to the running instance

This design allows you to mix llama.cpp, vLLM, Ollama, Text Generation Inference, or any custom backend in a single router configuration. Set models to `manual` state when using non-llama.cpp backends to prevent automatic cache rescans from removing them.
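As a sketch, a vLLM-backed model entry might look like the following. The placement of `health_endpoint` and `proxy_endpoints` inside `spawn`, the `path` value, and the exact vLLM invocation are assumptions for illustration; the router still appends `--model`, `--port`, and `--host`, which this vLLM entrypoint happens to accept.

```json
{
  "name": "qwen3-8b-vllm",
  "path": "Qwen/Qwen3-8B",
  "state": "manual",
  "spawn": {
    "command": "python -m vllm.entrypoints.openai.api_server",
    "health_endpoint": "/health",
    "proxy_endpoints": ["/v1/chat/completions", "/v1/completions", "/v1/models"]
  }
}
```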
### Future: WebUI Administration (TODO)

The admin API endpoints (`/admin/reload`, `/admin/rescan`) are designed to support hot configuration and model management. A future WebUI will enable:

- **Live model downloads** from HuggingFace directly through the interface
- **Hot reconfiguration** of `default_spawn` and per-model settings without restart
- **Real-time monitoring** of running instances and resource usage
- **Interactive model management** (add, remove, customize spawn parameters)

This aligns with the project philosophy: **everything configurable at runtime, zero downtime required**. The current CLI and JSON-based workflow is production-ready; the WebUI will provide a more accessible interface to the same underlying admin API.

---

## Quick Start
Simply launch the router. On first run, it will:

1. Create a default configuration at `~/.config/llama.cpp/router-config.json`
2. Scan the Hugging Face cache (`~/.cache/llama.cpp/`) for GGUF models
3. Add discovered models in the `auto` state, inheriting the `default_spawn` configuration
4. Start listening on `127.0.0.1:8082`

On every subsequent startup:

- Automatic rescan updates the model list (adds new, removes deleted cache files)
- All `auto` models inherit the current `default_spawn` settings
- `manual` models preserve their custom configurations

---

The router tracks each model's origin through a `state` field, which controls behavior during rescans:
### Important: On-Demand Spawning

**All models spawn only when first requested via the API.** The router never starts backends at boot. The `auto`/`manual` state controls only rescan behavior:

- `auto`: Managed by cache scanner, inherits `default_spawn`
- `manual`: Protected from rescans, can have custom `spawn` configuration
### `auto` State

Models discovered automatically from the Hugging Face cache are marked as `"state": "auto"`. These models:

- Are added when first discovered in the cache
- Are **removed automatically** if the cached file disappears (e.g., cache cleanup)
- Are re-added if the file reappears
- **Inherit `default_spawn` configuration** - change `default_spawn` to optimize all `auto` models at once

This enables seamless synchronization with `huggingface-cli` downloads and cache management.
### `manual` State

Models added via `--import-dir` or edited by hand in the config are marked as `"state": "manual"`. These models:

- Are **never automatically removed**, even if the file path becomes invalid
- Must be manually deleted from the configuration
- Survive rescans and configuration reloads
- **Can have custom `spawn` configurations** that override `default_spawn`

**Use cases for manual state:**

- Models on network storage that may be temporarily unavailable
- Fine-tuned models in development directories
- Models you want to persist regardless of file system changes
### Changing State

Edit the configuration file directly to change a model's state:

```json
{
  "name": "my-model.gguf",
  "path": "/path/to/my-model.gguf",
  "state": "manual"
}
```

Or set to `"auto"` if you want the router to manage its lifecycle.

---

The router automatically appends these arguments:

- `--port <port>` - Dynamically assigned port
- `--host 127.0.0.1` - Localhost binding for security
### Optimizing for Your Hardware

The `default_spawn` is where you tune performance for your specific hardware. **All `auto` models inherit these settings**, so you can optimize once for your entire collection. Useful llama-server flags include (see the sketch after this list):

- `-ctk q8_0 -ctv q8_0`: Quantize KV cache to Q8 for lower VRAM usage
- `-fa on`: Enable Flash Attention
- `--mlock`: Lock model in RAM to prevent swapping
- `-np 4`: Process 4 prompts in parallel
- `-kvu`: Use single unified KV buffer for all sequences (also `--kv-unified`)
- `--jinja`: Enable Jinja template support

**Note:** The router automatically appends `--model`, `--port`, and `--host` - do not include these in your command.

Change `default_spawn`, reload the router, and all `auto` models instantly use the new configuration.
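A minimal sketch of such a tuned `default_spawn`, assuming it holds a single command string and adding illustrative values for GPU layers (`-ngl`) and context size (`-c`):

```json
{
  "default_spawn": {
    "command": "llama-server -ngl 99 -c 16384 -ctk q8_0 -ctv q8_0 -fa on --mlock -np 4 -kvu --jinja"
  }
}
```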
### Per-Model Spawn Override

Individual models can override the default spawn configuration:
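For example (an illustrative sketch; the exact schema may differ), a `manual` model carrying its own `spawn` command:

```json
{
  "name": "llama-70b-q4",
  "path": "/path/to/llama-70b-q4.gguf",
  "state": "manual",
  "spawn": {
    "command": "llama-server -ngl 40 -c 4096"
  }
}
```

Set the model to `manual` first so a rescan does not regenerate its spawn configuration.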
### Model Groups

**Default Behavior: Single Model at a Time**

llama-router is designed for resource-constrained environments (small GPUs, consumer hardware). By default, **only ONE model runs at a time** - when you request a different model, the current one is stopped first. This ensures reliable operation on systems with limited VRAM.

To allow multiple models to run simultaneously, assign the **same group** to models that can coexist:
```json
{
  "models": [
    {
      "name": "qwen3-8b-q4",
      "path": "/path/to/qwen3-8b-q4.gguf",
      "group": "small-models"
    },
    {
      "name": "qwen3-8b-q8",
      "path": "/path/to/qwen3-8b-q8.gguf",
      "group": "small-models"
    },
    {
      "name": "llama-70b-q4",
      "path": "/path/to/llama-70b-q4.gguf",
      "group": "large-model"
    }
  ]
}
```
Behavior:

- Requesting `qwen3-8b-q4` while `qwen3-8b-q8` is running: **no restart** (same group)
- Requesting `llama-70b-q4` while `qwen3-8b-q4` is running: **stops qwen3, starts llama** (different group)

**Omitting the `group` field creates an exclusive singleton per model** - each model stops all others before starting.
---

1. **Capture by value**: Lambda captures must copy request data (headers, path, body), not reference stack variables that become invalid after the handler returns
2. **Explicit termination**: Call `sink.done()` followed by `return false` to signal httplib to close the connection properly - without this, streams deliver tokens correctly but never terminate
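A minimal sketch of this pattern with cpp-httplib (handler names and the upstream-forwarding step are illustrative, not the router's actual code):

```cpp
#include <httplib.h>
#include <string>

void register_chat_proxy(httplib::Server &svr) {
    svr.Post("/v1/chat/completions", [](const httplib::Request &req, httplib::Response &res) {
        // Lesson 1: capture by value - copy what the stream needs before the handler returns.
        std::string path = req.path;
        std::string body = req.body;

        res.set_chunked_content_provider(
            "text/event-stream",
            [path, body](size_t /*offset*/, httplib::DataSink &sink) {
                // ... forward `path`/`body` upstream and relay chunks via sink.write(...) ...
                // Lesson 2: explicit termination - finish the stream, then stop the provider.
                sink.done();
                return false;
            });
    });
}
```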
### PATH Binary Resolution

Spawn commands support both absolute/relative paths and PATH-based binaries:

- **Paths with separators**: `/usr/bin/llama-server`, `./llama-server`, `C:\llama\server.exe` - existence validated before spawn
- **PATH binaries**: `python`, `vllm`, `ollama`, `llama-server` - no validation, relies on shell PATH resolution

The router only validates file existence for commands containing `/` or `\` path separators, allowing seamless use of system-installed binaries.
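A rough sketch of that check (hypothetical helper, not the router's actual implementation):

```cpp
#include <filesystem>
#include <string>

// Commands that look like paths get an existence check; bare names
// (e.g. "python", "llama-server") are left to shell PATH resolution.
bool spawn_precheck(const std::string &binary) {
    const bool looks_like_path = binary.find('/') != std::string::npos ||
                                 binary.find('\\') != std::string::npos;
    if (!looks_like_path) return true;          // defer to PATH
    return std::filesystem::exists(binary);     // e.g. "./llama-server"
}
```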
### Model-Scoped Route Stripping

Routes like `/<model>/health` are router-side aliases for convenience. Before proxying to the backend, the router strips the model prefix:

- User request: `GET /Qwen3-8B-Q4_K_M.gguf/health`
- Forwarded to backend: `GET /health`

Backends remain unaware of model-scoped routing - they expose standard endpoints like `/health`, `/v1/chat/completions`, etc.
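Sketched as a standalone helper (hypothetical name; the real routing code differs):

```cpp
#include <string>

// "/Qwen3-8B-Q4_K_M.gguf/health" with model "Qwen3-8B-Q4_K_M.gguf" -> "/health"
std::string strip_model_prefix(const std::string &path, const std::string &model) {
    const std::string prefix = "/" + model;
    if (path.compare(0, prefix.size(), prefix) == 0) {
        std::string rest = path.substr(prefix.size());
        return rest.empty() ? "/" : rest;
    }
    return path;  // not a model-scoped route; forward unchanged
}
```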
### HTTP Header Management

The router strips `Content-Length` and `Transfer-Encoding` headers before forwarding requests. This is standard reverse-proxy behavior to handle chunked requests/responses properly and avoid conflicts when the proxy re-chunks data.

All other headers are forwarded transparently to preserve client context (authentication, user-agent, etc.).
### Health Endpoint Purpose

The `health_endpoint` configuration field serves **spawn readiness polling only** - the router uses it to detect when a backend has finished loading and is ready to serve requests.

This is separate from user-facing health routes. Clients can still call `/<model>/health` or `/health` for their own monitoring needs. The backend must expose standard endpoints regardless of what `health_endpoint` is configured for polling.
### Multimodal Projector Priority

When importing collections with `--import-dir`, mmproj files are automatically detected with this search priority:

1. `*-bf16.gguf` (selected first)
2. `*-f16.gguf` (selected if BF16 not found)
3. `*-f32.gguf` (selected if neither BF16 nor F16 found)

All quantization variants of a model (Q4_K_M, Q5_K_M, Q6_K, etc.) found in the same directory share the same mmproj file.
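That priority could be expressed roughly as follows (hypothetical helper, C++20, not the importer's actual code):

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

// Pick an mmproj candidate by precision priority: bf16, then f16, then f32.
std::optional<std::filesystem::path>
pick_mmproj(const std::vector<std::filesystem::path> &candidates) {
    for (const std::string suffix : {"-bf16.gguf", "-f16.gguf", "-f32.gguf"}) {
        for (const auto &p : candidates) {
            if (p.filename().string().ends_with(suffix)) return p;
        }
    }
    return std::nullopt;  // no mmproj found; import proceeds without one
}
```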
**For manual models:** mmproj auto-detection applies only during initial import. You can edit `spawn.command` to remove `--mmproj` if unwanted - your changes persist across restarts. Only `auto` models get their spawn configuration regenerated on rescan.
### Manifest Robustness

The HF cache scanner gracefully handles missing or corrupted manifest files:

- If `~/.cache/llama.cpp/` doesn't exist, the scanner returns an empty mapping
- If individual manifest files are missing, they're silently skipped
- Models without manifest entries load successfully, just without mmproj auto-detection