Commit 109f11f

llama-router: separate quick-start guide from technical architecture docs

1 parent f5af0c3 commit 109f11f

File tree: 2 files changed, +235 −214 lines

tools/router/ARCHITECTURE.md

Lines changed: 229 additions & 0 deletions
# llama-router Architecture

Technical documentation for developers and contributors.

---

## Design Philosophy

llama-router follows KISS (Keep It Simple, Stupid) principles:

- **Minimal configuration**: Works out of the box with HF cache scanning
- **Explicit persistence**: Config changes are written explicitly via admin endpoints, never hidden in business logic
- **Separation of concerns**: Core routing logic (`RouterApp`) has zero I/O; persistence is handled by the admin layer
- **Simple endpoint matching**: Prefix-based matching, no complex regex
- **Transparent proxy**: Headers and streaming forwarded as-is
- **On-demand only**: No models start at boot, everything spawns when first requested

### The auto + default_spawn Workflow

Models discovered from the HuggingFace cache are marked as `auto` and inherit the `default_spawn` configuration. This creates a powerful optimization pattern:

1. **Tune `default_spawn` once** with your preferred parameters (GPU layers, KV cache quantization, context size, etc.)
2. **All `auto` models automatically use these settings** - no per-model configuration needed
3. **Change `default_spawn` and reload** - all `auto` models are instantly updated
4. **Customize individual models** by switching them to the `manual` state first to prevent rescan overwrites

This ensures consistent, optimized behavior across your entire model collection while still allowing per-model overrides when needed. **Always set models to `manual` before customizing their spawn parameters** - otherwise your changes will be lost on the next rescan.
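
A minimal config sketch of this pattern. Only `default_spawn`, the `auto`/`manual` states, and `spawn.command` are documented here; the surrounding JSON shape, the `models`/`args` keys, and the `{model_path}`/`{port}` placeholders are illustrative assumptions, not the router's actual schema:

```json
{
  "default_spawn": {
    "command": "llama-server",
    "args": ["-m", "{model_path}", "--port", "{port}", "-ngl", "99", "-c", "16384"]
  },
  "models": {
    "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf": {
      "state": "auto"
    },
    "MyCustomModel-Q6_K.gguf": {
      "state": "manual",
      "spawn": {
        "command": "llama-server",
        "args": ["-m", "{model_path}", "--port", "{port}", "-c", "32768"]
      }
    }
  }
}
```

In this sketch only the `manual` entry survives a rescan with its custom context size intact; the `auto` entry is regenerated from `default_spawn` each time.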

## Multi-Engine Support

llama-router is engine-agnostic. Any OpenAI-compatible inference backend can be orchestrated by configuring the appropriate spawn command and endpoints. The router simply:

1. Spawns the command specified in `spawn.command`
2. Polls `health_endpoint` until it returns HTTP 200 (customizable per backend)
3. Proxies requests matching `proxy_endpoints` to the running instance

This design allows you to mix llama.cpp, vLLM, Ollama, Text Generation Inference, or any custom backend in a single router configuration, as sketched below. Set models to the `manual` state when using non-llama.cpp backends to prevent automatic cache rescans from removing them.
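
A hedged sketch of a mixed llama.cpp/vLLM configuration. The `spawn.command`, `health_endpoint`, and `proxy_endpoints` fields come from the steps above; the enclosing JSON structure and the `models`/`args` keys are illustrative assumptions (the `vllm serve <model> --port <n>` invocation is standard vLLM CLI):

```json
{
  "models": {
    "Qwen3-8B-Q4_K_M.gguf": {
      "state": "auto"
    },
    "Llama-3.1-8B-Instruct-vllm": {
      "state": "manual",
      "spawn": {
        "command": "vllm",
        "args": ["serve", "meta-llama/Llama-3.1-8B-Instruct", "--port", "{port}"]
      },
      "health_endpoint": "/health",
      "proxy_endpoints": ["/v1/chat/completions", "/v1/completions", "/v1/models"]
    }
  }
}
```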

### Future: WebUI Administration (TODO)

The admin API endpoints (`/admin/reload`, `/admin/rescan`) are designed to support hot configuration and model management. A future WebUI will enable:

- **Live model downloads** from HuggingFace directly through the interface
- **Hot reconfiguration** of `default_spawn` and per-model settings without restart
- **Real-time monitoring** of running instances and resource usage
- **Interactive model management** (add, remove, customize spawn parameters)

This aligns with the project philosophy: **everything configurable at runtime, zero downtime required**. The current CLI and JSON-based workflow is production-ready; the WebUI will provide a more accessible interface to the same underlying admin API.

---

## Architecture

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        llama-router                         │
│                         (port 8082)                         │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Config    │  │   Scanner   │  │  Process Manager    │  │
│  │   Loader    │  │  (HF cache) │  │  (spawn/terminate)  │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                           │                                 │
│  ┌────────────────────────┴─────────────────────────────┐   │
│  │                      HTTP Proxy                      │   │
│  │        (streaming support, header forwarding)        │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────┬──────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
 ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
 │ llama-server  │  │ llama-server  │  │ llama-server  │
 │ (port 50000)  │  │ (port 50001)  │  │ (port 50002)  │
 │   Model A     │  │   Model B     │  │   Model C     │
 └───────────────┘  └───────────────┘  └───────────────┘
```

### Request Flow

1. Client sends a POST to `/v1/chat/completions` with `"model": "ModelA"`
2. Router checks whether ModelA is already running
3. If it is not running, or if a conflicting group is active (see the sketch after this list):
   - Terminate conflicting backends
   - Spawn a new llama-server with an assigned port
   - Poll `/health` until ready (10s timeout)
4. Forward the request to the backend, streaming the response back to the client
5. The backend remains running for subsequent requests
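
A compilable sketch of steps 2-3, under the assumption that a "conflicting group" means two backends sharing a group ID that may not run concurrently. All names (`Router`, `Backend`, `ensure_running`) are hypothetical, not the actual `RouterApp` API:

```cpp
#include <map>
#include <string>

enum class State { Stopped, Running };

struct Backend {
    State state = State::Stopped;
    int   group = 0;   // same group => may not run concurrently (assumed semantics)
    int   port  = 0;
};

struct Router {
    std::map<std::string, Backend> backends;

    void terminate(Backend &be)   { be.state = State::Stopped; /* SIGTERM -> 1s -> SIGKILL */ }
    void spawn(Backend &be)       { be.state = State::Running; /* fork/exec, assign port  */ }
    void poll_health(Backend &be) { (void)be; /* GET health_endpoint, 200ms steps, 10s cap */ }

    // Steps 2-3: reuse a running backend, otherwise clear the conflicting
    // group and spawn on demand before the request is proxied.
    Backend &ensure_running(const std::string &model) {
        Backend &be = backends.at(model);
        if (be.state == State::Running) return be;
        for (auto &entry : backends) {
            Backend &other = entry.second;
            if (other.state == State::Running && other.group == be.group) {
                terminate(other);
            }
        }
        spawn(be);
        poll_health(be);
        return be;
    }
};
```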

### Process Lifecycle

- **Spawn**: `fork()`/`CreateProcess()` with stdout/stderr capture
- **Health polling**: 200ms intervals, 10s timeout
- **Graceful shutdown**: SIGTERM → 1s wait → SIGKILL (sketched below)
- **Cleanup**: File descriptors closed, `waitpid()` called
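
A minimal POSIX sketch of the graceful-shutdown step, assuming 100ms polling slices within the 1s budget; the actual cleanup also closes the captured stdout/stderr descriptors:

```cpp
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

void terminate_backend(pid_t pid) {
    kill(pid, SIGTERM);                                     // ask politely first
    for (int i = 0; i < 10; ++i) {                          // up to ~1s, in 100ms steps
        if (waitpid(pid, nullptr, WNOHANG) == pid) return;  // child already exited
        usleep(100 * 1000);
    }
    kill(pid, SIGKILL);                                     // force-kill a straggler
    waitpid(pid, nullptr, 0);                               // reap to avoid a zombie
}
```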

---

## File Structure & Separation of Concerns

| Component | Files | Responsibility |
|-----------|-------|----------------|
| **Core** | `router-app.cpp/h` | Model lifecycle, spawn orchestration, group logic (zero I/O) |
| **HTTP Endpoints** | `router-endpoints.cpp/h` | Public API routes (`/v1/models`, `/v1/chat/completions`) |
| **Admin** | `router-admin.cpp/h` | Admin routes with explicit config persistence |
| **Proxy** | `router-proxy.cpp/h` | HTTP forwarding, SSE streaming, header management |
| **Process** | `router-process.cpp/h` | Cross-platform subprocess spawning, I/O capture |
| **Config** | `router-config.cpp/h` | JSON load/write, rescan logic, `RescanResult` |
| **Scanner** | `router-scanner.cpp/h` | HF cache discovery, `--import-dir`, mmproj detection |
| **Main** | `router.cpp` | CLI parsing, server setup, signal handlers |
| **Utils** | `logging.cpp/h`, `router-constants.h` | Shared logging and constants |

**Design principles enforced:**
- `router-app`: Pure business logic, no filesystem I/O
- `router-admin`: Owns config persistence, explicit writes only
- `router-proxy`: Streaming & forwarding, value-captured lambdas to avoid use-after-free
- `router-process`: Platform abstraction, child processes never call parent logging functions

---

## Technical Notes

### Cross-Platform Process Management

The router handles subprocess spawning differently per platform:

**Linux/macOS:** Uses `fork()` + `execvp()` with careful attention to post-fork behavior. Child processes **must not** call logging functions that access parent singletons - they write directly to `STDERR_FILENO` instead to avoid use-after-fork crashes.

**Windows:** Uses `CreateProcess()` with separate process information structures and handle management.
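
A sketch of the Linux/macOS path, illustrating the post-fork rule: after `fork()`, the child uses only async-signal-safe calls (a raw `write()` to `STDERR_FILENO`) and `_exit()`, never the parent's logging machinery. Simplified; the real spawner also wires up the stdout/stderr capture pipes:

```cpp
#include <unistd.h>

pid_t spawn_backend(char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {                                  // child
        execvp(argv[0], argv);                       // returns only on failure
        const char msg[] = "spawn: execvp failed\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);  // async-signal-safe, no logger
        _exit(127);                                  // skip parent atexit handlers
    }
    return pid;                                      // parent: child pid, or -1 on failure
}
```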

### SSE Streaming Implementation

Server-Sent Events streaming required careful lifetime management to avoid use-after-free bugs:

1. **Capture by value**: Lambda captures must copy request data (headers, path, body), not reference stack variables that become invalid after the handler returns
2. **Explicit termination**: Call `sink.done()` followed by `return false` to signal httplib to close the connection properly - without this, streams deliver tokens correctly but never terminate
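
Both rules in a compilable cpp-httplib sketch. The canned event stands in for tokens relayed from a backend; the route registration is illustrative, not the router's actual proxy code:

```cpp
#include "httplib.h"
#include <memory>
#include <string>

void register_stream_route(httplib::Server &srv) {
    srv.Post("/v1/chat/completions", [](const httplib::Request &req, httplib::Response &res) {
        // Rule 1: copy request data now - `req` is dead once this handler returns,
        // but the provider lambda below runs long after that.
        auto body = std::make_shared<std::string>(req.body);
        auto sent = std::make_shared<bool>(false);

        res.set_chunked_content_provider(
            "text/event-stream",
            [body, sent](size_t /*offset*/, httplib::DataSink &sink) {
                if (!*sent) {
                    std::string ev = "data: {\"prompt_bytes\":" +
                                     std::to_string(body->size()) + "}\n\n";
                    sink.write(ev.data(), ev.size());
                    *sent = true;
                    return true;   // more chunks to come
                }
                sink.done();       // Rule 2: flush and close the stream...
                return false;      // ...then tell httplib we are finished
            });
    });
}
```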

### PATH Binary Resolution

Spawn commands support both absolute/relative paths and PATH-based binaries:

- **Paths with separators**: `/usr/bin/llama-server`, `./llama-server`, `C:\llama\server.exe` - existence validated before spawn
- **PATH binaries**: `python`, `vllm`, `ollama`, `llama-server` - no validation, relies on shell PATH resolution

The router only validates file existence for commands containing `/` or `\` path separators, allowing seamless use of system-installed binaries.
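
The validation rule as a small self-contained sketch (the function name is hypothetical):

```cpp
#include <filesystem>
#include <string>

// Only commands that contain a path separator are checked for existence;
// bare names are left for PATH resolution at exec time.
bool validate_spawn_command(const std::string &cmd) {
    const bool has_separator = cmd.find('/')  != std::string::npos ||
                               cmd.find('\\') != std::string::npos;
    if (!has_separator) {
        return true;                      // e.g. "llama-server", "vllm", "python"
    }
    return std::filesystem::exists(cmd);  // e.g. "./llama-server", "/usr/bin/llama-server"
}
```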

### Model-Scoped Route Stripping

Routes like `/<model>/health` are router-side aliases for convenience. Before proxying to the backend, the router strips the model prefix:

- User request: `GET /Qwen3-8B-Q4_K_M.gguf/health`
- Forwarded to backend: `GET /health`

Backends remain unaware of model-scoped routing - they expose standard endpoints like `/health`, `/v1/chat/completions`, etc.
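
A minimal sketch of the rewrite described above (prefix matching only; matching the model name against the router's registry is omitted):

```cpp
#include <string>

// "/Qwen3-8B-Q4_K_M.gguf/health" -> "/health"
std::string strip_model_prefix(const std::string &path, const std::string &model) {
    const std::string prefix = "/" + model;
    if (path.rfind(prefix, 0) == 0) {           // path starts with "/<model>"
        return path.substr(prefix.size());      // keep the backend-facing remainder
    }
    return path;                                // not a model-scoped route
}
```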

### HTTP Header Management

The router strips `Content-Length` and `Transfer-Encoding` headers before forwarding requests. This is standard reverse-proxy behavior to handle chunked requests/responses properly and avoid conflicts when the proxy re-chunks data.

All other headers are forwarded transparently to preserve client context (authentication, user-agent, etc.).
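
The forwarding rule as a sketch. The multimap shape mirrors cpp-httplib's `Headers`; the case-insensitive matching a real proxy needs is omitted for brevity:

```cpp
#include <map>
#include <string>

using Headers = std::multimap<std::string, std::string>;

Headers filter_for_forwarding(const Headers &in) {
    Headers out;
    for (const auto &entry : in) {
        if (entry.first == "Content-Length" || entry.first == "Transfer-Encoding") {
            continue;                          // re-derived when the proxy re-chunks
        }
        out.emplace(entry.first, entry.second);  // auth, user-agent, etc. pass through
    }
    return out;
}
```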

### Health Endpoint Purpose

The `health_endpoint` configuration field serves **spawn readiness polling only** - the router uses it to detect when a backend has finished loading and is ready to serve requests.

This is separate from user-facing health routes. Clients can still call `/<model>/health` or `/health` for their own monitoring needs. The backend must expose standard endpoints regardless of what `health_endpoint` is configured for polling.

### Multimodal Projector Priority

When importing collections with `--import-dir`, mmproj files are automatically detected with this search priority (see the sketch below):

1. `*-bf16.gguf` (selected first)
2. `*-f16.gguf` (selected if BF16 is not found)
3. `*-f32.gguf` (selected if neither BF16 nor F16 is found)

All quantization variants of a model (Q4_K_M, Q5_K_M, Q6_K, etc.) found in the same directory share the same mmproj file.
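
A sketch of the priority search. Case-insensitive suffix matching is assumed (the cache example further below contains `mmproj-F16.gguf`), and manifest-based detection is omitted:

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <optional>
#include <string>

namespace fs = std::filesystem;

static bool ends_with_ci(std::string name, const std::string &suffix) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// BF16 first, then F16, then F32; the first match at a priority level wins.
std::optional<fs::path> find_mmproj(const fs::path &dir) {
    for (const char *suffix : {"-bf16.gguf", "-f16.gguf", "-f32.gguf"}) {
        for (const auto &entry : fs::directory_iterator(dir)) {
            const std::string name = entry.path().filename().string();
            if (name.find("mmproj") != std::string::npos && ends_with_ci(name, suffix)) {
                return entry.path();
            }
        }
    }
    return std::nullopt;   // no projector found; the model imports as text-only
}
```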

**For manual models:** mmproj auto-detection applies only during the initial import. You can edit `spawn.command` to remove `--mmproj` if it is unwanted - your changes persist across restarts. Only `auto` models get their spawn configuration regenerated on rescan.

### Manifest Robustness

The HF cache scanner gracefully handles missing or corrupted manifest files:

- If `~/.cache/llama.cpp/` doesn't exist, the scanner returns an empty mapping
- If individual manifest files are missing, they're silently skipped
- Models without manifest entries load successfully, just without mmproj auto-detection

**Cache structure example:**
```
~/.cache/llama.cpp/
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf.etag
├── manifest=bartowski=Qwen2.5-1.5B-Instruct-GGUF=latest.json
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf.etag
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf.etag
└── manifest=unsloth=Qwen3-VL-4B-Instruct-GGUF=Q6_K.json
```

Manifest files (`manifest=vendor=repo=quant.json`) contain metadata for mmproj auto-detection. Model files use underscore separators instead: `vendor_repo_filename.gguf`.

This ensures the router remains operational even with incomplete cache metadata.
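
A sketch of decoding the manifest naming convention (`manifest=vendor=repo=quant.json`); the struct and function names are hypothetical:

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct ManifestName {
    std::string vendor;   // "bartowski"
    std::string repo;     // "Qwen2.5-1.5B-Instruct-GGUF"
    std::string quant;    // "latest" or "Q6_K"
};

// "manifest=bartowski=Qwen2.5-1.5B-Instruct-GGUF=latest.json"
std::optional<ManifestName> parse_manifest_name(const std::string &filename) {
    std::vector<std::string> parts;
    std::istringstream in(filename);
    for (std::string part; std::getline(in, part, '=');) {
        parts.push_back(part);
    }
    if (parts.size() != 4 || parts[0] != "manifest") {
        return std::nullopt;                    // malformed name: skip silently
    }
    std::string quant = parts[3];
    if (quant.size() > 5 && quant.substr(quant.size() - 5) == ".json") {
        quant.resize(quant.size() - 5);         // drop the ".json" extension
    }
    return ManifestName{parts[1], parts[2], quant};
}
```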

---

## Signals and Shutdown

The router handles graceful shutdown on:
- `SIGINT` (Ctrl+C)
- `SIGTERM`

Shutdown sequence (see the sketch below):
1. Stop accepting new connections
2. Terminate all managed llama-server processes
3. Wait for process cleanup
4. Exit
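
A minimal sketch of wiring this up: the handler only sets a flag (async-signal-safe), and the main loop performs the four steps. How the real router stops its HTTP server is not shown here:

```cpp
#include <atomic>
#include <csignal>

static std::atomic<bool> g_shutdown{false};

static void on_signal(int) { g_shutdown = true; }   // flag only; no work in the handler

int main() {
    std::signal(SIGINT,  on_signal);
    std::signal(SIGTERM, on_signal);

    while (!g_shutdown) {
        // serve requests ...
    }
    // 1. stop accepting new connections
    // 2. terminate each managed llama-server (SIGTERM -> 1s -> SIGKILL)
    // 3. waitpid() until all children are reaped
    // 4. exit
    return 0;
}
```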

---

## Contributing

llama-router is part of the llama.cpp project. Contributions are welcome via pull request.

## License

MIT License - see the llama.cpp repository for details.
