# llama-router Architecture

Technical documentation for developers and contributors.

---

## Design Philosophy

llama-router follows KISS (Keep It Simple, Stupid) principles:

- **Minimal configuration**: Works out of the box with HF cache scanning
- **Explicit persistence**: Config changes are written explicitly via admin endpoints, never hidden in business logic
- **Separation of concerns**: Core routing logic (`RouterApp`) has zero I/O; persistence is handled by the admin layer
- **Simple endpoint matching**: Prefix-based matching, no complex regex
- **Transparent proxy**: Headers and streaming forwarded as-is
- **On-demand only**: No models start at boot; everything spawns when first requested

### The auto + default_spawn Workflow

Models discovered from the HuggingFace cache are marked as `auto` and inherit the `default_spawn` configuration. This creates a powerful optimization pattern:

1. **Tune `default_spawn` once** with your preferred parameters (GPU layers, KV cache quantization, context size, etc.)
2. **All `auto` models automatically use these settings** - no per-model configuration needed
3. **Change `default_spawn` and reload** - all `auto` models are instantly updated
4. **Customize individual models** by switching them to the `manual` state first to prevent rescan overwrites

This ensures consistent, optimized behavior across your entire model collection while allowing per-model overrides when needed. **Always set models to `manual` before customizing their spawn parameters** - otherwise your changes will be lost on the next rescan.
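
A minimal config sketch is shown below. `default_spawn`, `spawn.command`, and the `auto`/`manual` states come from this document; the overall JSON layout, the `models` and `state` field names, and the `{model}`/`{port}` placeholders are illustrative assumptions, not the authoritative schema.

```json
{
  "default_spawn": {
    "command": "llama-server -m {model} --port {port} -ngl 99 -c 8192"
  },
  "models": [
    { "name": "Qwen3-8B-Q4_K_M.gguf", "state": "auto" },
    {
      "name": "MyCustomModel.gguf",
      "state": "manual",
      "spawn": {
        "command": "llama-server -m /models/custom.gguf --port {port} -c 32768"
      }
    }
  ]
}
```

Under a layout like this, editing `default_spawn` and calling `/admin/reload` updates every `auto` entry at once, while the `manual` entry keeps its hand-written `spawn.command`.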

## Multi-Engine Support

llama-router is engine-agnostic. Any OpenAI-compatible inference backend can be orchestrated by configuring the appropriate spawn command and endpoints. The router simply:

1. Spawns the command specified in `spawn.command`
2. Polls `health_endpoint` until it returns HTTP 200 (customizable per backend)
3. Proxies requests matching `proxy_endpoints` to the running instance

This design allows you to mix llama.cpp, vLLM, Ollama, Text Generation Inference, or any custom backend in a single router configuration. Set models to `manual` state when using non-llama.cpp backends to prevent automatic cache rescans from removing them.
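
As a concrete illustration, a hand-written `manual` entry for a vLLM backend might look like the sketch below. `health_endpoint`, `proxy_endpoints`, and the `manual` state come from this document; the JSON layout and the exact vLLM invocation are assumptions to verify against your environment.

```json
{
  "name": "qwen2.5-vllm",
  "state": "manual",
  "spawn": {
    "command": "vllm serve Qwen/Qwen2.5-7B-Instruct --port {port}"
  },
  "health_endpoint": "/health",
  "proxy_endpoints": ["/v1/chat/completions", "/v1/completions", "/v1/models"]
}
```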

### Future: WebUI Administration (TODO)

The admin API endpoints (`/admin/reload`, `/admin/rescan`) are designed to support hot configuration and model management. A future WebUI will enable:

- **Live model downloads** from HuggingFace directly through the interface
- **Hot reconfiguration** of `default_spawn` and per-model settings without restart
- **Real-time monitoring** of running instances and resource usage
- **Interactive model management** (add, remove, customize spawn parameters)

This aligns with the project philosophy: **everything configurable at runtime, zero downtime required**. The current CLI and JSON-based workflow is production-ready; the WebUI will provide a more accessible interface to the same underlying admin API.

---

## Architecture

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        llama-router                         │
│                         (port 8082)                         │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Config    │  │   Scanner   │  │   Process Manager   │  │
│  │   Loader    │  │  (HF cache) │  │  (spawn/terminate)  │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                             │                               │
│  ┌──────────────────────────┴────────────────────────────┐  │
│  │                      HTTP Proxy                        │  │
│  │        (streaming support, header forwarding)          │  │
│  └─────────────────────────────────────────────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ llama-server  │  │ llama-server  │  │ llama-server  │
│ (port 50000)  │  │ (port 50001)  │  │ (port 50002)  │
│   Model A     │  │   Model B     │  │   Model C     │
└───────────────┘  └───────────────┘  └───────────────┘
```

### Request Flow

1. Client sends POST to `/v1/chat/completions` with `"model": "ModelA"`
2. Router checks if ModelA is already running
3. If not running, or if a conflicting group is active:
   - Terminate conflicting backends
   - Spawn new llama-server with assigned port
   - Poll `/health` until ready (10s timeout)
4. Forward request to backend, streaming response back to client
5. Backend remains running for subsequent requests
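
For example, a minimal streaming request that triggers this flow (standard OpenAI-compatible body; `ModelA` stands in for a real model name from your config):

```json
{
  "model": "ModelA",
  "stream": true,
  "messages": [
    { "role": "user", "content": "Hello!" }
  ]
}
```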

### Process Lifecycle

- **Spawn**: `fork()`/`CreateProcess()` with stdout/stderr capture
- **Health polling**: 200ms intervals, 10s timeout
- **Graceful shutdown**: SIGTERM → 1s wait → SIGKILL
- **Cleanup**: File descriptors closed, `waitpid()` called
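
The graceful-shutdown step could look like the POSIX sketch below (illustrative only; `terminate_child` is a hypothetical helper, not the project's actual code):

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// SIGTERM -> up to 1s grace period -> SIGKILL, then reap the child.
void terminate_child(pid_t pid) {
    kill(pid, SIGTERM);                        // ask politely first
    for (int i = 0; i < 10; ++i) {             // poll ~1s in 100ms steps
        if (waitpid(pid, nullptr, WNOHANG) == pid)
            return;                            // child exited and was reaped
        usleep(100 * 1000);
    }
    kill(pid, SIGKILL);                        // force-kill after the grace period
    waitpid(pid, nullptr, 0);                  // reap to avoid a zombie
}
```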

---

## File Structure & Separation of Concerns

| Component | Files | Responsibility |
|-----------|-------|----------------|
| **Core** | `router-app.cpp/h` | Model lifecycle, spawn orchestration, group logic (zero I/O) |
| **HTTP Endpoints** | `router-endpoints.cpp/h` | Public API routes (`/v1/models`, `/v1/chat/completions`) |
| **Admin** | `router-admin.cpp/h` | Admin routes with explicit config persistence |
| **Proxy** | `router-proxy.cpp/h` | HTTP forwarding, SSE streaming, header management |
| **Process** | `router-process.cpp/h` | Cross-platform subprocess spawning, I/O capture |
| **Config** | `router-config.cpp/h` | JSON load/write, rescan logic, `RescanResult` |
| **Scanner** | `router-scanner.cpp/h` | HF cache discovery, `--import-dir`, mmproj detection |
| **Main** | `router.cpp` | CLI parsing, server setup, signal handlers |
| **Utils** | `logging.cpp/h`, `router-constants.h` | Shared logging and constants |

**Design principles enforced:**
- `router-app`: Pure business logic, no filesystem I/O
- `router-admin`: Owns config persistence, explicit writes only
- `router-proxy`: Streaming & forwarding, value-captured lambdas to avoid use-after-free
- `router-process`: Platform abstraction; child processes never call parent logging functions

---

## Technical Notes

### Cross-Platform Process Management

The router handles subprocess spawning differently per platform:

**Linux/macOS:** Uses `fork()` + `execvp()` with careful attention to post-fork behavior. Child processes **must not** call logging functions that access parent singletons - they write directly to `STDERR_FILENO` instead to avoid use-after-fork crashes.
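
A minimal sketch of the POSIX path (illustrative; the real implementation also wires up pipes for stdout/stderr capture):

```cpp
#include <unistd.h>

// Spawn argv[0] with execvp(); the child only touches async-safe calls.
pid_t spawn_backend(char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: parent singletons (logger, etc.) are off-limits after fork.
        execvp(argv[0], argv);                  // resolves argv[0] via PATH
        // Reached only if exec failed: raw write() to stderr, no logger.
        const char msg[] = "exec failed\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(127);                             // skip parent atexit handlers
    }
    return pid;                                 // parent (or -1 on failure)
}
```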

**Windows:** Uses `CreateProcess()` with separate process information structures and handle management.

### SSE Streaming Implementation

Server-Sent Events streaming required careful lifetime management to avoid use-after-free bugs:

1. **Capture by value**: Lambda captures must copy request data (headers, path, body), not reference stack variables that become invalid after the handler returns
2. **Explicit termination**: Call `sink.done()` followed by `return false` to signal httplib to close the connection properly - without this, streams deliver tokens correctly but never terminate
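
A condensed sketch of the pattern, assuming cpp-httplib's chunked content provider (simplified: the real proxy forwards backend SSE chunks rather than a fixed string):

```cpp
#include "httplib.h"

void handle_stream(const httplib::Request &req, httplib::Response &res) {
    // Copy everything the provider needs BY VALUE: `req` is gone by the
    // time httplib invokes the lambda below.
    std::string body = req.body;

    res.set_chunked_content_provider(
        "text/event-stream",
        [body](size_t /*offset*/, httplib::DataSink &sink) {
            std::string chunk = "data: [DONE]\n\n";  // stand-in for proxied tokens
            sink.write(chunk.data(), chunk.size());
            sink.done();       // flush and mark the stream complete
            return false;      // tell httplib to close the connection
        });
}
```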

### PATH Binary Resolution

Spawn commands support both absolute/relative paths and PATH-based binaries:

- **Paths with separators**: `/usr/bin/llama-server`, `./llama-server`, `C:\llama\server.exe` - existence is validated before spawn
- **PATH binaries**: `python`, `vllm`, `ollama`, `llama-server` - no validation; resolution is left to the OS PATH lookup at spawn time

The router only validates file existence for commands containing `/` or `\` path separators, allowing seamless use of system-installed binaries.
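
A sketch of the rule (hypothetical helper, not the project's actual code):

```cpp
#include <filesystem>
#include <string>

// Reject only commands that look like paths but don't exist on disk;
// bare names ("vllm", "llama-server") fall through to PATH resolution.
bool spawn_command_invalid(const std::string &cmd) {
    bool looks_like_path = cmd.find('/') != std::string::npos ||
                           cmd.find('\\') != std::string::npos;
    return looks_like_path && !std::filesystem::exists(cmd);
}
```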

### Model-Scoped Route Stripping

Routes like `/<model>/health` are router-side aliases for convenience. Before proxying to the backend, the router strips the model prefix:

- User request: `GET /Qwen3-8B-Q4_K_M.gguf/health`
- Forwarded to backend: `GET /health`

Backends remain unaware of model-scoped routing - they expose standard endpoints like `/health`, `/v1/chat/completions`, etc.
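
A sketch of the stripping step (hypothetical helper):

```cpp
#include <string>

// "/Qwen3-8B-Q4_K_M.gguf/health" -> "/health"; non-scoped paths pass through.
std::string strip_model_prefix(const std::string &path, const std::string &model) {
    const std::string prefix = "/" + model;
    if (path.rfind(prefix, 0) == 0) {              // path starts with prefix?
        std::string rest = path.substr(prefix.size());
        return rest.empty() ? "/" : rest;
    }
    return path;
}
```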

### HTTP Header Management

The router strips `Content-Length` and `Transfer-Encoding` headers before forwarding requests. This is standard reverse-proxy behavior: the proxy re-chunks data itself, so passing the original framing headers through would conflict with the new message framing.

All other headers are forwarded transparently to preserve client context (authentication, user-agent, etc.).
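
A sketch of the filtering step, assuming cpp-httplib's multimap-style `Headers` (illustrative; a production version would match header names case-insensitively):

```cpp
#include "httplib.h"

// Drop framing headers; copy everything else through unchanged.
httplib::Headers filter_request_headers(const httplib::Headers &in) {
    httplib::Headers out;
    for (const auto &[name, value] : in) {
        if (name == "Content-Length" || name == "Transfer-Encoding")
            continue;                  // recomputed by the proxy's HTTP layer
        out.emplace(name, value);
    }
    return out;
}
```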

### Health Endpoint Purpose

The `health_endpoint` configuration field serves **spawn readiness polling only** - the router uses it to detect when a backend has finished loading and is ready to serve requests.

This is separate from user-facing health routes. Clients can still call `/<model>/health` or `/health` for their own monitoring needs. The backend must expose standard endpoints regardless of which `health_endpoint` is configured for polling.

### Multimodal Projector Priority

When importing collections with `--import-dir`, mmproj files are automatically detected with this search priority:

1. `*-bf16.gguf` (selected first)
2. `*-f16.gguf` (selected if BF16 is not found)
3. `*-f32.gguf` (selected if neither BF16 nor F16 is found)

All quantization variants of a model (Q4_K_M, Q5_K_M, Q6_K, etc.) found in the same directory share the same mmproj file.
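
A sketch of the selection logic (hypothetical helper; matches case-insensitively, since cached files may be named e.g. `mmproj-F16.gguf`):

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <vector>

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Return the first mmproj candidate in bf16 > f16 > f32 priority order.
std::optional<std::string> pick_mmproj(const std::vector<std::string> &candidates) {
    for (const std::string suffix : {"-bf16.gguf", "-f16.gguf", "-f32.gguf"}) {
        for (const auto &file : candidates) {
            const std::string lf = to_lower(file);
            if (lf.size() >= suffix.size() &&
                lf.compare(lf.size() - suffix.size(), suffix.size(), suffix) == 0)
                return file;
        }
    }
    return std::nullopt;   // no mmproj variant found
}
```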

**For manual models:** mmproj auto-detection applies only during the initial import. You can edit `spawn.command` to remove `--mmproj` if unwanted - your changes persist across restarts. Only `auto` models get their spawn configuration regenerated on rescan.

### Manifest Robustness

The HF cache scanner gracefully handles missing or corrupted manifest files:

- If `~/.cache/llama.cpp/` doesn't exist, the scanner returns an empty mapping
- If individual manifest files are missing, they are silently skipped
- Models without manifest entries load successfully, just without mmproj auto-detection

**Cache structure example:**
```
~/.cache/llama.cpp/
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf.etag
├── manifest=bartowski=Qwen2.5-1.5B-Instruct-GGUF=latest.json
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf.etag
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf.etag
└── manifest=unsloth=Qwen3-VL-4B-Instruct-GGUF=Q6_K.json
```

Manifest files (`manifest=vendor=repo=quant.json`) contain metadata for mmproj auto-detection, while cached model files use underscore separators in their names: `vendor_repo_filename.gguf`.

This ensures the router remains operational even with incomplete cache metadata.
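
A sketch of splitting a cached model filename on its first two underscores (hypothetical helper; assumes vendor and repo names themselves contain no underscores, which holds for the examples above even though the file part may, e.g. `Q4_K_M`):

```cpp
#include <optional>
#include <string>
#include <tuple>

// "bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
//   -> { vendor, repo, filename.gguf }
std::optional<std::tuple<std::string, std::string, std::string>>
split_cache_name(const std::string &name) {
    const size_t a = name.find('_');
    if (a == std::string::npos) return std::nullopt;
    const size_t b = name.find('_', a + 1);
    if (b == std::string::npos) return std::nullopt;
    return std::make_tuple(name.substr(0, a),             // vendor
                           name.substr(a + 1, b - a - 1), // repo
                           name.substr(b + 1));           // filename.gguf
}
```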

---

## Signals and Shutdown

The router handles graceful shutdown on:
- `SIGINT` (Ctrl+C)
- `SIGTERM`

Shutdown sequence:
1. Stop accepting new connections
2. Terminate all managed llama-server processes
3. Wait for process cleanup
4. Exit
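
A minimal sketch of the handler wiring (illustrative; real shutdown also has to stop the HTTP server loop):

```cpp
#include <atomic>
#include <csignal>

static std::atomic<bool> g_shutdown{false};

// Signal handlers may only do async-signal-safe work: set a flag, return.
static void on_signal(int) { g_shutdown.store(true); }

int main() {
    std::signal(SIGINT, on_signal);
    std::signal(SIGTERM, on_signal);

    while (!g_shutdown.load()) {
        // ... accept and proxy requests ...
    }
    // Steps 1-4: stop listening, terminate children, waitpid(), exit.
    return 0;
}
```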

---

## Contributing

llama-router is part of the llama.cpp project. Contributions are welcome via pull request.

## License

MIT License - see the llama.cpp repository for details.