
Commit 37a4f63

server : add development documentation (ggml-org#17760)
* first draft
* rewrite
* update & remove duplicated sections
1 parent 2bc9693 commit 37a4f63

File tree

2 files changed: +170, -126 lines changed


tools/server/README-dev.md

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@

# llama-server Development Documentation

This document provides an in-depth technical overview of `llama-server`, intended for maintainers and contributors.

If you are an end user consuming `llama-server` as a product, please refer to the main [README](./README.md) instead.

## Backend

### Overview

The server supports two primary operating modes:

- **Inference mode**: The default mode for performing inference with a single loaded GGUF model.
- **Router mode**: Enables management of multiple inference server instances behind a single API endpoint. Requests are automatically routed to the appropriate backend instance based on the requested model.

The core architecture consists of the following components:

- `server_context`: Holds the primary inference state, including the main `llama_context` and all active slots.
- `server_slot`: An abstraction over a single “sequence” in llama.cpp, responsible for managing individual parallel inference requests.
- `server_routes`: Middleware layer between `server_context` and the HTTP interface; handles JSON parsing/formatting and request routing logic.
- `server_http_context`: Implements the HTTP server using `cpp-httplib`.
- `server_queue`: Thread-safe queue used by HTTP workers to submit new tasks to `server_context`.
- `server_response`: Thread-safe queue used by `server_context` to return results to HTTP workers.
- `server_response_reader`: Higher-level wrapper around the two queues above for cleaner code.
- `server_task`: Unit of work pushed into `server_queue`.
- `server_task_result`: Unit of result pushed into `server_response`.
- `server_tokens`: Unified representation of token sequences (supports both text and multimodal tokens); used by `server_task` and `server_slot`.
- `server_prompt_checkpoint`: For recurrent (e.g., RWKV) and SWA models, stores snapshots of KV cache state. Enables reuse when subsequent requests share the same prompt prefix, saving redundant computation.
- `server_models`: Standalone component for managing multiple backend instances (used in router mode). It is completely independent of `server_context`.

```mermaid
graph TD
    API_User <--> server_http_context
    server_http_context <-- router mode --> server_models
    server_http_context <-- inference mode --> server_routes
    server_routes -- server_task --> server_queue
    subgraph server_context
        server_queue --> server_slot
        server_slot -- server_task_result --> server_response
        server_slot[multiple server_slot]
    end
    server_response --> server_routes
```
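
The hand-off between HTTP workers and the `server_context` thread follows a producer/consumer pattern built on the two thread-safe queues described above. The sketch below is a minimal, self-contained illustration of that pattern only: `task`, `result`, and `blocking_queue` are simplified stand-ins invented for this example, not the real `server_task`, `server_task_result`, `server_queue`, or `server_response` classes.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

// Simplified stand-ins for server_task and server_task_result.
struct task   { int id; std::string prompt; };
struct result { int id; std::string text; };

// Simplified stand-in for server_queue / server_response: a blocking FIFO.
template <typename T>
struct blocking_queue {
    std::deque<T>           items;
    std::mutex              mtx;
    std::condition_variable cv;

    void push(T v) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            items.push_back(std::move(v));
        }
        cv.notify_one();
    }

    T pop() { // blocks until an item is available
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !items.empty(); });
        T v = std::move(items.front());
        items.pop_front();
        return v;
    }
};

int main() {
    blocking_queue<task>   tasks;   // HTTP workers -> server_context (like server_queue)
    blocking_queue<result> results; // server_context -> HTTP workers (like server_response)

    // Single "server_context" thread: pops tasks, pushes results.
    std::thread ctx_thread([&] {
        task t = tasks.pop();
        results.push({ t.id, "echo: " + t.prompt });
    });

    // An "HTTP worker": submits a task, then blocks waiting for its result
    // (conceptually close to what server_response_reader wraps around the two queues).
    tasks.push({ 1, "hello" });
    result r = results.pop();
    std::printf("task %d -> %s\n", r.id, r.text.c_str());

    ctx_thread.join();
    return 0;
}
```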

TODO: describe how batching is handled by `server_slot`

### Thread Management

`server_context` runs on a dedicated single thread. Because it is single-threaded, heavy post-processing (especially after token generation) should be avoided, as it directly impacts multi-sequence throughput.

Each incoming HTTP request is handled by its own thread, managed by the HTTP library. The following operations are performed in HTTP worker threads:

- JSON request parsing
- Chat template application
- Tokenization
- Conversion of `server_task_result` into the final JSON response
- Error formatting into JSON
- Tracking of partial/incremental responses (e.g., streaming tool calls or reasoning steps)

**Best practices to follow:**

- All JSON formatting and chat template logic must stay in the HTTP layer.
- Avoid passing raw JSON between the HTTP layer and `server_slot`. Instead, parse everything into native C++ types as early as possible (see the sketch below).
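
As a purely illustrative sketch of that second point, the snippet below converts a request body into a plain struct inside the HTTP worker and turns the result back into JSON on the way out. `completion_params` and both helper functions are hypothetical names made up for this example; only the `nlohmann::json` calls correspond to the library the server actually uses.

```cpp
// Hypothetical illustration of "parse early, format late" at the HTTP boundary.
// completion_params and the function names are invented for this sketch;
// nlohmann::json is the JSON library the server depends on.
#include <nlohmann/json.hpp>

#include <cstdio>
#include <string>

using json = nlohmann::json;

// Native representation handed further down the stack (hypothetical).
struct completion_params {
    std::string prompt;
    int         n_predict;
    float       temperature;
};

// HTTP worker: convert the request body into native C++ types right away.
static completion_params parse_request(const std::string & body) {
    json j = json::parse(body);
    completion_params p;
    p.prompt      = j.at("prompt").get<std::string>();
    p.n_predict   = j.value("n_predict", 128);
    p.temperature = j.value("temperature", 0.8f);
    return p;
}

// HTTP worker: turn a plain result string back into a JSON response.
static std::string format_response(const std::string & generated_text) {
    json j = { { "content", generated_text } };
    return j.dump();
}

int main() {
    const completion_params p = parse_request(R"({"prompt": "Hello", "n_predict": 16})");
    std::printf("prompt=%s n_predict=%d temperature=%.2f\n",
                p.prompt.c_str(), p.n_predict, p.temperature);
    std::printf("%s\n", format_response("Hello back").c_str());
    return 0;
}
```

Keeping this conversion at the HTTP boundary means the slot layer only ever deals with native values, which helps keep work on the single `server_context` thread to a minimum.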

### Testing

`llama-server` includes an automated test suite based on `pytest`.

The framework automatically starts a `llama-server` instance, sends requests, and validates responses.

For detailed instructions, see the [test documentation](./tests/README.md).

### Notable Related PRs

- Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443
- Parallel decoding support: https://github.com/ggml-org/llama.cpp/pull/3228
- Refactor introducing `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065
- Reranking endpoint: https://github.com/ggml-org/llama.cpp/pull/9510
- Multimodal model support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898
- Unified KV cache handling: https://github.com/ggml-org/llama.cpp/pull/16736
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470

## Web UI

The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation.

The SvelteKit-based Web UI is introduced in this PR: https://github.com/ggml-org/llama.cpp/pull/14839

### Features

- **Chat interface** with streaming responses
- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection
- **Modality validation** - ensures selected model supports conversation's attachments (images, audio)
- **Conversation management** - branching, regeneration, editing with history preservation
- **Attachment support** - images, audio, PDFs (with vision/text fallback)
- **Configurable parameters** - temperature, top_p, etc. synced with server defaults
- **Dark/light theme**

### Tech Stack

- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state
- **TailwindCSS** + **shadcn-svelte** - styling and UI components
- **Vite** - build tooling
- **IndexedDB** (Dexie) - local storage for conversations
- **LocalStorage** - user settings persistence

### Architecture

The WebUI follows a layered architecture:

```
Routes → Components → Hooks → Stores → Services → Storage/API
```

- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`)
- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`)
- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`)

For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/):

- `high-level-architecture.mmd` - full architecture with all modules
- `high-level-architecture-simplified.mmd` - simplified overview
- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode
- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode
- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.)

### Development

```sh
# make sure you have Node.js installed
cd tools/server/webui
npm i

# run dev server (with hot reload)
npm run dev

# run tests
npm run test

# build production bundle
npm run build
```

After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build](#build) section to include the updated UI.

**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development.

tools/server/README.md

Lines changed: 19 additions & 126 deletions

@@ -2,7 +2,7 @@

Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.

-Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
+Set of LLM REST APIs and a web UI to interact with llama.cpp.

**Features:**
* LLM inference of F16 and quantized models on GPU and CPU

@@ -19,7 +19,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
* Speculative decoding
* Easy-to-use web UI

-The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggml-org/llama.cpp/issues/4216).
+For the full list of features, please refer to the [server's changelog](https://github.com/ggml-org/llama.cpp/issues/9291).

## Usage

@@ -289,69 +289,6 @@ For more details, please refer to [multimodal documentation](../../docs/multimod
cmake --build build --config Release -t llama-server
```

-[the entire "## Web UI" section (Features, Tech Stack, Architecture, Development) is removed here; it was moved into tools/server/README-dev.md, shown above]

## Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:

@@ -380,7 +317,7 @@ docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:se
docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```

-## Testing with CURL
+## Using with CURL

Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.

@@ -391,46 +328,6 @@ curl --request POST \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

-## Advanced testing
-
-We implemented a [server test framework](./tests/README.md) using human-readable scenario.
-
-*Before submitting an issue, please try to reproduce it with this format.*
-
-## Node JS Test
-
-You need to have [Node.js](https://nodejs.org/en) installed.
-
-```bash
-mkdir llama-client
-cd llama-client
-```
-
-Create an index.js file and put this inside:
-
-```javascript
-const prompt = "Building a website can be done in 10 simple steps:"
-
-async function test() {
-    let response = await fetch("http://127.0.0.1:8080/completion", {
-        method: "POST",
-        body: JSON.stringify({
-            prompt,
-            n_predict: 64,
-        })
-    })
-    console.log((await response.json()).content)
-}
-
-test()
-```
-
-And run it:
-
-```bash
-node index.js
-```
-

## API Endpoints

### GET `/health`: Returns health check result

@@ -1638,6 +1535,22 @@ Response:
}
```

+## API errors
+
+`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi
+
+Example of an error:
+
+```json
+{
+  "error": {
+    "code": 401,
+    "message": "Invalid API Key",
+    "type": "authentication_error"
+  }
+}
+```
+

## More examples

### Interactive mode

@@ -1657,26 +1570,6 @@ Run with bash:
bash chat.sh
```

-[the old "### OAI-like API" and "### API errors" subsections are removed here; their content now lives in the "## API errors" section added above]

Apart from error types supported by OAI, we also have custom types that are specific to functionalities of llama.cpp:

**When /metrics or /slots endpoint is disabled**
