
Commit 90869bd

docs: add runner images documentation and missing tests
- Add website/docs/runners.md covering runner image creation, usage with direct URLs and HuggingFace repos, CUDA support, volume caching, kubeairunway integration, and pre-built images
- Update specs-inference.md to document runner mode (backends without models) and link to the new runner docs
- Add runners.md to sidebar under Features
- Add test for model config generation after GGUF download

Parent: bba6c6b

4 files changed: +145 −1

pkg/aikit2llb/inference/runner_test.go (19 additions, 0 deletions)

Adds a test covering model config generation after a GGUF download:

```go
func TestGenerateRunnerScriptModelConfig(t *testing.T) {
	config := &config.InferenceConfig{
		Backends: []string{utils.BackendLlamaCpp},
	}

	script := generateRunnerScript(config)

	// Should generate a model config YAML after downloading GGUF
	if !strings.Contains(script, "backend: llama-cpp") {
		t.Error("should generate a model config with llama-cpp backend")
	}
	if !strings.Contains(script, "parameters:") {
		t.Error("should include parameters section in generated config")
	}
	if !strings.Contains(script, ".yaml") {
		t.Error("should write a .yaml config file")
	}
}
```
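The assertions above only pin down the shape of the generated config. A hypothetical example of a YAML file that would satisfy them (the `name` field and the model path are assumptions, not taken from the implementation):

```shell
# Write a model config of the asserted shape; "name" and the model path are hypothetical
cat > model-config.yaml <<'EOF'
name: model
backend: llama-cpp
parameters:
  model: /models/model.gguf
EOF

# The same substring checks the Go test performs
grep -q 'backend: llama-cpp' model-config.yaml && \
grep -q 'parameters:' model-config.yaml && \
echo "config shape ok"   # prints "config shape ok"
```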

website/docs/runners.md (120 additions, 0 deletions)

A new docs page:

---
title: Runner Images
---

Runner images are lightweight AIKit images that download models at container startup instead of embedding them at build time. This is useful when you want a single reusable image that can serve different models without rebuilding.

## Creating a Runner Image

Define an aikitfile with `backends` but **no `models`**:

```yaml
#syntax=ghcr.io/kaito-project/aikit/aikit:latest
apiVersion: v1alpha1
backends:
- llama-cpp
```

Build it:

```bash
docker buildx build -t my-runner -f runner.yaml .
```

## Running with a Model

Pass the model reference as a container argument:

```bash
# Direct URL to a specific GGUF file (recommended for CI and reproducibility)
docker run -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf

# HuggingFace repo (downloads all GGUF files in the repo)
docker run -p 8080:8080 my-runner unsloth/gemma-3-1b-it-GGUF

# With --model flag
docker run -p 8080:8080 my-runner --model unsloth/gemma-3-1b-it-GGUF
```

:::tip
For HuggingFace repos with many quantization variants, use a **direct URL** to a specific file to avoid downloading all variants.
:::

## Supported Backends

| Backend | Description |
|---|---|
| `llama-cpp` | GGUF models via llama.cpp (CPU or CUDA) |
| `diffusers` | HuggingFace diffusers models (requires CUDA) |
| `vllm` | HuggingFace safetensors models via vLLM (requires CUDA) |

## CUDA Runner Images

For GPU-accelerated inference, add `runtime: cuda`:

```yaml
#syntax=ghcr.io/kaito-project/aikit/aikit:latest
apiVersion: v1alpha1
runtime: cuda
backends:
- llama-cpp
```

Run with GPU support:

```bash
docker run --gpus all -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
```

:::note
CUDA runner images include a CPU fallback: if no NVIDIA GPU is detected at runtime, the image automatically uses the CPU backend.
:::

## Environment Variables

| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace token for gated models |

```bash
docker run -e HF_TOKEN=hf_xxx -p 8080:8080 my-runner meta-llama/Llama-3.2-1B-Instruct-GGUF
```

## Volume Caching

Mount a volume to `/models` to cache downloaded models across container restarts:

```bash
docker run -v models:/models -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
```

The runner detects when a different model is requested and re-downloads automatically.
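To see what has been cached, the volume contents can be inspected from a throwaway container (a sketch; the exact file layout under `/models` depends on the model source):

```shell
# List cached model files in the named "models" volume
# (file layout under /models is an assumption and may vary by model source)
docker run --rm -v models:/models alpine ls -lh /models
```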
## Kubernetes / kubeairunway

Runner images are compatible with [kubeairunway](https://github.com/kaito-project/kubeairunway). The `huggingface://` URI scheme used by kubeairunway is handled automatically:

```yaml
apiVersion: kubeairunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: gemma-cpu
spec:
  model:
    id: "google/gemma-3-1b-it-qat-q8_0-gguf"
    source: huggingface
  engine:
    type: llamacpp
    image: "ghcr.io/kaito-project/aikit/runners/llama-cpp-cpu:latest"
```
## Pre-built Runner Images

Pre-built runner images are available at `ghcr.io/kaito-project/aikit/runners/`:

| Image | Description |
|---|---|
| `ghcr.io/kaito-project/aikit/runners/llama-cpp-cpu` | CPU-only llama.cpp runner |
| `ghcr.io/kaito-project/aikit/runners/llama-cpp-cuda` | CUDA + CPU fallback llama.cpp runner |
| `ghcr.io/kaito-project/aikit/runners/diffusers-cuda` | CUDA diffusers runner |
| `ghcr.io/kaito-project/aikit/runners/vllm-cuda` | CUDA vLLM runner |
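These can be used in place of a locally built runner; for example (the `latest` tag and the model URL here are illustrative):

```shell
# Run the pre-built CPU runner directly with a model URL
docker run -p 8080:8080 ghcr.io/kaito-project/aikit/runners/llama-cpp-cpu:latest \
  https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
```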

website/docs/specs-inference.md (5 additions, 1 deletion)

Documents runner mode in the spec and links to the new runner docs:

````diff
@@ -9,7 +9,7 @@ apiVersion: # required. only v1alpha1 is supported at the moment
 debug: # optional. if set to true, debug logs will be printed
 runtime: # optional. defaults to avx. can be "avx", "avx2", "avx512", "cuda"
 backends: # optional. list of additional backends. can be "llama-cpp" (default), "diffusers", "vllm"
-models: # required. list of models to build
+models: # optional. list of models to build. omit for runner mode (see runners.md)
   - name: # required. name of the model
     source: # required. source of the model. can be a url or a local file
     sha256: # optional. sha256 hash of the model file
@@ -19,6 +19,10 @@
     config: # optional. list of config files
 ```
+
+:::tip
+When `backends` is specified without `models`, a **runner image** is created that downloads models at container startup. See [Runner Images](runners.md) for details.
+:::
 
 Example:
 
 ```yaml
````
website/sidebars.js (1 addition, 0 deletions)

Adds the new page to the sidebar under Features:

```diff
@@ -31,6 +31,7 @@ const sidebars = {
       collapsed: false,
       items: [
         'create-images',
+        'runners',
         'fine-tune',
         'packaging',
         'vision',
```
