---
title: Runner Images
---

Runner images are lightweight AIKit images that download models at container startup instead of embedding them at build time. This is useful when you want a single reusable image that can serve different models without rebuilding.

## Creating a Runner Image

Define an aikitfile with `backends` but **no `models`**:

```yaml
#syntax=ghcr.io/kaito-project/aikit/aikit:latest
apiVersion: v1alpha1
backends:
  - llama-cpp
```

Build it:

```bash
docker buildx build -t my-runner -f runner.yaml .
```

## Running with a Model

Pass the model reference as a container argument:

```bash
# Direct URL to a specific GGUF file (recommended for CI and reproducibility)
docker run -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf

# HuggingFace repo (downloads all GGUF files in the repo)
docker run -p 8080:8080 my-runner unsloth/gemma-3-1b-it-GGUF

# With the --model flag
docker run -p 8080:8080 my-runner --model unsloth/gemma-3-1b-it-GGUF
```

:::tip
For HuggingFace repos with many quantization variants, use a **direct URL** to a specific file to avoid downloading all variants.
:::
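The three invocation styles above all resolve to a model reference. As an illustrative sketch (not AIKit's actual code — the function and names here are hypothetical), a runner entrypoint might classify its argument like this:

```python
def classify_model_ref(args: list[str]) -> tuple[str, str]:
    """Return (kind, ref): 'url' for a direct file, 'hf-repo' otherwise.

    Accepts either a bare positional argument or a --model flag,
    mirroring the three docker run invocations shown above.
    """
    if args and args[0] == "--model":
        ref = args[1]
    else:
        ref = args[0]
    if ref.startswith(("http://", "https://")):
        return ("url", ref)      # direct download of one GGUF file
    return ("hf-repo", ref)      # HuggingFace repo: fetch its GGUF files

print(classify_model_ref(["--model", "unsloth/gemma-3-1b-it-GGUF"]))
```

A direct URL maps to a single-file download, which is why it avoids pulling every quantization variant in a repo.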

## Supported Backends

| Backend | Description |
|---|---|
| `llama-cpp` | GGUF models via llama.cpp (CPU or CUDA) |
| `diffusers` | HuggingFace diffusers models (requires CUDA) |
| `vllm` | HuggingFace safetensors models via vLLM (requires CUDA) |

## CUDA Runner Images

For GPU-accelerated inference, add `runtime: cuda`:

```yaml
#syntax=ghcr.io/kaito-project/aikit/aikit:latest
apiVersion: v1alpha1
runtime: cuda
backends:
  - llama-cpp
```

Run with GPU support:

```bash
docker run --gpus all -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
```

:::note
CUDA runner images include a CPU fallback: if no NVIDIA GPU is detected at runtime, the image automatically uses the CPU backend.
:::
| 72 | + |
| 73 | +## Environment Variables |
| 74 | + |
| 75 | +| Variable | Description | |
| 76 | +|---|---| |
| 77 | +| `HF_TOKEN` | HuggingFace token for gated models | |
| 78 | + |
| 79 | +```bash |
| 80 | +docker run -e HF_TOKEN=hf_xxx -p 8080:8080 my-runner meta-llama/Llama-3.2-1B-Instruct-GGUF |
| 81 | +``` |
| 82 | + |
| 83 | +## Volume Caching |
| 84 | + |
| 85 | +Mount a volume to `/models` to cache downloaded models across container restarts: |
| 86 | + |
| 87 | +```bash |
| 88 | +docker run -v models:/models -p 8080:8080 my-runner https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf |
| 89 | +``` |
| 90 | + |
| 91 | +The runner detects when a different model is requested and re-downloads automatically. |
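The change-detection step can be sketched as recording which model reference the cache was populated from and comparing it on the next startup. This is an illustrative sketch, not AIKit's actual implementation; the `.model-ref` marker file is a hypothetical detail:

```python
from pathlib import Path

def needs_download(model_ref: str, cache_dir: str = "/models") -> bool:
    """Skip the download only when the cached ref matches the request.

    A marker file in the cache volume records which model reference
    the current contents came from; a mismatch triggers a re-download.
    """
    marker = Path(cache_dir) / ".model-ref"
    if marker.exists() and marker.read_text().strip() == model_ref:
        return False  # same model already cached in the volume
    return True       # first run, or a different model was requested
```

Because the marker lives in the mounted volume, the decision survives container restarts but not volume removal.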

## Kubernetes / kubeairunway

Runner images are compatible with [kubeairunway](https://github.com/kaito-project/kubeairunway). The `huggingface://` URI scheme used by kubeairunway is handled automatically:

```yaml
apiVersion: kubeairunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: gemma-cpu
spec:
  model:
    id: "google/gemma-3-1b-it-qat-q8_0-gguf"
    source: huggingface
  engine:
    type: llamacpp
    image: "ghcr.io/kaito-project/aikit/runners/llama-cpp-cpu:latest"
```

## Pre-built Runner Images

Pre-built runner images are available at `ghcr.io/kaito-project/aikit/runners/`:

| Image | Description |
|---|---|
| `ghcr.io/kaito-project/aikit/runners/llama-cpp-cpu` | CPU-only llama.cpp runner |
| `ghcr.io/kaito-project/aikit/runners/llama-cpp-cuda` | CUDA llama.cpp runner with CPU fallback |
| `ghcr.io/kaito-project/aikit/runners/diffusers-cuda` | CUDA diffusers runner |
| `ghcr.io/kaito-project/aikit/runners/vllm-cuda` | CUDA vLLM runner |