An immutable, self-contained appliance that boots MicroShift with the vLLM Semantic Router and a local Small Language Model (SLM) pre-deployed. Simpler queries run on-device via the local GPU; complex queries route to external LLM endpoints — all through a single OpenAI-compatible API.
```
bootc image (CentOS Stream 10)
├── MicroShift (RPM, auto-starts on boot)
├── manifests.d/semantic-router/    ← semantic router (full or slim mode)
├── manifests.d/vllm-slm/           ← local SLM (Qwen2.5-1.5B on GPU)
├── Pre-pulled container images
├── NVIDIA Container Toolkit + CDI  ← GPU runtime (systemd, on every boot)
├── GPU Operator (Helm, post-boot)  ← device plugin + GPU feature discovery
├── /usr/local/bin/setup-gpu-operator.sh
├── /usr/local/bin/configure-semantic-router.sh
└── /etc/semantic-router/templates/
```
```
┌──────────────────────────────┐      ┌──────────────────────────────┐
│  semantic-router namespace   │      │      vllm-slm namespace      │
│  ┌────────────────────────┐  │      │  ┌────────────────────────┐  │
│  │ semantic-router Deploy │  │      │  │ vllm-slm Deployment    │  │
│  │ ├─ extproc (routing)   │  │      │  │ └─ vLLM container      │  │
│  │ └─ envoy (proxy) ──────┼──┼─────►│  │    Qwen2.5-1.5B        │  │
│  └────────────────────────┘  │      │  │    port 8000 (OpenAI)  │  │
│  NodePort 30801 (API)        │      │  │    NVIDIA GPU          │  │
└──────────────────────────────┘      │  └────────────────────────┘  │
        │                             │  NodePort 30500 (direct API) │
        │                             └──────────────────────────────┘
        │
        ├───► External LLM (e.g. litellm.example.com, HTTPS)
        └───► Local SLM (vllm-slm.vllm-slm.svc:8000, HTTP)
```
Three-stage boot flow:
- MicroShift starts → applies manifests → pods wait for config
- User runs `setup-gpu-operator.sh` → GPU becomes available → SLM pod starts
- User runs `configure-semantic-router.sh` → creates ConfigMap + Secret → router starts
| Component | Namespace | Description |
|---|---|---|
| Semantic Router | `semantic-router` | Routes queries to the right model based on domain classification |
| vLLM SLM | `vllm-slm` | Local Qwen2.5-1.5B-Instruct served by vLLM on GPU |
| GPU Operator | `gpu-operator` | NVIDIA device plugin + GPU feature discovery (Helm) |
| Mode | Components | Ports |
|---|---|---|
| full (default) | vllm-sr all-in-one + Grafana + Prometheus + SLM | API:30801, SLM:30500, Dashboard:30700, Grafana:30300 |
| slim | extproc + Envoy sidecar + SLM | API:30801, SLM:30500 |
```shell
podman build -t hybrid-inference-bootc:latest -f Containerfile .
```

CI builds run automatically on push to `main` and publish multi-arch
(amd64 + arm64) manifest lists to
`ghcr.io/<owner>/hybrid-inference-in-a-box:<tag>`. See
`.github/workflows/build-bootc.yaml`.
> **Note:** On first boot, infrastructure pods may show `CreateContainerConfigError`
> (waiting for ConfigMap/Secret) and the vLLM SLM pod will show `Pending`
> (waiting for GPU resources). This is expected.
Deploy via VM (qcow2), bare metal (ISO), or cloud (AMI). MicroShift starts automatically.
Quick start with KVM/libvirt:

```shell
# Full mode (8GB RAM, 4 vCPUs, 100GB disk)
./scripts/start-bootc-vm.sh

# Slim mode (4GB RAM, 2 vCPUs, 40GB disk)
./scripts/start-bootc-vm.sh --mode=slim
```

The GPU Operator installs the NVIDIA device plugin and GPU feature discovery. This is required for the SLM pod to access the GPU.
```shell
sudo setup-gpu-operator.sh
```

This script:

- Configures CRI-O with the NVIDIA container runtime
- Generates CDI specs for GPU device injection
- Grants OpenShift SCCs to GPU Operator service accounts
- Installs the GPU Operator via Helm (driver + toolkit disabled, uses host drivers)
- Waits for `nvidia.com/gpu` to be advertised
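The "driver + toolkit disabled" install corresponds to chart values along these lines (an illustrative sketch; `setup-gpu-operator.sh` contains the authoritative Helm flags):

```yaml
# gpu-operator Helm values (sketch): the host already ships the driver and
# the bootc image bakes in nvidia-container-toolkit, so both are disabled
driver:
  enabled: false
toolkit:
  enabled: false
```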
Once the GPU is available, the vLLM SLM pod downloads the model from HuggingFace and starts serving. First boot takes a few minutes for the download.
```shell
sudo kubectl -n vllm-slm get pods -w
# Wait for READY 1/1

# Verify the model is serving
curl http://<IP>:30500/v1/models
```

Copy the example config and edit it:
```shell
cp config/router.yaml.example router.yaml
vi router.yaml   # edit endpoints, API keys, models
sudo configure-semantic-router.sh router.yaml
```

Local SLM only (simplest setup — no external endpoints needed):
```yaml
providers:
  models:
    - name: "Qwen2.5-1.5B-Instruct"
      endpoints:
        - name: "local-vllm"
          weight: 1
          endpoint: "vllm-slm.vllm-slm.svc:8000"
          protocol: "http"
          access_key: "none"
  default_model: "Qwen2.5-1.5B-Instruct"
```

Hybrid (local SLM + external LLMs — edit endpoints and keys):
```yaml
providers:
  models:
    - name: "Mistral-Small-24B-W8A8"
      endpoints:
        - name: "litellm"
          weight: 1
          endpoint: "litellm.example.com:443"
          protocol: "https"
          access_key: "sk-your-key-here"
    - name: "Qwen2.5-1.5B-Instruct"
      endpoints:
        - name: "local-vllm"
          weight: 1
          endpoint: "vllm-slm.vllm-slm.svc:8000"
          protocol: "http"
          access_key: "none"
  default_model: "Qwen2.5-1.5B-Instruct"
```

Watch the router pods start:

```shell
sudo kubectl -n semantic-router get pods -w
```

Full mode downloads ~18GB of classifier models on first boot. Slim mode downloads ~500MB.
| Endpoint | URL |
|---|---|
| Router API | http://<IP>:30801/v1/chat/completions |
| SLM direct | http://<IP>:30500/v1/chat/completions |
| Dashboard (full) | http://<IP>:30700 |
| Grafana (full) | http://<IP>:30300 |
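The examples below each inline the request body; since `jq` is already assumed for pretty-printing, you can also build the OpenAI-style body once and reuse it (the prompt here is just an example):

```shell
# Build an OpenAI-style chat body; "auto" lets the router pick the model
PAYLOAD=$(jq -n --arg prompt "What is photosynthesis?" \
  '{model: "auto", messages: [{role: "user", content: $prompt}]}')
echo "$PAYLOAD"
```

Pass it to curl with `-d "$PAYLOAD"` in place of an inline body.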
```shell
# Simple query → routed to local SLM
curl -s http://<IP>:30801/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is photosynthesis?"}]}' | jq .

# Coding query → routed to external model
curl -s http://<IP>:30801/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"auto","messages":[{"role":"user","content":"Write a Python quicksort"}]}' | jq .

# Direct SLM access (bypass router)
curl -s http://<IP>:30500/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"What is 2+2?"}]}' | jq .
```

The host must have:
- NVIDIA GPU with drivers pre-installed
- `nvidia-container-toolkit` package (baked into the bootc image)
The NVIDIA GB10 (Blackwell, CUDA capability 12.1) has unified memory shared with the CPU. The vLLM deployment accounts for this:
- `--gpu-memory-utilization 0.5` — only uses 50% of reported GPU memory (the rest is shared with the system)
- `--enforce-eager` — disables Triton/torch.compile (the bundled ptxas doesn't support `sm_121a` yet)
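In the Deployment these become container args; a sketch of the relevant fragment (the actual manifest is `manifests/vllm-slm/base/deployment.yaml`, where the exact arg spelling may differ):

```yaml
args:
  - "--model=Qwen/Qwen2.5-1.5B-Instruct"
  - "--gpu-memory-utilization=0.5"  # half of reported (unified) memory
  - "--enforce-eager"               # skip Triton/torch.compile (sm_121a)
```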
The `generate-nvidia-cdi.sh` systemd service runs on every boot before
MicroShift and:

- Configures CRI-O with the NVIDIA container runtime (`nvidia-ctk runtime configure`)
- Generates CDI specs at `/etc/cdi/nvidia.yaml`
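The before-MicroShift ordering is expressed in the unit file roughly as follows (an illustrative fragment; the unit name and script path are assumptions, the real unit ships in the image):

```ini
[Unit]
Description=Configure CRI-O and generate NVIDIA CDI specs
Before=microshift.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/generate-nvidia-cdi.sh
```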
```shell
sudo select-mode.sh slim
sudo systemctl restart microshift
# Wait ~30s for MicroShift to restart
sudo configure-semantic-router.sh router.yaml
```

To change the configuration, edit `router.yaml` and re-run `configure-semantic-router.sh`:

```shell
sudo configure-semantic-router.sh router.yaml
```

| Baked in image (immutable) | Configured post-boot |
|---|---|
| Namespace, Deployments, Services | Model names (router.yaml) |
| Prometheus + Grafana (full mode) | LLM endpoint(s) and API key(s) |
| vLLM SLM deployment + container image | Default model |
| NVIDIA Container Toolkit + CDI service | GPU Operator (Helm, setup-gpu-operator.sh) |
| Helm binary | |
| Container images (pre-pulled) | |
| Firewall rules, systemd units | |
| Config templates | |
```
hybrid-inference-in-a-box/
├── Containerfile
├── .github/workflows/
│   └── build-bootc.yaml              ← CI/CD: build & push to GHCR
├── manifests/
│   ├── semantic-router/
│   │   ├── kustomization.yaml
│   │   ├── base/
│   │   │   ├── kustomization.yaml
│   │   │   └── namespace.yaml
│   │   └── overlays/
│   │       ├── full/                 ← vllm-sr + grafana + prometheus
│   │       └── slim/                 ← extproc + envoy sidecar
│   └── vllm-slm/
│       ├── kustomization.yaml
│       └── base/
│           ├── kustomization.yaml
│           ├── namespace.yaml
│           ├── deployment.yaml       ← vLLM + Qwen2.5-1.5B on GPU
│           └── service.yaml          ← NodePort 30500
├── config/
│   ├── router.yaml.example           ← sample config (external + local models)
│   ├── llm-router-dashboard.json
│   └── templates/
│       ├── config-full.yaml.tmpl
│       ├── config-slim.yaml.tmpl
│       └── envoy-slim.yaml.tmpl
├── scripts/
│   ├── configure-semantic-router.sh  ← post-boot router configuration
│   ├── setup-gpu-operator.sh         ← install NVIDIA GPU Operator (Helm)
│   ├── generate-nvidia-cdi.sh        ← CRI-O runtime + CDI specs (systemd)
│   ├── select-mode.sh                ← switch full / slim
│   ├── start-bootc-vm.sh             ← create VM from bootc image
│   ├── create-vg.sh                  ← loopback LVM VG for TopoLVM
│   └── make-rshared.service
└── README.md
```
**Pods stuck in `CreateContainerConfigError`:**
Run `configure-semantic-router.sh` — the pods are waiting for ConfigMap/Secret.
**vLLM SLM pod stuck in `Pending`:**
The GPU Operator hasn't advertised `nvidia.com/gpu` yet. Run
`setup-gpu-operator.sh` and check:

```shell
sudo kubectl get nodes -o jsonpath='{.items[0].status.allocatable}' | python3 -m json.tool | grep nvidia
```

**vLLM SLM crashes with "Free memory ... less than desired":**
The default `--gpu-memory-utilization` is too high for unified memory GPUs.
Edit the deployment:

```shell
sudo kubectl -n vllm-slm edit deployment vllm-slm
# Lower --gpu-memory-utilization (default: 0.5, try 0.3)
```

**vLLM crashes with "ptxas fatal: Value 'sm_121a' is not defined":**
The GPU architecture is too new for the bundled Triton. The deployment
includes `--enforce-eager` to work around this. If you removed it, add it
back.
**GPU Operator pods stuck (SCC errors):**
The `setup-gpu-operator.sh` script grants SCCs automatically. If you
installed manually, grant them:

```shell
oc adm policy add-scc-to-user privileged -n gpu-operator -z node-feature-discovery
oc adm policy add-scc-to-user privileged -n gpu-operator -z nvidia-device-plugin
# ... (see setup-gpu-operator.sh for the full list)
```

**TopoLVM pods in CrashLoopBackOff:**
Check that the loopback volume group for TopoLVM was created:

```shell
sudo systemctl status create-vg
sudo vgs myvg1
```

**MicroShift not starting:**
```shell
sudo systemctl status microshift
sudo journalctl -u microshift --no-pager -l
```

**Router not connecting to LLM endpoint:**
Verify the endpoint is reachable with your key from the host:

```shell
curl -s https://<endpoint>/models -H 'Authorization: Bearer <key>'
```