Local mode is a first-class path in this project, but models are not shipped inside Docker images. You must download and mount models into `./models` (default) or provide equivalent paths via environment variables.
- Predictable “local stack” that boots reliably
- Clear expectations for CPU/RAM/GPU
- Explicit build profiles so contributors don’t accidentally ship multi-GB images
### local-core

Use when you want “fully local” call handling with a predictable, smaller stack:
- STT: Vosk
- LLM: llama.cpp (GGUF) (e.g., Phi-3-mini)
- TTS: Piper
- No Sherpa, no Kokoro, no embedded Kroko
Run:

```shell
docker compose -f docker-compose.yml -f docker-compose.local-core.yml build local_ai_server
docker compose up -d
```
### local-full

Enables additional backends (Sherpa, Kokoro, embedded Kroko). This profile is heavier, increases build times, and may exceed CI runner disk limits.
Run (default settings):

```shell
docker compose build local_ai_server
```
### GPU

Uses Dockerfile.gpu for CUDA-enabled llama.cpp + Faster Whisper. Includes all backends from local-full plus CUDA support.
Run:

```shell
docker compose -f docker-compose.yml -f docker-compose.gpu.yml build local_ai_server
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d local_ai_server
```
Prerequisites: NVIDIA GPU, nvidia-container-toolkit. See LOCAL_ONLY_SETUP.md for full GPU setup.
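Before the first GPU build, it can help to sanity-check the prerequisites. This is a sketch: the CUDA image tag below is only an example, not something this project pins.

```shell
# Sanity-check the GPU prerequisites before building the GPU profile.
if ! command -v nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi not found: install the NVIDIA driver first"
else
  nvidia-smi
  # Confirm Docker can reach the GPU via nvidia-container-toolkit
  # (the CUDA image tag here is only an example):
  docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi \
    || echo "Docker could not reach the GPU: check nvidia-container-toolkit"
fi
```

If the inner `docker run` prints the same GPU table as the host `nvidia-smi`, the toolkit is wired up correctly.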
INCLUDE_KROKO_EMBEDDED is off by default (lighter image). Enable it only if you specifically want the embedded Kroko ONNX websocket server inside local_ai_server.
- Enable for a build:

  ```shell
  INCLUDE_KROKO_EMBEDDED=true docker compose build local_ai_server
  ```
Performance expectations:
- CPU-only “core”: expect multi-second LLM turns on small CPUs; prioritize fewer concurrent local calls.
- GPU (if used): large variance by GPU + model; tune `LOCAL_LLM_GPU_LAYERS` and thread counts.
- RAM: ensure the model(s) you choose fit comfortably; leave headroom for Docker + Asterisk.
These are approximate and vary by model variant/quantization and OS memory behavior. Verify with `du -sh` on your actual model files.
**Core profile:**

| Component | Example default | Disk (approx) | RAM (approx) | Notes |
|---|---|---|---|---|
| STT | Vosk `vosk-model-en-us-0.22` | ~1.5–2.0 GB | ~0.5–1.0 GB | Larger models improve accuracy; CPU-only |
| LLM | Phi-3-mini GGUF Q4_K_M | ~2.0–2.6 GB | ~3–5 GB | CPU-only can be slow; GPU layers reduce CPU load |
| TTS | Piper `en_US-lessac-medium.onnx` | ~50–150 MB | ~0.2–0.6 GB | Voice/model dependent |
| Total (core) | Vosk + Phi-3-mini + Piper | ~3.6–4.8 GB | ~4–7 GB | Add headroom for Docker + Asterisk |
**GPU profile:**

| Component | Example default | Disk (approx) | RAM/VRAM (approx) | Notes |
|---|---|---|---|---|
| STT | Faster Whisper `base` | ~150 MB | ~0.5 GB VRAM | CUDA accelerated; tiny to large-v3 available |
| LLM | Phi-3-mini GGUF Q4_K_M | ~2.0–2.6 GB | ~3–5 GB VRAM | `LOCAL_LLM_GPU_LAYERS=-1` for full offload |
| TTS | Kokoro (HF mode) | ~300–500 MB | ~0.5–1.0 GB | Auto-downloads from HuggingFace Hub |
| Total (GPU) | Faster Whisper + Phi-3 + Kokoro | ~2.5–3.3 GB | ~4–7 GB VRAM | Validated ~665ms E2E on RTX 4090 |
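The disk estimates above can be checked against what you actually downloaded, e.g. with a short loop over the default `./models` layout (paths are the compose defaults; adjust if yours differ):

```shell
# Print the on-disk size of each model directory so the tables above
# can be compared against your actual downloads.
for d in ./models/stt ./models/llm ./models/tts; do
  if [ -d "$d" ]; then
    du -sh "$d"
  else
    echo "missing: $d (not downloaded yet?)"
  fi
done
```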
Practical guidance:
- CPU-only host: 8 GB RAM minimum; 16 GB recommended for smoother turns and fewer OOM surprises.
- GPU host: VRAM needs depend on how many layers you offload; start with `LOCAL_LLM_GPU_LAYERS=0`, then increase gradually.
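That progression can be sketched as successive `.env` values — set one at a time and rebuild/restart between attempts. Only `LOCAL_LLM_GPU_LAYERS` itself is a real setting here; the intermediate value is an illustrative example.

```shell
LOCAL_LLM_GPU_LAYERS=0     # CPU-only baseline
LOCAL_LLM_GPU_LAYERS=20    # partial offload (example value); watch VRAM with nvidia-smi
LOCAL_LLM_GPU_LAYERS=-1    # full offload, as noted in the GPU table
```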
Mounted by docker-compose.yml:
- Host: `./models`
- Container: `/app/models`
Defaults used by local_ai_server:
- STT: `LOCAL_STT_MODEL_PATH=/app/models/stt/...`
- LLM: `LOCAL_LLM_MODEL_PATH=/app/models/llm/...`
- TTS: `LOCAL_TTS_MODEL_PATH=/app/models/tts/...`
Adjust these in `.env` to match your downloaded model locations.
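For example, a `.env` might look like the following. The file names are illustrative, drawn from the example defaults in the tables above — the exact GGUF file name in particular is hypothetical and depends on which quantization you downloaded.

```shell
# Illustrative .env overrides — replace file names with your actual downloads.
LOCAL_STT_MODEL_PATH=/app/models/stt/vosk-model-en-us-0.22
LOCAL_LLM_MODEL_PATH=/app/models/llm/phi-3-mini-q4_k_m.gguf   # hypothetical file name
LOCAL_TTS_MODEL_PATH=/app/models/tts/en_US-lessac-medium.onnx
```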