Run your own local LLM for the Asterisk AI Voice Agent using Ollama.
- No API Key Required - Fully self-hosted, no cloud dependencies
- Privacy - All data stays on your network
- No Usage Costs - Run unlimited calls without API fees
- Tool Calling Support - Compatible models can hang up calls, transfer callers, and send email summaries
- Hardware:
  - Mac Mini (M1/M2/M3) - excellent performance
  - Gaming PC with an NVIDIA GPU (8GB+ VRAM recommended)
  - Any machine with 16GB+ RAM for CPU-only inference
- Software: Ollama installed ([download](https://ollama.ai/download))
- Network: Ollama must be reachable from the Docker containers
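The network prerequisite can be verified with a quick TCP probe from any machine that will run the Docker containers. A minimal sketch (the host IP and helper name are examples, not part of the project):

```python
# Quick TCP reachability probe for an Ollama server.
import socket

def ollama_reachable(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Substitute the IP of the machine running Ollama
    print(ollama_reachable("192.168.1.100"))
```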
Install Ollama:

```bash
# macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from https://ollama.ai/download
```

Pull a model:

```bash
# Recommended: Llama 3.2 (supports tool calling)
ollama pull llama3.2

# Smaller model for limited hardware
ollama pull llama3.2:1b

# Alternative: Mistral (also supports tools)
ollama pull mistral
```

By default, Ollama only listens on localhost. For Docker to reach it, expose it on your network:
```bash
# Start Ollama listening on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
```

Or set it permanently:
```bash
# Linux/macOS - add to ~/.bashrc or ~/.zshrc
export OLLAMA_HOST=0.0.0.0

# Then restart Ollama
ollama serve
```

Find your machine's IP address:

```bash
# macOS
ipconfig getifaddr en0

# Linux
hostname -I | awk '{print $1}'

# Windows
ipconfig
```

Add an Ollama pipeline to your configuration, using that IP as the base URL:

```yaml
pipelines:
  local_ollama:
    stt: local_stt
    llm: ollama_llm
    tts: local_tts
    tools:
      - hangup_call
      - transfer
      - send_email_summary
    options:
      llm:
        base_url: http://192.168.1.100:11434  # Your IP address
        model: llama3.2
        temperature: 0.7
        num_ctx: 8192  # Optional: match your model's context window
        timeout_sec: 60
        tools_enabled: true

active_pipeline: local_ollama
```

These models can use tools such as hangup_call, transfer, and send_email_summary:
| Model | Size | Tool Calling | Best For |
|---|---|---|---|
| `llama3.2` | 2GB | ✅ Yes | General use, good balance |
| `llama3.2:1b` | 1.3GB | ✅ Yes | Limited hardware |
| `llama3.2:3b` | 2GB | ✅ Yes | Better quality |
| `llama3.1` | 4.7GB | ✅ Yes | Higher quality |
| `mistral` | 4.1GB | ✅ Yes | Fast, good quality |
| `mistral-nemo` | 7.1GB | ✅ Yes | Best Mistral quality |
| `qwen2.5` | 4.7GB | ✅ Yes | Multilingual support |
| `qwen2.5:7b` | 4.7GB | ✅ Yes | Good balance |
| `command-r` | 18GB | ✅ Yes | Enterprise quality |
These models work for conversation but cannot execute actions:
| Model | Size | Notes |
|---|---|---|
| `phi3` | 2.2GB | Good for simple conversations |
| `gemma2` | 5.4GB | Google's model |
| `tinyllama` | 637MB | Very small, limited quality |
Note: Models without tool calling will respond but cannot hang up calls, transfer, or send emails. Users must hang up manually.
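Tool calling over Ollama works by attaching a `tools` array to the `/api/chat` request; a tool-capable model then answers with `tool_calls` in its message instead of (or alongside) text. A sketch of that exchange with a hypothetical `hangup_call` schema — the AI Engine defines its own schemas, so treat the field contents here as illustrative:

```python
# Illustrative tool schema in Ollama's OpenAI-style format.
HANGUP_TOOL = {
    "type": "function",
    "function": {
        "name": "hangup_call",
        "description": "End the current phone call",
        "parameters": {"type": "object", "properties": {}},
    },
}

def build_chat_request(model: str, user_text: str) -> dict:
    """Assemble an /api/chat request body with the tool list attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [HANGUP_TOOL],
        "stream": False,
    }

def extract_tool_calls(response: dict) -> list:
    """Pull tool-call names out of an /api/chat response, if any."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [c["function"]["name"] for c in calls]

# Example response shape when the model decides to hang up:
sample = {"message": {"role": "assistant", "content": "",
                      "tool_calls": [{"function": {"name": "hangup_call",
                                                   "arguments": {}}}]}}
```

Models without tool support simply return plain text, which is why `extract_tool_calls` comes back empty for them.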
**Ollama not responding:**

- Check Ollama is running:

  ```bash
  curl http://localhost:11434/api/tags
  ```

- Ensure network access:

  ```bash
  # Must show 0.0.0.0:11434
  OLLAMA_HOST=0.0.0.0 ollama serve
  ```

- Test from another machine:

  ```bash
  curl http://YOUR_IP:11434/api/tags
  ```

- Check firewall: ensure port 11434 is open.
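The `/api/tags` checks above return JSON listing the installed models. A small helper to confirm a required model is present — the sample response body is illustrative:

```python
# /api/tags returns JSON like {"models": [{"name": "llama3.2:latest", ...}, ...]}.
import json

def installed_models(tags_json: str) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

def has_model(tags_json: str, wanted: str) -> bool:
    """True if `wanted` matches an installed model, ignoring the :latest suffix."""
    names = installed_models(tags_json)
    return any(n == wanted or n == f"{wanted}:latest" for n in names)

# Illustrative response body:
sample = '{"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}'
```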
**Slow responses:**

- Local models are slower than cloud APIs
- Increase `timeout_sec` in config (try 120 for larger models)
- Use a smaller model (`llama3.2:1b` instead of `llama3.2`)
**Model not found:**

```bash
# List installed models
ollama list

# Pull the missing model
ollama pull llama3.2
```

**Tools not working:**

- Check that the model supports tools (see table above)
- Ensure `tools_enabled: true` in config
- Check logs for "Ollama tool calls detected"
**Premature hangups:**

Some models may over-eagerly emit hangup_call tool calls even when the caller did not say goodbye. Quick mitigations:

- Disable tools for Ollama: set `tools_enabled: false`, or
- Remove `hangup_call` from your context's `tools:` list.
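If you control the pipeline code, a stricter mitigation is to gate model-issued hangups on a farewell heuristic. This is purely an illustrative sketch, not part of the AI Engine, and the phrase list is an assumption you would tune:

```python
# Hypothetical guard: only honor a model-issued hangup_call when the caller's
# last utterance actually sounds like a goodbye.
FAREWELLS = ("goodbye", "bye", "that's all", "hang up", "talk to you later")

def allow_hangup(last_user_utterance: str) -> bool:
    """Permit hangup_call only after a farewell-like utterance."""
    text = last_user_utterance.lower()
    return any(phrase in text for phrase in FAREWELLS)

def filter_tool_calls(tool_names: list, last_user_utterance: str) -> list:
    """Drop hangup_call unless the caller said goodbye; pass other tools through."""
    return [t for t in tool_names
            if t != "hangup_call" or allow_hangup(last_user_utterance)]
```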
**Docker can't reach Ollama:**

Inside a container, `localhost` refers to the container itself, not the host, so Docker cannot reach an Ollama bound to the host's loopback. Use your host machine's IP:

```yaml
# Wrong - Docker can't reach this
base_url: http://localhost:11434

# Correct - Use your actual IP
base_url: http://192.168.1.100:11434
```

Hardware requirements by model size:

| Model Size | Minimum RAM | Recommended | GPU |
|---|---|---|---|
| 1B params | 4GB | 8GB | Optional |
| 3B params | 8GB | 16GB | Recommended |
| 7B params | 16GB | 32GB | Recommended |
| 13B+ params | 32GB+ | 64GB+ | Required |
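The Docker/localhost pitfall above is easy to catch before startup with a config check. A sketch — `needs_host_ip` is a hypothetical helper, not part of the project:

```python
# Flag base_url values that point at loopback, which a Docker container
# cannot use to reach Ollama running on the host.
from urllib.parse import urlparse

LOOPBACK = {"localhost", "127.0.0.1", "::1"}

def needs_host_ip(base_url: str) -> bool:
    """True if base_url targets loopback and must be replaced with the host IP."""
    host = urlparse(base_url).hostname or ""
    return host in LOOPBACK
```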
**Performance tips:**

- Use smaller models for voice (1B-3B is usually sufficient)
- Reduce `max_tokens` (100-200 is plenty for voice responses)
- Keep the model loaded: set `keep_alive: -1` to prevent unloading between calls
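At the API level, `keep_alive` is a field on Ollama's `/api/generate` (and `/api/chat`) requests; a request with no prompt loads the model, and `keep_alive: -1` keeps it resident so calls never wait on a cold load. A sketch of such a warm-up body (the model tag is an example):

```python
# Build a "warm-up" request body for Ollama's /api/generate endpoint.
# With no prompt, Ollama just loads the model; keep_alive: -1 pins it in memory.
import json

def warmup_request(model: str, keep_alive=-1) -> str:
    """Build a request body that loads `model` without generating anything."""
    return json.dumps({"model": model, "keep_alive": keep_alive})

# Send once at startup, e.g.:
#   curl http://192.168.1.100:11434/api/generate -d '{"model": "llama3.2:1b", "keep_alive": -1}'
```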
Ollama automatically uses GPU if available:
- Apple Silicon: Metal acceleration (automatic)
- NVIDIA: CUDA acceleration (requires drivers)
- AMD: ROCm support (Linux only)
Alternatively, configure the provider from the web UI:

1. Go to the Providers page
2. Click Add Provider
3. Select type: Ollama
4. Enter your Ollama server URL
5. Click Test Connection to verify and list models
6. Select a model from the dropdown
7. Save and restart the AI Engine
Example configurations:

**Minimal (low-resource hardware):**

```yaml
pipelines:
  local_ollama:
    stt: local_stt
    llm: ollama_llm
    tts: local_tts
    options:
      llm:
        base_url: http://192.168.1.100:11434
        model: llama3.2:1b
        max_tokens: 100
        timeout_sec: 120
```

**Full-featured (with tool calling):**

```yaml
pipelines:
  local_ollama:
    stt: local_stt
    llm: ollama_llm
    tts: local_tts
    tools:
      - hangup_call
      - transfer
      - send_email_summary
      - request_transcript
    options:
      llm:
        base_url: http://192.168.1.100:11434
        model: llama3.2
        temperature: 0.7
        max_tokens: 200
        timeout_sec: 60
        tools_enabled: true
```

**Multilingual:**

```yaml
pipelines:
  local_ollama:
    stt: local_stt
    llm: ollama_llm
    tts: local_tts
    options:
      llm:
        base_url: http://192.168.1.100:11434
        model: qwen2.5  # Good multilingual support
        temperature: 0.7
```