Community Test Matrix — Local AI Server

Help us build the definitive reference for what works best when running AAVA fully locally. Submit your results via a GitHub issue or a PR against this file.

How to Contribute

Automated (recommended)

After making a test call with the local provider, run:

agent rca --local
# or directly:
python3 scripts/local_test_report.py

This auto-detects your hardware, queries the Local AI Server for model info, parses Docker logs for latency, and outputs a ready-to-paste submission template. Add `--json` for machine-readable output.

Manual

Run a test call using a Local AI Server configuration (any STT + TTS + LLM combination).
Record your results using the template below or the GitHub issue template.
Submit a PR adding a row to the results table, or open an issue with the community-test label.

What to Measure

STT Latency: Time from end of speech to transcript appearing in logs.
LLM Latency: Time from transcript to first LLM token (check `local_ai_server` logs for `[LLM]` timing).
TTS Latency: Time from LLM response to first audio byte.
End-to-End: Perceived time from user stops speaking to hearing the AI reply.
Call Quality: Subjective 1-5 rating (1 = unusable, 5 = indistinguishable from cloud).
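
The definitions above can be turned into a small calculation once you have the relevant timestamps for a turn. The following is a minimal sketch, not part of the project's tooling: the `TurnTimestamps` class, its field names, and `latency_breakdown_ms` are hypothetical, and the end-to-end figure is approximated from log timestamps, so it leaves out network and playout delay that a caller would also perceive.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Timestamps in seconds for one turn (hypothetical field names)."""
    speech_end: float        # user stops speaking
    transcript: float        # STT transcript appears in logs
    first_llm_token: float   # first LLM token is produced
    llm_response: float      # LLM response is handed to TTS
    first_audio_byte: float  # first TTS audio byte is produced


def latency_breakdown_ms(t: TurnTimestamps) -> dict:
    """Compute per-stage latencies in milliseconds from one turn's timestamps."""
    return {
        # STT latency: end of speech to transcript
        "stt_ms": round((t.transcript - t.speech_end) * 1000),
        # LLM latency: transcript to first LLM token
        "llm_ms": round((t.first_llm_token - t.transcript) * 1000),
        # TTS latency: LLM response to first audio byte
        "tts_ms": round((t.first_audio_byte - t.llm_response) * 1000),
        # End-to-end approximation: end of speech to first audio byte
        "e2e_ms": round((t.first_audio_byte - t.speech_end) * 1000),
    }


# Illustrative values only; these are not measurements from any real call.
example = TurnTimestamps(10.000, 10.120, 10.380, 10.500, 10.660)
print(latency_breakdown_ms(example))
```

Note that the stage latencies do not have to add up to the end-to-end value, because the time between the first LLM token and the completed LLM response is not counted in any single stage.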

Backend Compatibility Quick Reference

Backend	Type	CPU	GPU	Build Arg	Approx Size	Notes
Vosk	STT	Good	No benefit	`INCLUDE_VOSK=true` (default)	50-200 MB	Best CPU STT; real-time streaming
Sherpa-ONNX	STT	Good	No benefit	`INCLUDE_SHERPA=true` (default)	30-150 MB	Streaming; good multi-language
Kroko Cloud	STT	Yes	Yes	N/A	0	Requires API key at kroko.ai
Kroko Embedded	STT	Yes	Yes	`INCLUDE_KROKO_EMBEDDED=true`	~100 MB	Self-hosted ONNX server
Faster-Whisper	STT	Slow	Recommended	`INCLUDE_FASTER_WHISPER=true`	75-3000 MB	Auto-downloads from HuggingFace
Whisper.cpp	STT	Slow	Good	`INCLUDE_WHISPER_CPP=true`	75-3000 MB	Manual model download
Piper	TTS	Good	No benefit	`INCLUDE_PIPER=true` (default)	15-60 MB	Best CPU TTS; ONNX voices
Kokoro	TTS	OK	Better	`INCLUDE_KOKORO=true` (default)	~200 MB	Higher quality; multi-voice
MeloTTS	TTS	OK	Better	`INCLUDE_MELOTTS=true`	~500 MB	Multi-accent English
llama.cpp	LLM	Not recommended	Required	`INCLUDE_LLAMA=true` (default)	2-8 GB	CPU: 10-30s/response

Community Results

Legend

E2E: End-to-end perceived latency (user stops speaking → hears reply)
Quality: Subjective 1-5 (1=unusable, 3=usable, 5=cloud-quality)
Transport: em = ExternalMedia RTP, as = AudioSocket
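
Before adding a row, it can help to check the two coded fields from the legend. This is a small sketch under stated assumptions: `check_legend_fields` and `TRANSPORT_CODES` are hypothetical names for illustration and do not exist in the repository.

```python
# Transport codes as defined in the legend above.
TRANSPORT_CODES = {"em": "ExternalMedia RTP", "as": "AudioSocket"}


def check_legend_fields(quality, transport):
    """Return a list of problems with the Quality and Transport values (empty if none)."""
    problems = []
    if not isinstance(quality, int) or not 1 <= quality <= 5:
        problems.append(f"Quality must be an integer from 1 to 5, got {quality!r}")
    if transport not in TRANSPORT_CODES:
        problems.append(
            f"Transport must be one of {sorted(TRANSPORT_CODES)}, got {transport!r}"
        )
    return problems


print(check_legend_fields(4, "em"))   # valid values
print(check_legend_fields(6, "rtp"))  # both values out of range
```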

Results Table

Date	Contributor	Hardware	GPU	STT Backend	STT Model	TTS Backend	TTS Voice	LLM Model	LLM Context	Transport	E2E Latency	Quality	Notes
2025-07-14	@maintainer	Vast.ai A100 40GB	A100	vosk	en-us-0.22	piper	lessac-medium	phi-3-mini Q4_K_M	2048	em	~2s	3	Baseline GPU test
2025-07-14	@maintainer	Vast.ai A100 40GB	A100	faster_whisper	base	kokoro	af_heart	phi-3-mini Q4_K_M	2048	em	~1.5s	4	Whisper + Kokoro combo
2026-02-22	@hkjarral	AMD EPYC 7443P, 66GB RAM	RTX 4090 24GB	faster_whisper	base	kokoro	af_heart	phi-3-mini-4k-instruct.Q4_K_M.gguf	4096	em	~665ms	4	Phi-3 tool calls can be malformed/truncated; use `LOCAL_TOOL_CALL_POLICY=auto` and keep `LOCAL_TOOL_GATEWAY_ENABLED=true` for structured full-local tool normalization
2026-02-23	@hkjarral	AMD EPYC 7443P, 66GB RAM	RTX 4090 24GB	faster_whisper	base	kokoro	af_heart	Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf	4096	em	~1.0s	5	Call `1771817082.317`: structured gateway + repair cleanly executed `hangup_call` on polite close (`Thank you.`), no tool-chatter leaked to spoken output, and post-call webhook executed successfully
2026-02-27	@hkjarral	AMD EPYC 7713, 98GB RAM	RTX 4090 24GB	kroko	embedded (en-US)	kokoro	af_heart	Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf	2048	em	~1.2s	5	Call `1772234505.97`: end-to-end local + GPU offload; barge-in and `hangup_call` succeeded (tool gateway `tool_path=heuristic`)
2026-02-27	@hkjarral	AMD EPYC 7713, 98GB RAM	RTX 4090 24GB	whisper_cpp	unknown	kokoro	af_heart	Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf	2048	em	~1.1s	2	Call `1772235703.109`: low coherence (telephony STT felt “not hearing”); transcripts arrived as short fragments; call ended without `hangup_call`

Detailed Submissions

**Date**: 2026-02-22
**Hardware**: AMD EPYC 7443P 24-Core Processor, 66GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1
**STT**: faster_whisper / Faster-Whisper (base)
**TTS**: kokoro / Kokoro (af_heart, mode=hf)
**LLM**: phi-3-mini-4k-instruct.Q4_K_M.gguf / n_ctx=4096
**LLM GPU Layers**: -1
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~665ms
**LLM Latency**: ~261ms avg (8 samples, last=265ms)
**STT Transcripts (last session)**: 9
**TTS Responses (last session)**: 9
**Quality (1-5)**: 4
**Notes**: Tool calls do not work reliably; use heuristic-based hangup for phi-3 (malformed/truncated tool-call markup observed).
**Tool Calls**:
  ⚠️ hangup_call: 2 attempted (not executed) [llm_markup]
  ✅ demo_post_call_webhook: 1 executed [post_call]

**Date**: 2026-02-23
**Hardware**: AMD EPYC 7443P 24-Core Processor, 66GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1
**STT**: faster_whisper / Faster-Whisper (base)
**TTS**: kokoro / Kokoro (af_heart, mode=hf)
**LLM**: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf / n_ctx=4096
**LLM GPU Layers**: -1
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~1.0s
**LLM Latency**: ~534ms avg (8 samples, last=612ms)
**STT Transcripts (last session)**: 8
**TTS Responses (last session)**: 10
**Quality (1-5)**: 5
**Notes**: Call `1771817082.317` was clean end-to-end. Local logs show that the strict structured tool gateway, after a repair-path handoff (`tool_path=repair`), produced a valid `hangup_call` on user close intent (`Thank you.`); the engine executed a single hangup path, and no tool-execution chatter leaked into the final spoken text.
**Tool Calls**:
  ✅ hangup_call: 1 executed [local_llm]
  ✅ demo_post_call_webhook: 1 executed [post_call]

**Date**: 2026-02-23
**Hardware**: AMD EPYC 7443P 24-Core Processor, 66GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1
**STT**: kroko / Kroko (embedded, port 6006)
**TTS**: kokoro / Kokoro (af_heart, mode=hf)
**LLM**: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf / n_ctx=4096
**LLM GPU Layers**: -1
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~1.0s
**LLM Latency**: ~541ms avg (16 samples, last=625ms)
**STT Transcripts (last session)**: 35
**TTS Responses (last session)**: 19
**Quality (1-5)**: not provided
**Notes**: Natural voice quality
**Tool Calls**:
  ⚠️ hangup_call: 2 executed, 1 blocked [guardrail, local_llm]
  ✅ demo_post_call_webhook: 2 executed [post_call]

**Date**: 2026-02-23
**Hardware**: AMD EPYC 7443P 24-Core Processor, 66GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1
**STT**: kroko / Kroko (embedded, port 6006)
**TTS**: melotts / MeloTTS (EN-US)
**LLM**: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf / n_ctx=4096
**LLM GPU Layers**: -1
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~1.0s
**LLM Latency**: ~546ms avg (18 samples, last=608ms)
**STT Transcripts (last session)**: 52
**TTS Responses (last session)**: 22
**Quality (1-5)**: not provided
**Notes**: Start of the conversation is slow but then it picks up
**Tool Calls**:
  ✅ hangup_call: 2 executed [local_llm]
  ✅ demo_post_call_webhook: 2 executed [post_call]

**Date**: 2026-02-27
**Hardware**: AMD EPYC 7713 64-Core Processor, 98GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1 (Compose v5.1.0)
**STT**: kroko / Kroko (embedded, en-US)
**TTS**: kokoro / Kokoro (af_heart, mode=local)
**LLM**: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf / n_ctx=2048
**LLM GPU Layers**: 50 (runtime-selected)
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~1.2s (avg_turn_latency_ms=1172, max_turn_latency_ms=1846)
**Quality (1-5)**: 5
**Notes**: Call `1772234505.97` was clean end-to-end; `hangup_call` executed successfully via tool gateway fast-path heuristic.
**Tool Calls**:
  ✅ hangup_call: 1 executed [tool_path=heuristic]

**Date**: 2026-02-27
**Hardware**: AMD EPYC 7713 64-Core Processor, 98GB RAM
**GPU**: NVIDIA GeForce RTX 4090 24GB
**OS**: Ubuntu 22.04.5 LTS
**Docker**: 29.2.1
**STT**: whisper_cpp / Whisper.cpp
**TTS**: kokoro / Kokoro (af_heart, mode=local)
**LLM**: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf / n_ctx=2048
**LLM GPU Layers**: -1
**Transport**: ExternalMedia RTP
**Pipeline**: local
**Runtime Mode**: full
**E2E Latency**: ~1.1s (avg_turn_latency_ms=1077, max_turn_latency_ms=1888)
**Quality (1-5)**: 2
**Notes**: Call `1772235703.109`: low coherence; Whisper.cpp emitted short transcript fragments (e.g., split utterances) and did not feel as robust as Faster-Whisper in this telephony setup.
**Tool Calls**:
  ✅ demo_post_call_webhook: 1 executed [post_call]

Comparative Summary (2026-02-23 RTX 4090)

LLM latency stability: Average LLM latency is nearly identical across the Kokoro and MeloTTS runs, both of which used Kroko embedded STT (~541ms over 16 samples vs ~546ms over 18 samples). The last recorded samples are close as well (625ms vs 608ms).
TTS behavior: The Kokoro run notes “Natural voice quality”; the MeloTTS run notes “Start of the conversation is slow but then it picks up”, which suggests warm-up, caching, or first-utterance overhead.
Guardrails: The Kokoro run had one `hangup_call` blocked in addition to two executed ([guardrail, local_llm]), while in the MeloTTS run both `hangup_call` attempts executed.
Throughput: The MeloTTS run logged more STT transcripts (52 vs 35) and more TTS responses (22 vs 19) within the logged session. Average LLM latency did not degrade with the larger volume, which is consistent with stable steady-state performance over a longer session.
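
The comparison can be quantified with simple arithmetic on the figures reported in the two 2026-02-23 Kroko submissions. The sketch below is illustrative only: the helper name `relative_diff_pct` is hypothetical. The submissions do not record session duration, so the counts cannot be turned into a per-minute throughput.

```python
# Figures copied from the two 2026-02-23 Kroko (embedded) submissions above.
runs = {
    "kokoro": {"llm_avg_ms": 541, "stt_transcripts": 35, "tts_responses": 19},
    "melotts": {"llm_avg_ms": 546, "stt_transcripts": 52, "tts_responses": 22},
}


def relative_diff_pct(a, b):
    """Absolute difference between a and b as a percentage of their mean."""
    return abs(a - b) / ((a + b) / 2) * 100


llm_gap = relative_diff_pct(
    runs["kokoro"]["llm_avg_ms"], runs["melotts"]["llm_avg_ms"]
)
print(f"Average LLM latency differs by {llm_gap:.1f}%")

for name, run in runs.items():
    # TTS responses per STT transcript within the logged session
    ratio = run["tts_responses"] / run["stt_transcripts"]
    print(f"{name}: {ratio:.2f} TTS responses per STT transcript")
```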

Submission Template

Use this when adding a row or opening an issue:

**Date**: YYYY-MM-DD
**Hardware**: e.g., "Ryzen 7 5800X, 32GB RAM" or "Vast.ai RTX 4090 24GB"
**GPU**: e.g., "RTX 4090 24GB" or "None (CPU only)"
**STT**: Backend + model (e.g., "vosk / en-us-0.22" or "faster_whisper / base")
**TTS**: Backend + voice (e.g., "piper / lessac-medium" or "kokoro / af_heart")
**LLM**: Model + context (e.g., "phi-3-mini Q4_K_M / n_ctx=2048") or "Cloud (GPT-4o)"
**Transport**: ExternalMedia RTP or AudioSocket
**Pipeline**: local_only / local_hybrid / other
**E2E Latency**: Approximate (e.g., "~2s", "3-5s")
**Quality (1-5)**: Your rating
**Notes**: Any observations (echo issues, model switching behavior, etc.)
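
If you collect results by hand, the template can be filled in programmatically. This is a minimal sketch and not the project's `scripts/local_test_report.py`: the `FIELDS` list and `render_submission` function are hypothetical, and the sample values are placeholders taken from the examples in the template, not a real test result.

```python
# Field labels in the order used by the submission template above.
FIELDS = [
    "Date", "Hardware", "GPU", "STT", "TTS", "LLM",
    "Transport", "Pipeline", "E2E Latency", "Quality (1-5)", "Notes",
]


def render_submission(values):
    """Render a dict of field values as template lines; raise if any field is missing."""
    missing = [field for field in FIELDS if field not in values]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")
    return "\n".join(f"**{field}**: {values[field]}" for field in FIELDS)


# Placeholder values only; replace every entry with your own measurements.
sample = {
    "Date": "YYYY-MM-DD",
    "Hardware": "Ryzen 7 5800X, 32GB RAM",
    "GPU": "None (CPU only)",
    "STT": "vosk / en-us-0.22",
    "TTS": "piper / lessac-medium",
    "LLM": "Cloud (GPT-4o)",
    "Transport": "AudioSocket",
    "Pipeline": "local_hybrid",
    "E2E Latency": "~2s",
    "Quality (1-5)": 3,
    "Notes": "Placeholder values, not a real result",
}
print(render_submission(sample))
```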

FAQ

Q: How do I measure latency? Set `LOCAL_LOG_LEVEL=DEBUG` and check timestamps in `docker compose logs local_ai_server`. Look for:

STT result → transcript timestamp
LLM response → first token timestamp
TTS audio → first byte timestamp
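
The gaps between those three log events can be computed from timestamps in the log lines. The sketch below makes assumptions that you should check against your own logs. It assumes each line carries an ISO 8601 timestamp (for example, as added by `docker compose logs --timestamps`) and that the marker strings `STT result`, `LLM response`, and `TTS audio` appear literally in the line. The sample lines are hypothetical and do not reproduce actual Local AI Server output.

```python
import re
from datetime import datetime, timedelta

# Matches an ISO 8601 date-time with an optional fractional-second part.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?")
MARKERS = ("STT result", "LLM response", "TTS audio")


def parse_ts(line):
    """Return the first timestamp found in the line, or None if there is none."""
    match = TS_RE.search(line)
    if not match:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # Truncate or pad the fraction to microseconds (nanosecond stamps are longer).
    fraction = (match.group(2) or "0")[:6].ljust(6, "0")
    return base.replace(microsecond=int(fraction))


def first_marker_times(lines):
    """Map each marker to the timestamp of the first line that contains it."""
    found = {}
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        for marker in MARKERS:
            if marker in line and marker not in found:
                found[marker] = ts
    return found


def stage_gaps_ms(found):
    """Whole-millisecond gaps between the STT, LLM, and TTS events."""
    stt, llm, tts = (found[marker] for marker in MARKERS)
    one_ms = timedelta(milliseconds=1)
    return {
        "stt_to_llm_ms": (llm - stt) // one_ms,
        "llm_to_tts_ms": (tts - llm) // one_ms,
    }


# Hypothetical log lines for illustration only.
sample_lines = [
    "local_ai_server-1  | 2026-02-22T10:00:00.120000000Z STT result: hello",
    "local_ai_server-1  | 2026-02-22T10:00:00.381000000Z LLM response: first token",
    "local_ai_server-1  | 2026-02-22T10:00:00.540000000Z TTS audio: first byte",
]
print(stage_gaps_ms(first_marker_times(sample_lines)))
```

Only the first occurrence of each marker is used, so for a multi-turn call you would need to split the log per turn before applying this.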

Q: What pipeline should I use?

local_only: All local (STT + LLM + TTS). Requires GPU for usable LLM latency.
local_hybrid: Local STT + TTS, cloud LLM (e.g., GPT-4o). Best quality on CPU.

Q: Can I test from a different machine? Yes. Set up split-host mode; see `docs/LOCAL_ONLY_SETUP.md` for details on configuring `LOCAL_WS_HOST=0.0.0.0` with `LOCAL_WS_AUTH_TOKEN`.
