This project turns a Raspberry Pi 5 into a voice-only assistant (Jarvis-style), powered by local open-source models.
No OpenAI API or cloud dependency required.
## Features

- 🎙️ **Voice input with `arecord`**
  - Works with USB mics/speakerphones (tested on a Poly Sync 10).
  - Prime + retry logic ensures reliable capture.
- 🧠 **Speech-to-Text (STT) via whisper.cpp**
  - Converts captured audio into text locally.
- 🤖 **Local LLM with llama.cpp**
  - Runs models like `qwen2.5-7b-instruct` in server mode.
  - Reachable via REST API (`http://jarvispi.local:8080`).
- 🔊 **Text-to-Speech (TTS) via piper**
  - Natural-sounding voices, runs fully on-device.
  - Patched to handle sample-rate quirks.
  - Primes the audio device and prepends silence so the first word isn't clipped.
- ⚡ **Configurable via environment variables** (mic device, sample rate, TTS timings, etc.)
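The clipped-first-word fix boils down to writing a stretch of zero samples in front of the synthesized audio. A minimal sketch using Python's `wave` module (the `pad_wav` helper is hypothetical and only illustrates the `TTS_PAD_MS` idea; it is not the actual `ask_voice.py` code):

```python
import wave

def pad_wav(src_path: str, dst_path: str, pad_ms: int = 400) -> None:
    """Prepend pad_ms of silence so playback start-up doesn't clip speech.

    Hypothetical helper illustrating the TTS_PAD_MS fix; the 400 ms
    default mirrors the configuration table.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # Silence = zero-valued samples: rate * (ms / 1000) frames,
    # each frame sampwidth * nchannels bytes wide.
    pad_frames = int(params.framerate * pad_ms / 1000)
    silence = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames)
```

Combined with priming the device (`TTS_PRIME_SEC`), the ALSA output is already streaming before real audio arrives.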
## Pipeline

- Record audio → USB mic with `arecord`
- Transcribe → Whisper model (tiny/small recommended on the RPi5)
- LLM completion → POST transcript to `llama-server`
- TTS synthesis → Piper generates speech
- Playback → audio routed back to the speakerphone
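The audio legs of the loop map onto a few standard CLI invocations. A minimal sketch (the wrapper names are made up for illustration, paths and devices are the defaults from the configuration table, and the LLM completion step is omitted here):

```python
import subprocess

# Defaults from the configuration table (the real script reads them from env).
MIC_DEV, SPEAKER_DEV = "hw:2,0", "plughw:2,0"
WHISPER_BIN = "/srv/asr/whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL = "/srv/asr/whisper.cpp/models/ggml-small.en.bin"
PIPER_BIN = "/srv/tts/piper/build/piper"
PIPER_MODEL = "/srv/tts/voices/en_US-amy-medium.onnx"

def arecord_cmd(dev: str, rate: int, seconds: int, path: str) -> list:
    # 16-bit little-endian capture at the configured rate and duration
    return ["arecord", "-D", dev, "-f", "S16_LE",
            "-r", str(rate), "-d", str(seconds), path]

def transcribe(wav: str) -> str:
    # whisper-cli prints the transcript on stdout; -nt drops timestamps
    out = subprocess.run([WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav, "-nt"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def speak(text: str, wav: str = "reply.wav") -> None:
    # piper reads text on stdin and writes a WAV; aplay plays it back
    subprocess.run([PIPER_BIN, "--model", PIPER_MODEL, "--output_file", wav],
                   input=text, text=True, check=True)
    subprocess.run(["aplay", "-D", SPEAKER_DEV, wav], check=True)

if __name__ == "__main__":
    subprocess.run(arecord_cmd(MIC_DEV, 16000, 5, "utt.wav"), check=True)
    reply = transcribe("utt.wav")  # LLM completion step omitted in this sketch
    speak(reply)
```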
## Setup

1. **Clone repos & build:**
   - `llama.cpp` (local LLM)
   - `whisper.cpp` (STT)
   - `piper` (TTS)
2. **Run llama-server** (systemd unit provided), then check that it is up:
   ```
   curl http://jarvispi.local:8080/health
   ```
3. **Set up voices (Piper).** Place a voice model and its config side by side:
   ```
   # example
   voices/en_US-amy-medium.onnx
   voices/en_US-amy-medium.onnx.json
   ```
4. **Symlink espeak-ng data** (fixes the missing `phontab` error):
   ```
   sudo ln -s /usr/lib/aarch64-linux-gnu/espeak-ng-data /usr/share/espeak-ng-data
   ```
5. **Run the voice loop:**
   ```
   ./ask_voice.py
   ```
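Once llama-server is healthy, the LLM step is a plain HTTP POST. A minimal client sketch (the `ask` and `completion_body` helper names are made up; the request/response fields — `prompt` and `n_predict` in, `content` out — follow llama.cpp's built-in HTTP server):

```python
import json
import urllib.request

LLAMA_HOST = "http://127.0.0.1:8080"  # or http://jarvispi.local:8080 over the LAN

def completion_body(prompt: str, n_predict: int = 256) -> bytes:
    # llama-server's /completion endpoint accepts a JSON body with
    # "prompt" and "n_predict" fields.
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def ask(prompt: str) -> str:
    """POST a transcript to llama-server and return the generated text."""
    req = urllib.request.Request(
        LLAMA_HOST + "/completion",
        data=completion_body(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```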
## Configuration

| Variable | Default | Purpose |
|---|---|---|
| `MIC_DEV` | `hw:2,0` | ALSA device for mic capture |
| `RATE` | `16000` | Capture sample rate |
| `SECONDS_TO_RECORD` | `5` | Recording duration |
| `WHISPER_BIN` | `/srv/asr/whisper.cpp/build/bin/whisper-cli` | Path to whisper binary |
| `WHISPER_MODEL` | `/srv/asr/whisper.cpp/models/ggml-small.en.bin` | STT model |
| `LLAMA_HOST` | `http://127.0.0.1:8080` | llama-server REST endpoint |
| `N_PREDICT` | `256` | Tokens to predict |
| `PIPER_BIN` | `/srv/tts/piper/build/piper` | Path to piper binary |
| `PIPER_MODEL` | `/srv/tts/voices/en_US-amy-medium.onnx` | TTS voice model |
| `PIPER_CONFIG` | `$PIPER_MODEL.json` | Voice config JSON |
| `SPEAKER_DEV` | `plughw:2,0` | ALSA device for playback |
| `TTS_PAD_MS` | `400` | Silence (ms) prepended to the WAV |
| `TTS_PRIME_SEC` | `0.30` | Duration of ALSA `/dev/zero` prime |
| `TTS_SETTLE_SEC` | `0.20` | Wait after prime before playback |
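All of these knobs are plain environment variables, so they can be overridden per run (e.g. `MIC_DEV=hw:1,0 ./ask_voice.py`). A sketch of how such lookups with defaults work in the stdlib (illustrative, not the actual `ask_voice.py` code):

```python
import os

# Defaults mirror the configuration table; any variable set in the
# environment takes precedence.
MIC_DEV = os.environ.get("MIC_DEV", "hw:2,0")
RATE = int(os.environ.get("RATE", "16000"))
SECONDS_TO_RECORD = int(os.environ.get("SECONDS_TO_RECORD", "5"))
TTS_PAD_MS = int(os.environ.get("TTS_PAD_MS", "400"))
PIPER_MODEL = os.environ.get("PIPER_MODEL",
                             "/srv/tts/voices/en_US-amy-medium.onnx")
# PIPER_CONFIG defaults to the model path plus ".json" ($PIPER_MODEL.json)
PIPER_CONFIG = os.environ.get("PIPER_CONFIG", PIPER_MODEL + ".json")
```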
## Status

✅ End-to-end local pipeline (record → Whisper → llama-server → Piper → speaker)
✅ No cloud dependencies (fully offline)
✅ Fix for clipped first words in audio playback
✅ Configurable timings for different hardware
✅ Web access to the LLM via `http://jarvispi.local:8080`
## Roadmap

- Add wake-word detection (e.g. with openWakeWord)
- Continuous VAD-based listening
- Auto-start `ask_voice.py` as a systemd service
- Explore lightweight voices & smaller STT models for faster runtime
## License

MIT — free to use, hack, and extend.