Ares is a fully local voice assistant that combines:
- 🎤 Whisper.cpp for speech-to-text (STT)
- 🧠 Llama.cpp for natural language processing (LLM)
- 🔊 Piper for text-to-speech (TTS)
- 👂 OpenWakeWord for wake word detection ("Hey Ares")
Everything runs offline — no internet is required for processing.
Ares is designed to be modular and extendable to control smart devices or even robots.
✅ Wake Word ("Hey Ares/Ares")
- Powered by OpenWakeWord
- Starts listening only after hearing the wake phrase
✅ Speech-to-Text (STT) (CPU)
- Uses
sounddevice+webrtcvadfor smart recording (stops when you go quiet) - Transcribes audio with
whisper-cli(from Whisper.cpp)
✅ Local Language Model (LLM) (GPU)
- Runs
llama.cppin server mode - Configurable system prompt → "Jarvis"-like personality
✅ Text-to-Speech (TTS)
- Piper HTTP server generates natural-sounding voices
- Multiple voices available (e.g.,
en_US-bryce-medium)
✅ Main Pipeline
Wake Word → Record → Transcribe → Send to LLM → Speak Response
✅ Benchmarking
benchmark_ai.shlogs timings for STT, LLM, and TTS- Results are stored in
latency.md
✅ CI / Mock Mode
- GitHub Actions run Ares in Mock Mode (no audio hardware required)
- Simulates STT, LLM, and TTS responses for automated testing
pip install -r requirements.txtAlso build:
./server/run_servers.shSay "Hey Ares", wait for the beep 🎵, then speak your command. Ares will listen, process locally, and respond with speech.
./docs/benchmark_ai.shllmio/ # Input/output modules
├─ stt_whisper.py # Speech-to-text
├─ tts_piper.py # Text-to-speech
├─ llm_remote.py # LLM client
└─ wake_word.py # Wake word listener
scripts/ # Helper scripts
└─ run_servers.sh # Start LLM and TTS servers
latency.md # Benchmark results
ci_runner.py # CI harness with mocks
main.py # Main application loop
- Custom wake word — trainable per device/user.
- Custom voice — selectable TTS voice profiles.
- Open websites & apps on command
- Bluetooth device control — pair/connect/disconnect and volume controls.
- Visual detection — optional camera input for object/face/basic scene cues.
- Speaker recognition — per-user profiles for personalization/permissions.
- Remote access (multi-device clients)
- Custom hardware build — mic array, LEDs, physical mute, action button.
- User interface for better user experience