🎙️ Real-time Voice Assistant

A minimal, real-time voice assistant powered by WebRTC streaming, featuring ultra-low latency audio processing with professional-grade components.

✨ Features

  • 🌊 WebRTC Streaming: Ultra-low latency audio streaming via FastRTC
  • 🎯 Advanced VAD: Silero voice activity detection with configurable parameters
  • 🎤 Speech Recognition: Whisper ASR for accurate speech-to-text transcription
  • 🔊 Neural TTS: High-quality Kokoro text-to-speech with natural-sounding voices
  • 🤖 AI Integration: Support for both Ollama (local) and OpenRouter GPT-5 Nano
  • 💬 Context-Aware: Maintains conversation history for coherent responses

🏗️ Architecture

┌─────────────┐    WebRTC      ┌──────────────────┐
│   Browser   │◄──────────────►│   FastRTC        │
│  (Client)   │   Ultra-low    │   Stream         │
└─────────────┘   latency      └──────────────────┘
                                        │
                                        ▼
                                ┌──────────────────┐
                                │  Silero VAD      │
                                │  (Voice Activity)│
                                └──────────────────┘
                                        │
                                        ▼
                                ┌──────────────────┐
                                │  Whisper ASR     │
                                │  (Speech-to-Text)│
                                └──────────────────┘
                                        │
                                        ▼
                                ┌──────────────────┐
                                │  LLM Processing  │
                                │  (Ollama/OpenAI) │
                                └──────────────────┘
                                        │
                                        ▼
                                ┌──────────────────┐
                                │  Kokoro TTS      │
                                │  (Text-to-Speech)│
                                └──────────────────┘
                                        │
                                        ▼
                                ┌──────────────────┐
                                │  Audio Stream    │
                                │  (Back to client)│
                                └──────────────────┘
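Conceptually, each detected utterance flows through the stages above as a simple chain of transforms. A minimal sketch with stand-in stages (the class and function names here are illustrative, not the project's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    history: list = field(default_factory=list)  # alternating user/assistant turns

def handle_utterance(audio, conv, *, vad, asr, llm, tts):
    """Run one detected utterance through the VAD -> ASR -> LLM -> TTS chain."""
    if not vad(audio):                 # Silero VAD: is this really speech?
        return None
    text = asr(audio)                  # Whisper: speech -> text
    conv.history.append({"role": "user", "content": text})
    reply = llm(conv.history)          # Ollama / OpenRouter: history -> reply
    conv.history.append({"role": "assistant", "content": reply})
    return tts(reply)                  # Kokoro: reply -> audio samples

# Stubbed stages to show the data flow:
conv = Conversation()
out = handle_utterance(
    b"...pcm...", conv,
    vad=lambda a: True,
    asr=lambda a: "hello",
    llm=lambda h: "hi there",
    tts=lambda t: f"<audio:{t}>",
)
```

In the real application, FastRTC delivers the audio frames and the stages are backed by Silero, Whisper, the configured LLM, and Kokoro.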

📋 Requirements

  • Python: 3.12 or higher
  • uv: Modern Python package manager
  • System Dependencies:
    • Build tools (gcc/g++)
    • Python development headers
    • espeak-ng for phonemization
  • AI Backend: Either:
    • Ollama with gemma3:4b model (local), OR
    • OpenRouter API key for GPT-5 Nano (cloud)
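A quick runtime guard for the Python version requirement (an illustrative helper, not part of the project):

```python
import sys

def check_python(min_version=(3, 12)):
    """Return True if the running interpreter meets the project's minimum."""
    return sys.version_info[:2] >= min_version
```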

🚀 Quick Setup

1. Clone the Repository

git clone https://github.com/voidstarr/minimal-voice-assistant.git
cd minimal-voice-assistant

2. Run Setup Script

The setup script will automatically:

  • Install uv if not present
  • Install system dependencies (gcc, python3-dev, espeak-ng)
  • Create a virtual environment
  • Install all Python dependencies
  • Generate SSL certificates for HTTPS

chmod +x setup.sh
./setup.sh

3. Configure LLM Backend

Choose ONE of the following options:

Option A: Local Ollama (Recommended for Privacy)

  1. Install Ollama from ollama.ai
  2. Pull the required model:
    ollama pull gemma3:4b

Option B: OpenRouter Cloud API

  1. Get your API key from OpenRouter
  2. Create a .env file:
    echo "OPENROUTER_API_KEY=your_api_key_here" > .env
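If you load the key in Python rather than relying on your shell, a minimal reader might look like this (a sketch; tools such as python-dotenv populate os.environ from the .env file the same way):

```python
import os

def get_openrouter_key():
    """Read the OpenRouter key from the environment, failing loudly if absent."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; see the .env setup above")
    return key
```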

4. Run the Assistant

chmod +x run.sh
./run.sh

The assistant will be available at https://localhost:7860

🎯 Usage

  1. Open the Interface: Navigate to https://localhost:7860 in your browser

    • Accept the self-signed certificate warning (this is expected for local development)
  2. Grant Microphone Access: Allow browser microphone permissions when prompted

  3. Start Speaking: The assistant will automatically detect when you start and stop speaking using advanced VAD

  4. Receive Responses: The AI will process your speech and respond with natural voice

⚙️ Configuration

Voice Activity Detection (VAD)

Edit voice_assistant.py to adjust VAD parameters:

self.vad_options = SileroVadOptions(
    threshold=0.5,                    # Speech detection sensitivity
    min_speech_duration_ms=250,       # Minimum speech length
    max_speech_duration_s=30.0,       # Maximum speech length
    min_silence_duration_ms=500,      # Silence before processing
    window_size_samples=1024,         # VAD processing window
    speech_pad_ms=200                 # Padding around speech
)
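For intuition, the millisecond-based parameters correspond to sample counts at the VAD's operating rate (Silero models conventionally run at 16 kHz; a rough sketch):

```python
SAMPLE_RATE = 16_000  # Hz; Silero VAD's usual operating rate

def ms_to_samples(ms, rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a sample count at the given rate."""
    return int(rate * ms / 1000)

min_speech = ms_to_samples(250)   # 4000 samples of speech before a segment counts
min_silence = ms_to_samples(500)  # 8000 samples of silence ends the segment
pad = ms_to_samples(200)          # 3200 samples kept on each side of a segment
```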

TTS Voice

Change the voice in the generate_tts method:

samples, sample_rate = self.kokoro.create(
    text, voice="af_heart", speed=1.0, lang="en-us"
)

Available voices depend on your Kokoro model configuration.
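To inspect the generated audio offline, the returned samples can be written to a WAV file with only the standard library (a sketch assuming mono float samples in [-1, 1]; `save_wav` is not part of the project):

```python
import struct
import wave

def save_wav(samples, sample_rate, path):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit PCM
        f.setframerate(sample_rate)
        clamped = (max(-1.0, min(1.0, s)) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clamped))
```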

📁 Project Structure

minimal-voice-assistant/
├── voice_assistant.py      # Main application
├── pyproject.toml          # Project dependencies
├── setup.sh                # Setup script
├── run.sh                  # Run script
├── models/                 # TTS models
│   └── kokoro-v1.0.onnx   # Kokoro TTS model
├── ssl_certs/              # SSL certificates
│   ├── cert.pem
│   └── key.pem
└── README.md               # This file

🔧 Troubleshooting

SSL Certificate Warnings

Issue: Browser shows security warning

Solution: This is expected for self-signed certificates. Click "Advanced" and "Proceed" to continue. For production, use proper SSL certificates.

Microphone Not Working

Issue: No audio detected

Solution:

  • Ensure HTTPS is enabled (required for WebRTC)
  • Grant microphone permissions in browser
  • Check browser console for errors
  • Verify microphone works in other applications

Ollama Connection Failed

Issue: "Ollama connection failed" error

Solution:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if needed
ollama serve

# Pull the model
ollama pull gemma3:4b
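The same health check can be scripted in Python with only the standard library (`ollama_status` is an illustrative helper, not part of the project):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def ollama_status(base="http://localhost:11434"):
    """Return the list of installed model names, or None if Ollama is unreachable."""
    try:
        with urlopen(f"{base}/api/tags", timeout=2) as resp:
            tags = json.load(resp)
    except (URLError, OSError):
        return None
    return [m["name"] for m in tags.get("models", [])]
```

If the result is None, start the server with `ollama serve`; if `gemma3:4b` is missing from the list, pull it as shown above.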

espeak Not Found

Issue: Phonemizer or espeak errors

Solution: Reinstall system dependencies:

# Ubuntu/Debian
sudo apt-get install espeak-ng

# Fedora/RHEL
sudo dnf install espeak-ng

Model Loading Errors

Issue: TTS or ASR models fail to load

Solution:

  • Ensure models/kokoro-v1.0.onnx is present
  • Check sufficient RAM (4GB+ recommended)
  • Verify Python 3.12+ is installed

🛠️ Development

Manual Installation

If you prefer not to use the setup script:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv

# Activate environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install dependencies
uv pip install -e .

Running Without SSL

For development without HTTPS (note: browsers only allow microphone access over HTTPS or on localhost, so WebRTC audio may not work from other hosts):

# In voice_assistant.py, modify the launch call:
interface.launch(
    server_name="0.0.0.0",
    server_port=7860,
    share=False
)

📊 Performance

  • Latency: ~200-500ms end-to-end (depends on LLM)
  • CPU Usage: Moderate (optimized for CPU-only operation)
  • RAM Usage: ~2-4GB (with models loaded)
  • Network: Minimal (except for OpenRouter API calls)
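To see where the end-to-end latency budget goes, each stage can be wrapped with a simple timer (an illustrative sketch, not project code):

```python
import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000
```

Wrapping the ASR, LLM, and TTS calls this way usually shows the LLM dominating the budget, which is why total latency "depends on LLM".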

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
