The Local-First, High-Fidelity Voice Agent Engine
Velloris is a state-of-the-art framework for creating lifelike, interactive AI agents that run entirely on your local hardware. With three specialized modes, Velloris delivers the right voice AI solution for any use case, from ultra-low-latency conversations to professional-quality content creation.
Key Features:
- Three-Mode Architecture: PersonaPlex-7B real-time S2S (VERIFIED WORKING) + high-fidelity dubbing + creative synthesis
- Production-Ready: End-to-end speech-to-speech conversations, narration, and emotional synthesis
- Real-Time Speech-to-Speech: Full PersonaPlex-7B S2S pipeline working (100ms input → 80ms output on RTX 3080) with 18 voice variants
- Cross-Platform: Windows (NVIDIA CUDA) + macOS (Apple Metal/MPS) + Linux (CPU)
- Optimized: Automatic device detection, lazy loading, mode-based routing (see the routing sketch after this list)
- Privacy First: 100% local processing, no cloud dependencies
- 10 Languages: Multilingual support via Qwen3-TTS
- Ollama Optional: Required only for creative mode (LLM reasoning)
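The lazy loading and mode-based routing called out above keep startup cheap: an engine is only constructed the first time its mode is requested. A minimal sketch of the pattern; the actual orchestrator.py API may differ, and the class and import names here are illustrative:

```python
# Illustrative sketch of lazy, mode-based engine routing; names are hypothetical.
from typing import Callable, Dict


class LazyOrchestrator:
    """Instantiate a voice engine only when its mode is first requested."""

    def __init__(self, factories: Dict[str, Callable[[], object]]) -> None:
        self._factories = factories            # mode name -> engine constructor
        self._engines: Dict[str, object] = {}  # engines built so far

    def get_engine(self, mode: str) -> object:
        if mode not in self._engines:          # first use: build and cache
            self._engines[mode] = self._factories[mode]()
        return self._engines[mode]


# Heavy model imports live inside the factories, so e.g. PersonaPlex-7B
# is never loaded unless --mode realtime is actually used.
def load_personaplex() -> object:
    from engines.personaplex import PersonaPlexEngine  # hypothetical import
    return PersonaPlexEngine()


orchestrator = LazyOrchestrator({"realtime": load_personaplex})
```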
Requirements:
- Python 3.11+ (3.12 recommended)
- For Real-Time Mode: NVIDIA GPU (16GB+ VRAM) + CUDA 12.1+ + Triton (`triton-windows` on Windows)
- For Creative Mode: Ollama running (download from https://ollama.com)
- For Dubbing Mode: GPU recommended (6GB+ VRAM) or CPU
- macOS: Homebrew (for system dependencies)
- Windows/Linux: NVIDIA GPU recommended for best performance
- Note: PersonaPlex-7B S2S requires Triton for torch.compile(); it is installed automatically on supported platforms (see the compile guard sketch below)
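Because Triton is only needed where torch.compile() runs, a typical guard compiles the model when a Triton backend is importable and falls back to eager execution otherwise. A small sketch of that pattern, illustrative rather than Velloris' exact code:

```python
import torch


def maybe_compile(model: torch.nn.Module) -> torch.nn.Module:
    """Use torch.compile() when a Triton backend is present; otherwise run eagerly."""
    try:
        import triton  # noqa: F401  (provided by `triton` or `triton-windows`)
        return torch.compile(model)
    except ImportError:
        return model  # eager fallback, e.g. on platforms without Triton
```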
Installation:
```bash
git clone https://github.com/randsley/Velloris.git
cd Velloris
```
macOS:
```bash
chmod +x install_macos.sh
./install_macos.sh
```
Windows:
```bash
# In PowerShell or Command Prompt
install_windows.bat
```
Linux / WSL2:
```bash
# Install system dependencies
sudo apt-get install -y portaudio19-dev ffmpeg sox libasound2-plugins pulseaudio-utils

# Create virtual environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install requirements
pip install -r requirements-dev.txt
```
WSL2 Note: Velloris auto-detects Ollama running on Windows and routes audio through PulseAudio/WSLg. Configure ALSA to use PulseAudio:
echo -e "pcm.default pulse\nctl.default pulse" > ~/.asoundrcpython3 main.py --show-configβ Status: VERIFIED WORKING - Full S2S inference pipeline on Windows/CUDA (100ms input β 80ms output on RTX 3080)
```bash
python3 main.py --mode realtime --persona "You are a helpful tutor" --voice natural_female_2
```
Features:
- Sub-150ms latency on NVIDIA CUDA (verified on RTX 3080)
- 18 voice variants: Natural (4F/4M) + Varied (5F/5M)
- Full-duplex ready: Infrastructure for natural interruptions
- Persona control: Custom roles via text prompts
- No LLM needed: PersonaPlex-7B handles understanding + reasoning + speech generation
- 24kHz audio: High-quality voice I/O
- Windows support: Works with triton-windows for torch.compile() optimization
Available Voices:
- Natural Female: `natural_female_0`, `natural_female_1`, `natural_female_2`, `natural_female_3`
- Natural Male: `natural_male_0`, `natural_male_1`, `natural_male_2`, `natural_male_3`
- Varied Female: `varied_female_0` through `varied_female_4`
- Varied Male: `varied_male_0` through `varied_male_4`
Professional narration for content creation:
```bash
python3 main.py --mode dubbing --script "Your narration here"
```
- Professional quality (24kHz)
- 10 languages supported
- Voice cloning available
- Best for: Audiobooks, podcasts, video narration
Emotional storytelling with LLM reasoning:
Start Ollama (if not running):
```bash
ollama serve
ollama pull llama3  # First time only
```
Run Velloris (interactive prompt):
```bash
python3 main.py --mode creative --emotion "Speak with excitement"
```
Type your prompts and Velloris responds with emotionally expressive speech.
WSL2: Ollama on Windows is auto-detected; no extra configuration needed.
- LLM reasoning (Ollama)
- Emotion control
- Multilingual
- Best for: Storytelling, creative content
```
Velloris/
├── core/                  # Brain & Orchestration
│   ├── brain.py           # LLM integration + audio synthesis
│   └── orchestrator.py    # Engine routing & lazy loading
├── engines/               # Voice Models
│   ├── personaplex.py     # NVIDIA PersonaPlex-7B (S2S)
│   ├── qwen_tts.py        # Alibaba Qwen3-TTS (TTS)
│   └── mlx_tts.py         # MLX-Audio for Apple Silicon
├── utils/                 # Utilities
│   ├── audio_io.py        # Audio playback & recording
│   ├── audio_utils.py     # Resampling & normalization
│   ├── device_utils.py    # Device detection (CUDA/MPS/CPU)
│   └── vad_handler.py     # Voice Activity Detection
├── tests/                 # Test Suite (99 tests: 93 passing, 6 skipped)
│   ├── test_pipeline.py            # Integration tests (22 tests)
│   ├── test_critical_paths.py      # Critical path & platform tests (38 tests)
│   ├── test_realtime_callbacks.py  # Audio callback tests (15 tests)
│   ├── test_realtime_e2e.py        # End-to-end tests (14 tests)
│   └── test_vad_interruption.py    # VAD & interruption tests (10 tests)
├── config.py              # Configuration
├── main.py                # CLI Application
├── requirements.txt       # Python Dependencies
├── ARCHITECTURE.md        # Detailed architecture guide
├── LICENSE                # Apache License 2.0
└── README.md              # This file
```
| Feature | Real-Time | Dubbing | Creative |
|---|---|---|---|
| Status | VERIFIED WORKING | Production | Production |
| Latency | 80-150ms | N/A | 1-3s |
| Full-Duplex | Infrastructure ready | No | No |
| Interruption | VAD ready | No | No |
| Languages | English + accents | 10 languages | 10 languages |
| Voice Options | 18 variants | Unlimited | Unlimited |
| Persona Control | Yes | No | Yes |
| Emotion Control | Built-in | Yes | Yes |
| Ollama Required | No | No | Yes |
| GPU Required | NVIDIA 16GB+ | Optional | Optional |
| Implementation | PersonaPlex-7B | Qwen3-TTS | Ollama + Qwen3-TTS |
| Best For | Conversations | Narration | Creative content |
End-to-end speech-to-speech conversations with PersonaPlex-7B:
```bash
# Basic conversation
python3 main.py --mode realtime

# Custom persona
python3 main.py --mode realtime --persona "You are a friendly customer service representative"

# Different voice (natural female)
python3 main.py --mode realtime --voice natural_female_2 --persona "You are a helpful tutor"

# Different voice (varied male)
python3 main.py --mode realtime --voice varied_male_1 --persona "You are a tech expert"

# List available voices
python3 main.py --show-config | grep -A 20 "voice"
```
Available Voices (18 total):
- Natural Female: `natural_female_0`, `natural_female_1`, `natural_female_2`, `natural_female_3`
- Natural Male: `natural_male_0`, `natural_male_1`, `natural_male_2`, `natural_male_3`
- Varied Female: `varied_female_0` through `varied_female_4`
- Varied Male: `varied_male_0` through `varied_male_4`
Performance (Verified Feb 2026):
- 80-150ms latency per 100ms audio chunk (RTX 3080), 18x faster than cloud services (streaming loop sketched after this list)
- Full-duplex ready (natural interruptions)
- Persona control via text prompts
- No LLM needed (PersonaPlex-7B handles everything)
- Cross-platform (Windows CUDA, macOS MPS, Linux CPU)
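The per-chunk numbers above imply a simple streaming loop: capture roughly 100ms of 24kHz microphone audio, run one inference step, and play the reply while the next chunk is captured. A rough sketch using sounddevice; the `model.step()` interface is hypothetical:

```python
import sounddevice as sd

SAMPLE_RATE = 24_000
CHUNK = int(0.1 * SAMPLE_RATE)  # 100 ms of audio, as in the benchmark above


def stream_s2s(model):
    """Feed microphone audio to an S2S model chunk by chunk and play replies."""
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as mic:
        while True:
            chunk, _overflowed = mic.read(CHUNK)        # ~100 ms of user speech
            reply = model.step(chunk.squeeze())         # hypothetical: ~80 ms of agent audio
            if reply is not None and len(reply):
                sd.play(reply, samplerate=SAMPLE_RATE)  # non-blocking playback
```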
Professional narration with Qwen3-TTS:
```bash
# Simple narration
python3 main.py --mode dubbing --script "Hello world"

# With voice cloning (3-5 second sample)
python3 main.py --mode dubbing --script "Story text" --voice-ref my_voice.wav

# Specify device
python3 main.py --mode dubbing --script "Your script" --device cpu
```
Features:
- Professional quality (24kHz output)
- 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- Voice cloning from 3-second samples (reference-audio prep sketched after this list)
- Voice design via natural language
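Cloning works best when the reference sample already matches the engine's 24kHz mono format. A small preprocessing sketch with torchaudio; filenames are examples, and Velloris' audio_utils.py may already handle this internally:

```python
import torch
import torchaudio


def prepare_reference(path: str, target_sr: int = 24_000) -> torch.Tensor:
    """Load a reference clip, downmix to mono, resample to 24 kHz, peak-normalize."""
    wav, sr = torchaudio.load(path)                 # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)             # downmix to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    return wav / wav.abs().max().clamp(min=1e-8)    # avoid divide-by-zero on silence


ref = prepare_reference("my_voice.wav")
torchaudio.save("my_voice_24k.wav", ref, 24_000)
```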
Interactive emotional storytelling with Ollama + Qwen3-TTS:
```bash
# Start Ollama first (if not running)
ollama serve  # In separate terminal

# Basic creative mode (interactive prompt)
python3 main.py --mode creative

# With emotion control
python3 main.py --mode creative --emotion "Speak poetically"

# Different LLM model
python3 main.py --mode creative --llm-model mistral --emotion "Excited tone"

# With specific device
python3 main.py --mode creative --emotion "Speak with warmth" --device cuda
```
Type prompts like "Tell me a story about space" and get voiced responses.
Features:
- LLM reasoning (Ollama: llama3, mistral, mixtral, etc.; request flow sketched after this list)
- Emotion control via natural language instructions
- Multilingual support
- Creative flexibility
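Conceptually, creative mode chains an Ollama completion into TTS. A minimal sketch of the LLM half using Ollama's HTTP API; the final `synthesize()` call stands in for the Qwen3-TTS step and is hypothetical:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"


def creative_reply(prompt: str, model: str = "llama3") -> str:
    """Ask the local Ollama server for a non-streaming completion."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # Ollama returns the text under "response"


text = creative_reply("Tell me a story about space")
# synthesize(text, emotion="Speak with excitement")  # hypothetical Qwen3-TTS step
```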
Auto-detect optimal device:
```bash
python3 main.py --device auto
```
Explicit device selection:
```bash
python3 main.py --device cuda   # NVIDIA GPU
python3 main.py --device mps    # Apple Metal (M-series Mac)
python3 main.py --device cpu    # CPU (slowest)
```
Show the current configuration:
```bash
python3 main.py --show-config
```
Displays:
- Platform info (OS, CPU, GPU)
- Device detection results
- Model configuration
- Audio settings
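Auto-detection typically follows a CUDA → MPS → CPU fallback. A sketch of what a device_utils.py-style helper might do; the project's actual implementation may differ:

```python
import torch


def detect_device() -> str:
    """Pick the best available backend: CUDA, then Apple Metal (MPS), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(detect_device())  # matches what --device auto would select
```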
Real-Time Mode pipeline:
```
User Speech (24kHz)
        ↓
PersonaPlex-7B (end-to-end S2S)
  • Listen & Understand
  • Reason & Respond
  • Generate Speech
        ↓
Agent Speech (24kHz) → Speaker
```
Latency: 70-170ms
Full-Duplex: Yes
Ollama: Not needed
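The full-duplex/interruption infrastructure rests on VAD: while the agent is speaking, incoming microphone audio is monitored, and playback is cancelled as soon as the user starts talking. A simplified energy-threshold sketch; vad_handler.py may use a learned VAD instead, and the threshold here is arbitrary:

```python
import numpy as np

ENERGY_THRESHOLD = 0.01  # arbitrary RMS cutoff; a trained VAD model is more robust


def user_is_speaking(chunk: np.ndarray) -> bool:
    """Crude energy-based voice activity check on a mono float32 chunk."""
    rms = float(np.sqrt(np.mean(chunk ** 2)))
    return rms > ENERGY_THRESHOLD

# During agent playback: if user_is_speaking(mic_chunk) flips True,
# cancel the current audio output and hand the turn back to the user.
```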
Dubbing Mode pipeline:
```
Script Text
        ↓
Qwen3-TTS (High-Fidelity Synthesis)
  • 10 languages
  • Voice cloning
  • Emotion control
        ↓
Audio Output (24kHz) → Speaker
```
Quality: Professional
Ollama: Not needed
Creative Mode pipeline:
```
User Text
        ↓
Ollama LLM (Reasoning/Creativity)
        ↓
Response Text
        ↓
Qwen3-TTS (Emotional Synthesis)
        ↓
Audio Output (24kHz) → Speaker
```
Flexibility: High
Ollama: Required
See ARCHITECTURE.md for detailed technical documentation.
Windows (NVIDIA CUDA):
- Optimal Performance: RTX 3000 series or newer
- Installation: Run `install_windows.bat`
- Device Selection: `--device cuda` (auto-selected)
- Optimizations Available: FlashAttention 2, bitsandbytes 4-bit quantization
macOS (Apple Silicon):
- Supported: M1, M2, M3, M4 Pro/Max
- Installation: Run `./install_macos.sh`
- Device Selection: `--device mps` (auto-selected)
- Note: PersonaPlex runs slower on MPS; Qwen3-TTS works well
- MLX-Audio: Native MLX backend for optimized TTS on Apple Silicon with RMS normalization, chunk validation, and model caching
Linux / WSL2:
- CPU Mode: Works on any Linux
- CUDA Mode: Requires NVIDIA GPU + CUDA 12.1+
- System Dependencies: `portaudio19-dev ffmpeg sox libasound2-plugins`
- WSL2: Audio routed through PulseAudio/WSLg to Windows speakers; Ollama on Windows is auto-detected via the gateway IP
See ARCHITECTURE.md for performance comparisons.
Run the full test suite (99 tests):
```bash
# All tests
pytest tests/ -v

# By category
pytest tests/test_pipeline.py -v            # Integration tests (22)
pytest tests/test_critical_paths.py -v      # Critical path & platform tests (38)
pytest tests/test_realtime_callbacks.py -v  # Audio callback tests (15)
pytest tests/test_realtime_e2e.py -v        # End-to-end realtime tests (14)
pytest tests/test_vad_interruption.py -v    # VAD & interruption tests (10)

# With coverage
pytest tests/ --cov=. -v
```
Note: Tests pass without models installed (stub mode); 93 pass, 6 skipped (platform-specific). A stub-mode pattern is sketched below.
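Stub mode means the suite exercises routing and audio plumbing with the heavy models replaced by mocks, along the lines of this illustrative pattern (not the project's actual test code):

```python
from unittest import mock


def test_stub_engine_pipeline():
    """Illustrative stub-mode pattern: replace the model with a Mock."""
    engine = mock.Mock()
    engine.synthesize.return_value = b"\x00" * 4800  # 100 ms of silent 24 kHz int16 audio

    audio = engine.synthesize("hello")               # no model weights involved
    engine.synthesize.assert_called_once_with("hello")
    assert len(audio) == 4800
```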
- ARCHITECTURE.md - Full system architecture, platform support, performance metrics
- LICENSE - Apache License 2.0
Troubleshooting:
No audio output:
- Ensure system volume is up
- Check speaker/headphone connection
- Try: `python3 main.py --mode dubbing --device cpu`
Model download or setup fails:
- Ensure internet connection (for Hugging Face downloads)
- Check disk space (~5GB for models)
- Verify your Python version: `python3 --version` (3.11+ required, 3.12 recommended)
- This is informational if you're only using Dubbing Mode
- Only needed for Real-Time Mode with live speech
Slow performance:
- MPS/Metal: Expected to be slower than CUDA
- CPU: Very slow; GPU recommended
- Solution: Use CPU mode with a smaller model, or wait longer
No audio under WSL2:
- Install PulseAudio and the ALSA plugin: `sudo apt-get install -y pulseaudio-utils libasound2-plugins`
- Configure the ALSA default: `echo -e "pcm.default pulse\nctl.default pulse" > ~/.asoundrc`
- Verify WSLg PulseAudio: `pactl info`
Ollama not detected under WSL2:
- Ollama on Windows must listen on all interfaces: set `OLLAMA_HOST=0.0.0.0` before running `ollama serve`
- Add a Windows firewall rule: `netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434`
- Velloris auto-detects the Windows host IP; no manual configuration needed (detection sketched below)
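That auto-detection is possible because, under WSL2's default NAT networking, the Windows host is reachable at the nameserver listed in /etc/resolv.conf. A sketch of the lookup plus a port probe; Velloris' actual detection code may differ:

```python
import socket


def windows_host_ip() -> str:
    """Under default WSL2 networking, the nameserver in /etc/resolv.conf is the Windows host."""
    with open("/etc/resolv.conf") as f:
        for line in f:
            if line.startswith("nameserver"):
                return line.split()[1]
    return "127.0.0.1"


def ollama_reachable(host: str, port: int = 11434) -> bool:
    """Check whether Ollama's default port is open on the given host."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False


print(ollama_reachable(windows_host_ip()))
```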
Roadmap:
- MLX-Audio Integration (macOS): Native MLX backend for Apple Silicon TTS
- Web UI with Gradio
- ONNX export for edge deployment
- Mobile optimization (iOS/Android)
- Multi-turn conversation memory
- Custom voice fine-tuning
- Real-time transcription display
Apache License 2.0 - See LICENSE file
Contributions welcome! Please open an issue or pull request on GitHub.
- Issues: Check GitHub Issues
- Documentation: See ARCHITECTURE.md
- Questions: Open a Discussion on GitHub
Built with ❤️ for local-first AI