Adoption: 1000+ Clones | Growing rapidly in the robotics & AI community
Star this repo! With 50,000+ views on Reddit and interest from the Hacker News (YC) community, AXIOM is proving that you don't need massive compute for high-end AI. Help me show the world that optimized edge-native agents (4GB VRAM) can outperform the cloud.
AXIOM is a production-grade, fully offline voice agent...
A production-grade voice-first AI system for robotics labs. Combines real-time speech processing, intelligent intent classification, RAG-powered responses, and interactive 3D visualization, all running locally with sub-400-ms latency.
- Contributing guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Issues and feature requests: https://github.com/pheonix-delta/axiom-voice-agent/issues
Interactive carousel with equipment cards and voice agent
Detailed equipment specifications and 3D models
Real-time voice interaction with visual feedback
AXIOM is a sophisticated voice agent built for robotics lab environments. It combines modern ML techniques with efficient inference pipelines to deliver:
- Instant Voice Interaction: Real-time speech processing with WebSocket communication
- Intelligent Intent Classification: SetFit-based intent recognition loaded from secure `.safetensors` weights, with an 88%+ confidence threshold. Pickle-based security risks are eliminated via manual tensor inference.
- Context-Aware Responses: Semantic RAG with 2,116+ template responses
- 3D Interactive UI: WebGL-based carousel for visual equipment interaction
- Multi-turn Conversation: FIFO history management for contextual understanding
- Sub-2s Latency: Optimized for real-time conversational experience
- Clean TTS Output: Phonetic + minimal safe correctors (5m → 5 meters)
- Future-Ready Training: Interaction DB logs corrections for continuous improvement
- Glued Interactions - Context-aware multi-turn dialogue with 5-interaction FIFO history (stores conversation context for natural coherence)
- Zero-Copy Inference - Direct tensor streaming from STT to LLM (94% memory reduction, 2.4% latency improvement)
- 3D Holographic UI - Interactive WebGL carousel with GPU-optimized lazy loading (streaming + progressive model loading)
- Dual Corrector Pipeline - Phonetic + minimal safe correctors for clean, natural TTS output
Quantitative analysis of AXIOM's response pipeline across different query types:
Component-level latency breakdown and system throughput metrics
End-to-end response time analysis across intent categories
See AXIOM in action with real voice interactions and system logs:
- Terminal Demo Log - Cleaned excerpts showing key interactions
- Asciinema Recording - Full terminal session recording
If you use this project in research, please cite the DOI:
```bibtex
@misc{axiom_voice_agent_2024,
  title  = {AXIOM: Advanced Voice Agent with Conversational Intelligence},
  author = {Shubham Dev},
  year   = {2024},
  doi    = {10.13140/RG.2.2.26858.17603},
  url    = {https://doi.org/10.13140/RG.2.2.26858.17603}
}
```

```
┌──────────────────────┐
│  Browser (Web UI)    │
│  - Voice Capture     │
│  - 3D Visualization  │
└──────────┬───────────┘
           │ WebSocket
           ▼
┌──────────────────────────────────────────┐
│          FastAPI Backend Server          │
├──────────────────────────────────────────┤
│  [STT Pipeline]                          │
│   • Sherpa-ONNX Parakeet                 │
│   • Silero VAD (Voice Detection)         │
│   • Phonetic + Minimal Safe Corrector    │
│  [Intent Classification]                 │
│   • SetFit Model (local inference)       │
│   • 15+ intent classes                   │
│  [Response Pipeline]                     │
│   • Template-based bypass (80% QPS)      │
│   • Semantic RAG handler                 │
│   • Ollama LLM fallback                  │
│  [TTS Engine]                            │
│   • Kokoro TTS (Sherpa-ONNX)             │
│   • Sequential queue (no echo)           │
│   • TTS-safe text normalization          │
└──────────────────────────────────────────┘
                     │ (Data Persistence)
                     ▼
              SQLite Database
          (Conversation History)
```
```
┌────────────────────────────────────────────────────────────────┐
│                        Browser (Web UI)                        │
│  • Voice Capture (MediaDevices)    • 3D WebGL Carousel         │
│  • Real-time Waveform Display      • Equipment Visualization   │
└─────────────────────────────────┬──────────────────────────────┘
                                  │ WebSocket (Binary + JSON)
                                  ▼
┌────────────────────────────────────────────────────────────────┐
│              FastAPI Backend (main_agent_web.py)               │
├────────────────────────────────────────────────────────────────┤
│  INPUT → [STT] → [Intent] → [Response] → [TTS] → OUTPUT        │
│                                                                │
│  1. SPEECH-TO-TEXT (STT)                                       │
│     • Model: Sherpa-ONNX Parakeet-TDT (200MB)                  │
│     • Speed: <100ms inference                                  │
│     • Tech: Transducer-based streaming recognition             │
│     • File: backend/stt_handler.py                             │
│                                                                │
│  2. INTENT CLASSIFICATION                                      │
│     • Model: SetFit (secure model_head.safetensors)            │
│     • Speed: <50ms inference                                   │
│     • Labels: equipment_query, project_ideas, etc. (9)         │
│     • Security: zero-copy manual tensor math (no pickle)       │
│     • File: backend/intent_classifier.py                       │
│                                                                │
│  3. CONTEXT INJECTION (Glued Interactions)                     │
│     • Stores: last 5 interactions in SQLite                    │
│     • Injects: previous context into LLM prompt                │
│     • Benefit: natural multi-turn dialogue                     │
│     • File: backend/conversation_manager.py                    │
│                                                                │
│  4. RESPONSE GENERATION                                        │
│     • 80% template path (fast)                                 │
│        - 2,116 pre-generated responses                         │
│        - <10ms latency, 100% deterministic                     │
│        - covers common equipment queries                       │
│     • 20% RAG+LLM path (intelligent)                           │
│        - Semantic RAG: searches knowledge bases                │
│        - LLM: Ollama with drobotics_test model                 │
│        - Sources: 1,806 facts + 325 project ideas              │
│        - Latency: ~100-500ms                                   │
│     • File: backend/semantic_rag_handler.py                    │
│                                                                │
│  5. TEXT-TO-SPEECH (TTS)                                       │
│     • Model: Kokoro-EN (Sherpa-ONNX based, 150MB)              │
│     • Speed: <200ms per sentence                               │
│     • Tech: sequential FIFO queue (prevents echo)              │
│     • File: backend/sequential_tts_handler.py                  │
│                                                                │
│  6. 3D MODEL MAPPING                                           │
│     • Keyword extraction: equipment names                      │
│     • Carousel trigger: robot_dog → unitree_go2.glb            │
│     • Files: backend/keyword_mapper.py,                        │
│              backend/model_3d_mapper.py                        │
├────────────────────────────────────────────────────────────────┤
│                     Data Layer (Persistent)                    │
│  • SQLite: conversation history (data/web_interaction_*.db)    │
│  • JSON: knowledge bases (data/*.json)                         │
│  • Static: 3D models (assets/3d v2/*.glb)                      │
└────────────────────────────────────────────────────────────────┘
```
| Component | Purpose | Tech Stack |
|---|---|---|
| STT Handler | Convert audio → text | Sherpa-ONNX + Silero VAD |
| Intent Classifier | Detect user intent | SetFit (sentence-transformers) |
| RAG Handler | Search knowledge bases | Sentence-Transformers embeddings |
| Conversation Manager | Maintain context | Python deque + SQLite |
| Template Responses | Fast replies | 2,116 JSON templates |
| Ollama Interface | Complex queries | Ollama + drobotics_test model |
| TTS Handler | Generate speech | Kokoro-EN (Sherpa-ONNX) |
| 3D Mapper | Equipment → GLB files | Keyword extraction |
| WebSocket Server | Real-time communication | FastAPI + uvicorn |
- Phonetic Corrector: TTS-friendly conversion of units and domain terms
  - Example: "5m" → "5 meters", "jetson nano" → "Jetson Nano"
- Minimal Safe Corrector: Removes markdown/noise without changing meaning
  - Example: `**bold**`, `*italic*`, `` `code` `` → plain text
- Template Bypass: Short, verified replies when confidence is high
  - Saves GPU/LLM resources and improves latency
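A minimal sketch of the two correctors (the mapping tables and function names here are illustrative, not the actual `backend/vocabulary_handler.py` or `backend/minimal_safe_corrector.py` APIs):

```python
import re

# Hypothetical unit/term expansions; the real tables live in the backend handlers.
PHONETIC_MAP = [
    (re.compile(r"\b(\d+)\s*m\b"), r"\1 meters"),
    (re.compile(r"\bjetson nano\b", re.IGNORECASE), "Jetson Nano"),
]

# Markdown markers that sound wrong when read aloud.
MARKDOWN_NOISE = re.compile(r"(\*\*|\*|`)")

def phonetic_correct(text: str) -> str:
    """Expand units and normalize domain terms for natural speech."""
    for pattern, replacement in PHONETIC_MAP:
        text = pattern.sub(replacement, text)
    return text

def minimal_safe_correct(text: str) -> str:
    """Strip markdown markers without touching the words themselves."""
    return MARKDOWN_NOISE.sub("", text)

print(minimal_safe_correct(phonetic_correct("The **sensor** range is 5m")))
```

Running the two stages in order yields plain, speakable text ("The sensor range is 5 meters") while leaving the wording itself untouched.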
- Python: 3.10+
- RAM: 8GB minimum (16GB recommended)
- VRAM: 2-3.6GB for GPU acceleration (optional; CPU mode works too)
- Disk: 1GB for models (Kokoro, Sherpa, SetFit)
```bash
# Clone repository
git clone https://github.com/pheonix-delta/axiom-voice-agent.git
cd axiom-voice-agent

# Create virtual environment (recommended name: axiomvenv)
python3 -m venv axiomvenv
source axiomvenv/bin/activate   # Linux/Mac
# or
axiomvenv\Scripts\activate      # Windows

# Install dependencies (avoid --break-system-packages; use the venv)
pip install -r requirements.txt
```

Models are symlinked from your system. Verify they're accessible:

```bash
# Check symlinks
ls -la models/

# Output should show:
# kokoro-en-v0_19 -> ../../kokoro-en-v0_19
# sherpa-onnx-... -> ../../sherpa-onnx-...

# If symlinks are broken, set environment variables:
export KOKORO_PATH=/path/to/kokoro-en-v0_19
export SHERPA_PATH=/path/to/sherpa-onnx-...
```

See MODEL_PATH_RESOLUTION.md for complete setup options:
- Environment variables (recommended)
- Creating symlinks
- Configuration files (.env)
- Troubleshooting broken paths

```bash
cd backend
python main_agent_web.py

# Output:
# INFO: Application startup complete
# INFO: Uvicorn running on http://0.0.0.0:8000
```

Navigate to:

```
http://localhost:8000
```

Click the microphone icon and start speaking!

Note: Use `localhost` or `127.0.0.1` (not IP addresses) for browser microphone permissions.
```
axiom-voice-agent/                              # Root directory
│
├── QUICK START
│   ├── README.md                               # ← You are here
│   ├── QUICK_START.md                          # Detailed feature walkthrough
│   └── PRE_PUBLICATION_CHECKLIST.md            # OSS deployment checklist
│
├── DOCUMENTATION
│   ├── docs/ARCHITECTURE.md                    # Complete system design
│   ├── OSS_DEPLOYMENT_GUIDE.md                 # Symlinks, SetFit, Git LFS, licensing
│   ├── CONTRIBUTING.md                         # Contributor guidelines
│   ├── SECURITY.md                             # Vulnerability disclosure
│   ├── SYSTEM_SANITY_AND_OSS_READINESS_REPORT.md
│   ├── QUICK_REFERENCE_QA.md                   # FAQ for symlinks, SetFit, license
│   └── LICENSE                                 # Apache 2.0 license
│
├── BACKEND (Python)
│   ├── backend/
│   │   ├── main_agent_web.py                   # START HERE: FastAPI + WebSocket server
│   │   ├── stt_handler.py                      # Speech-to-Text (Sherpa-ONNX)
│   │   ├── intent_classifier.py                # Intent detection (SetFit)
│   │   ├── semantic_rag_handler.py             # RAG search + Ollama LLM
│   │   ├── sequential_tts_handler.py           # Text-to-Speech (Kokoro)
│   │   ├── conversation_manager.py             # Glued Interactions (context history)
│   │   ├── conversation_orchestrator.py        # Context injection into LLM
│   │   ├── template_responses.py               # 2,116 pre-generated responses
│   │   ├── model_3d_mapper.py                  # Equipment name → GLB file mapping
│   │   ├── keyword_mapper.py                   # Extract equipment names from text
│   │   ├── vad_handler.py                      # Voice Activity Detection (Silero)
│   │   ├── axiom_brain.py                      # Ollama interface
│   │   ├── config.py                           # Centralized path configuration
│   │   └── [other handlers...]                 # Vocabulary, minimal corrections, etc.
│   └── requirements.txt                        # Python dependencies
│
├── FRONTEND (Web UI)
│   ├── frontend/
│   │   ├── voice-carousel-integrated.html      # START HERE: Web UI + 3D carousel
│   │   └── audio-capture-processor.js          # Audio streaming + WebSocket
│   └── assets/3d v2/                           # 3D equipment models (GLB format)
│       ├── robot_dog_unitree_go2.glb           # Quadruped robot (2.5MB)
│       ├── jetson_orin.glb                     # AI computer
│       ├── lidar_sensor.glb                    # Sensor visualization
│       └── [50+ more equipment models...]
│
├── MODELS (Pre-trained, Symlinked)
│   ├── models/
│   │   ├── kokoro-en-v0_19/                    # TTS model (symlink → ../../kokoro-en-v0_19)
│   │   ├── sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/  # STT (symlink)
│   │   ├── intent_model/
│   │   │   └── setfit_intent_classifier/       # SetFit intent classifier (30MB, Git-tracked)
│   │   ├── silero_vad.onnx                     # Voice detection (40MB)
│   │   ├── Modelfile.drobotics_test            # Ollama model recipe
│   │   └── DROBOTICS_TEST.md                   # Model documentation
│   └── Note: Large models are symlinked from the parent dir to avoid duplication
│
├── DATA (Knowledge Bases)
│   ├── data/
│   │   ├── template_database.json              # 2,116 Q&A template responses
│   │   ├── rag_knowledge_base.json             # 1,806 technical facts
│   │   ├── project_ideas_rag.json              # 325 robotics project suggestions
│   │   ├── inventory.json                      # 27 equipment specifications
│   │   ├── carousel_mapping.json               # Keyword → GLB file mappings
│   │   └── web_interaction_history.db          # SQLite: Conversation history
│   └── Note: All data files are flat JSON (easy to edit, extend, version control)
│
├── SPECIAL FEATURES (Innovation Demos)
│   ├── special_features/
│   │   ├── GLUED_INTERACTIONS_DEMO.md          # Multi-turn context demo
│   │   ├── ZERO_COPY_INFERENCE.md              # Memory optimization details
│   │   ├── 3D_HOLOGRAPHIC_UI.md                # 3D frontend architecture
│   │   ├── test_glued_interactions.py          # Test script for context injection
│   │   └── README.md                           # Feature validation guide
│   └── Note: See achievements/ for innovation analysis
│
├── RESEARCH & TRAINING
│   ├── setfit_training/                        # SetFit model training scripts
│   │   ├── scripts/                            # Training pipeline
│   │   └── generated/                          # Training datasets
│   ├── research/                               # Design decisions
│   ├── benchmarks/                             # Performance metrics
│   └── Note: Model training is reproducible; retrain anytime
│
└── ROOT FILES
    ├── FEATURES.md                             # Feature matrix
    ├── ACHIEVEMENTS_AND_INNOVATION.md          # Innovation documentation
    ├── PATH_FIX_SUMMARY.md                     # Path integrity notes (for reference)
    ├── requirements.txt                        # Python dependencies
    └── .gitignore                              # Git ignore patterns (includes .env)
```
| Task | File | What to Do |
|---|---|---|
| Add new equipment response | `data/template_database.json` | Add `{"intent": "...", "response": "..."}` |
| Add new technical fact | `data/rag_knowledge_base.json` | Add `{"topic": "...", "fact": "..."}` |
| Add new project idea | `data/project_ideas_rag.json` | Add a project object |
| Add new equipment specs | `data/inventory.json` | Add an equipment object |
| Map new equipment to 3D model | `data/carousel_mapping.json` | Add `{"keyword": "name", "glb_file": "file.glb"}` |
| Add new intent labels | Retrain SetFit | See `setfit_training/scripts/train_production_setfit.py` |
| Add custom environment variables | `backend/config.py` | Add an `os.getenv()` call |
| Document | Purpose | For Whom |
|---|---|---|
| README.md (this file) | Overview + quick start | Everyone |
| QUICK_START.md | Feature walkthrough + examples | Users trying features |
| docs/ARCHITECTURE.md | Complete system design | Developers, architects |
| OSS_DEPLOYMENT_GUIDE.md | Symlinks, SetFit, licensing | Open-source maintainers |
| CONTRIBUTING.md | Contributor guidelines | Code contributors |
| SECURITY.md | Vulnerability disclosure | Security researchers |
| QUICK_REFERENCE_QA.md | FAQ (symlinks, SetFit, license) | Quick answers |
| special_features/ | Innovation deep-dives | Advanced users |
Problem: Voice bots typically treat each query as isolated, lacking conversation context.
Solution: Maintain a FIFO queue of last 5 interactions, inject context into LLM prompts.
```
User 1: "Tell me about Jetson Orin"
   → Stored: {query, intent, response, confidence, timestamp}

User 2: "Does it support cameras?"
   WITHOUT context: "I don't know what 'it' refers to"
   WITH context (LLM sees): "Earlier we discussed Jetson Orin with 12GB memory..."
   → Response: "Yes, Jetson Orin supports RealSense D435i cameras..."
```
Implementation:
- Storage: SQLite database (`data/web_interaction_history.db`)
- Manager: `backend/conversation_manager.py` (Python `deque`, max 5 items)
- Injector: `backend/conversation_orchestrator.py` (context in LLM system prompt)
- Impact: +100ms latency for dramatically improved coherence
- Testing: `python special_features/test_glued_interactions.py`
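A minimal sketch of the FIFO history and prompt-context rendering (class and method names are illustrative, not the actual `conversation_manager.py` API):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Interaction:
    query: str
    intent: str
    response: str

class ConversationHistory:
    """FIFO history: once 5 entries are stored, the oldest falls out."""
    def __init__(self, max_turns: int = 5):
        self.turns = deque(maxlen=max_turns)

    def add(self, query: str, intent: str, response: str) -> None:
        self.turns.append(Interaction(query, intent, response))

    def as_context(self) -> str:
        """Render the history for injection into the LLM system prompt."""
        return "\n".join(
            f"User: {t.query}\nAssistant: {t.response}" for t in self.turns
        )

history = ConversationHistory()
history.add("Tell me about Jetson Orin", "equipment_query",
            "Jetson Orin has 12GB memory...")
print(history.as_context())
```

`deque(maxlen=...)` gives the eviction behavior for free: appending a sixth interaction silently drops the first.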
Problem: Traditional ML pipelines copy data 3+ times: STT β String β Tokens β GPU (8.5MB per inference).
Solution: Use NumPy frombuffer() to stream STT output directly as GPU tensors (0 memory copies).
```
Traditional: STT → String (COPY 1) → Tokens (COPY 2) → GPU (COPY 3) = 8.5MB
Zero-Copy:   STT → String (same address) → Tokens (same address) → GPU (same address) = 0.5MB
```
Key Optimization:

```python
# Creates a memory copy
data = np.array(bytes_input)

# Creates a memory view (zero-copy)
data = np.frombuffer(bytes_input, dtype=np.int16)
```

Benefits:
- 94% memory reduction: 8.5MB β 0.5MB per inference
- 2.4% latency improvement: ~10ms faster
- Scalability: Supports 100+ concurrent users on single instance
- Implementation: `backend/stt_handler.py` (NumPy integration with Ollama)
- Testing: `python special_features/validate_zero_copy_inference.py`
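The zero-copy claim is easy to verify in isolation: `np.frombuffer` returns a view that aliases the original bytes, while `np.array` allocates and owns fresh memory. A small self-contained check:

```python
import numpy as np

pcm_bytes = b"\x01\x00" * 512  # fake 16-bit PCM audio chunk (1024 bytes)

view = np.frombuffer(pcm_bytes, dtype=np.int16)   # zero-copy view of the bytes
copied = np.array(view)                           # forces an owning copy

# A frombuffer view does not own its memory; it aliases the original buffer.
assert not view.flags["OWNDATA"] and view.base is not None
# np.array allocated new storage.
assert copied.flags["OWNDATA"]
print(view.nbytes, "bytes exposed without copying")
```

The `OWNDATA` flag is the quickest way to confirm whether a given array in the pipeline is a view or a copy.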
Problem: Heavy 3D assets (~300MB) consume browser memory and network bandwidth.
Solution: Stream + lazy load models on-demand, keep max 3 in VRAM, auto-dealloc when off-screen.
```
User: "Show me the robot dog"
   ↓ STT
"Show me the robot dog"
   ↓ Intent Detection
equipment_query
   ↓ Keyword Mapper
"robot dog"
   ↓ Model 3D Mapper
"robot_dog_unitree_go2.glb"
   ↓ Frontend Lazy Load
Model fetched from /3d v2/ (if not cached)
   ↓ WebGL Render
3D quadruped appears, auto-rotates
```
Server-Side Delivery:

```python
# backend/main_agent_web.py - Line 52
app.mount("/3d v2", StaticFiles(directory="/home/user/Desktop/voice agent/axiom-voice-agent/assets/3d v2"), name="3d_models")
```
---
### Feature 4: Dual Corrector Pipeline (Clean TTS Output)
**Problem**: Raw model output contains units, punctuation, and artifacts that sound wrong in speech.
**Solution**: Two-stage correction before TTS:
1. **Phonetic Corrector**: Expands units and domain terms (e.g., "5m" → "5 meters")
2. **Minimal Safe Corrector**: Removes markdown/noise without changing meaning
**Implementation**:
- **Phonetic**: `backend/vocabulary_handler.py`
- **Minimal Safe**: `backend/minimal_safe_corrector.py`
- **Applied in**: `backend/sequential_tts_handler.py`
**Benefits**:
- Consistent speech pronunciation
- Fewer misreads of symbols/units
- Cleaner audio output for demos

---

Server-side delivery optimizations (3D models):
- HTTP delivery with gzip compression (40% reduction)
- Browser caches frequently used models
- Conditional requests (304 Not Modified) minimize transfer
Client-Side Lazy Loading:

```javascript
// Load ONLY when visible
loadModelOnScroll() {
  if (cardVisible && !modelLoaded) {
    fetch('/3d v2/model.glb')
      .then(r => r.arrayBuffer())
      .then(buffer => GLTFLoader.parse(buffer))
      .then(model => scene.add(model))
  }
}

// Free GPU memory for off-screen models
onScrollOut() {
  scene.remove(model)
  geometry.dispose()  // Release VRAM
  material.dispose()
  texture.dispose()
}
```

GPU Memory Management:
- Max Concurrent: 3 models in VRAM
- Progressive: Pre-fetch adjacent cards
- Auto-Dealloc: Off-screen cleanup
- Cache: Browser + IndexedDB for offline
Network Efficiency:
| Stage | Time | Size |
|---|---|---|
| Page Load | 2-5s | 50KB (no models) |
| First Render | 0.5-1s | 5-20MB (1-2 models) |
| Scrolling | 60 FPS | Max 3 in VRAM |
| Mobile | Works | <500MB available |
Implementation:
- Frontend: Google `<model-viewer>` web component (CDN-loaded)
- Backend mapping: `backend/model_3d_mapper.py` (keyword → GLB)
- Keyword extraction: `backend/keyword_mapper.py`
- Models: GLB format in `assets/3d v2/`
- Testing: Start the server, say equipment names, check the DevTools Network tab
Supported Models:
- robot dog / unitree go2 → 3D quadruped
- jetson → AI computer
- lidar → Sensor visualization
- raspberry pi → Single-board computer
- (50+ more equipment models)
| Metric | Traditional | With Optimizations |
|---|---|---|
| STT Memory | 150MB | 150MB (same) |
| Inference Memory | 8.5MB/call | 0.5MB/call (94% reduction) |
| Total Latency | ~2.5s | ~2.0s (20% improvement) |
| 3D Load Time | 5+ mins (all models) | 0.5s/model (lazy loading) |
| Concurrent Users | 10-20 | 100+ (zero-copy benefit) |
| Context Quality | Isolated queries | Natural multi-turn (glued interactions) |
- Model: Sherpa-ONNX (Parakeet-TDT, 0.6B quantized)
- Inference: <100ms on CPU
- Post-processing: Phonetic corrections for domain-specific terms
- Model: SetFit (fine-tuned on robotics domain)
- Inference: <50ms
- Coverage: 15 intent classes (equipment_query, project_ideas, etc.)
- Threshold: 88%+ confidence for template bypass
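The manual-inference approach described above can be illustrated with a sketch: assuming the SetFit head is a linear layer whose `weight`/`bias` tensors are loaded from `model_head.safetensors` (e.g. via `safetensors.numpy.load_file`, omitted here so the snippet is self-contained), classification is just a matrix product plus softmax. The function name and signature are illustrative, not the actual `intent_classifier.py` API:

```python
import numpy as np

def classify(embedding: np.ndarray, weight: np.ndarray, bias: np.ndarray,
             labels: list[str], threshold: float = 0.88):
    """Linear head over a sentence embedding: softmax(W @ e + b)."""
    logits = weight @ embedding + bias
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    confidence = float(probs[best])
    # Below the 88% threshold, fall through to the RAG/LLM path.
    intent = labels[best] if confidence >= threshold else None
    return intent, confidence
```

Because this is plain NumPy over tensors read from `.safetensors`, no pickle deserialization (and hence no arbitrary code execution) is involved.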
- 80% Template-Based: Fast, deterministic responses
- 20% RAG+LLM: Complex queries using knowledge bases
- RAG Sources:
- Equipment specifications (27 items)
- Technical knowledge (1,806 facts)
- Project ideas (325 items)
- Model: Kokoro-EN (Sherpa-ONNX based)
- Inference: <200ms per sentence
- Queue System: Prevents audio echo/overlap
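The queue behavior can be sketched as a single worker draining a FIFO so only one utterance plays at a time (illustrative, not the actual `sequential_tts_handler.py` implementation):

```python
import queue
import threading

class SequentialTTSQueue:
    """Play utterances strictly one at a time to avoid echo/overlap."""
    def __init__(self, synthesize):
        self.synthesize = synthesize          # text -> synthesis/playback callable
        self.pending = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def speak(self, text: str) -> None:
        self.pending.put(text)                # returns immediately

    def _worker(self) -> None:
        while True:
            text = self.pending.get()         # blocks until the next sentence
            self.synthesize(text)             # must finish before the next starts
            self.pending.task_done()

spoken = []
tts = SequentialTTSQueue(spoken.append)       # stand-in for Kokoro synthesis
tts.speak("First sentence.")
tts.speak("Second sentence.")
tts.pending.join()                            # wait until playback drains
```

A single consumer thread is what guarantees ordering: no second utterance can begin until `synthesize` returns for the current one.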
```
User: "Tell me about the robot dog"
        ↓
[VAD Detection] → voice detected ✓
        ↓
[STT] → "Tell me about the robot dog"
        ↓
[Intent Classifier] → equipment_query (0.91 confidence)
        ↓
[Confidence Check] → 0.91 > 0.88 ✓
        ↓
[Template Handler] → retrieves pre-generated response
        ↓
[TTS] → streams audio to client
        ↓
[UI] → carousel highlights "Robot Dog" card + 3D model
```
Extracted from training data, covers:
- Equipment specifications
- Lab procedures
- Common troubleshooting
- Project recommendations
Organized by domain:
- Mechanical systems
- Electrical integration
- Software frameworks
- Best practices
Project suggestions indexed by:
- Difficulty level
- Equipment required
- Estimated duration
- 3D Model Carousel: WebGL rendering of equipment
- Voice Waveform: Visual feedback during speech
- Status Indicators: Intent confidence, processing state
- Card Highlighting: Context-aware UI updates
- Browser MediaDevices API: Direct microphone access
- WebSocket Streaming: 512-sample chunks (32kHz, Int16)
- Client-side VAD: Reduces server load
- Echo Cancellation: Built-in browser support
| Component | Latency | Memory | VRAM |
|---|---|---|---|
| STT | <100ms | 150MB | 200MB |
| Intent | <50ms | 80MB | 100MB |
| Template | <10ms | 50MB | - |
| RAG | <100ms | 200MB | 500MB |
| TTS | <200ms | 120MB | 300MB |
| Total | <2s | ~1GB | ~3.6GB |
Create a `.env` file in `backend/`:

```bash
AXIOM_MODEL=drobotics_test
TTS_DEVICE=cuda   # or cpu
STT_NUM_THREADS=4
```

Model paths:
- STT: `models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/`
- TTS: `models/kokoro-en-v0_19/`
- Intent: `models/intent_model/setfit_intent_classifier/`
- VAD: `models/silero_vad.onnx`
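A sketch of how `backend/config.py` might resolve these settings with `os.getenv()` fallbacks (variable names beyond those documented above, such as `AXIOM_MODELS_DIR`, are assumptions for illustration):

```python
import os
from pathlib import Path

# Defaults mirror the documented model layout; environment variables override them.
MODELS_DIR = Path(os.getenv("AXIOM_MODELS_DIR", "models"))
KOKORO_PATH = Path(os.getenv("KOKORO_PATH", MODELS_DIR / "kokoro-en-v0_19"))
SHERPA_PATH = Path(os.getenv(
    "SHERPA_PATH",
    MODELS_DIR / "sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8"))
VAD_PATH = Path(os.getenv("VAD_PATH", MODELS_DIR / "silero_vad.onnx"))

AXIOM_MODEL = os.getenv("AXIOM_MODEL", "drobotics_test")
TTS_DEVICE = os.getenv("TTS_DEVICE", "cpu")            # "cuda" or "cpu"
STT_NUM_THREADS = int(os.getenv("STT_NUM_THREADS", "4"))
```

Centralizing the lookups this way means broken symlinks can always be worked around by exporting the matching variable, with no code changes.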
Message Format:

```json
{
  "type": "audio_chunk",
  "data": "<base64 encoded audio bytes>",
  "chunk_index": 42
}
```

Response:

```json
{
  "type": "response",
  "text": "Here's information about the robot dog...",
  "intent": "equipment_query",
  "confidence": 0.91,
  "card_trigger": "robot_dog"
}
```

To add a new intent:
- Add examples to the training data
- Retrain the SetFit model: `python models/train_setfit.py`
- Update `template_database.json` with new responses
To extend the knowledge base:
- Add facts to `data/rag_knowledge_base.json`
- Update templates in `data/template_database.json`
- (Optional) Retrain semantic embeddings
```bash
# Check model loading
python -c "from backend.intent_classifier import IntentClassifier; ic = IntentClassifier(); print(ic.labels)"

# Test STT
python -c "from backend.stt_handler import STTHandler; stt = STTHandler(); print('STT ready')"

# View conversation history
sqlite3 data/web_interaction_history.db "SELECT * FROM interactions LIMIT 5;"
```

- Session Management: One connection per user (can scale to 100+ concurrent users with proper resource allocation)
- Model Caching: Models are loaded once at startup
- Database: SQLite suitable for <10K interactions/day
- For Production: Consider PostgreSQL, Redis caching, load balancing
Symptoms: Browser shows "No microphone permission" or microphone appears inactive.
Solutions:
1. Use `localhost`, not IP addresses
   - `http://192.168.1.100:8000` (won't work)
   - `http://localhost:8000` (works)
   - `http://127.0.0.1:8000` (works)
2. Check browser microphone permissions
   - Click the padlock icon in the address bar
   - Ensure "Microphone" is set to "Allow"
   - Refresh the page
3. Test the microphone in system settings
   - Linux: `pavucontrol` or `alsamixer`
   - macOS: System Preferences → Sound → Input
   - Windows: Settings → Sound → Volume levels
Symptoms: Error like "Model not found" or "No such file or directory"
Solutions:
```bash
# 1. Check symlinks
cd models/
ls -la  # Should show: kokoro-en-v0_19 -> ../../kokoro-en-v0_19

# 2. If symlinks are broken, verify parent directories exist
ls -la ../../kokoro-en-v0_19/
ls -la ../../sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/

# 3. If parent dirs don't exist, set environment variables
export KOKORO_PATH=/path/to/kokoro-en-v0_19
export SHERPA_PATH=/path/to/sherpa-onnx-...
python main_agent_web.py

# 4. See OSS_DEPLOYMENT_GUIDE.md Section 2 for complete symlink setup
```

Symptoms: 5+ second delay before hearing a response.
Solutions:
1. Check GPU memory:
   ```bash
   nvidia-smi  # Should show < 80% usage
   ```
   If near 100%, reduce concurrent clients or use CPU mode.
2. Use template-based responses (faster)
   - Ask about equipment specs (equipment_query intent)
   - 80% of queries should trigger fast templates
3. Check CPU load:
   ```bash
   top -p $(pgrep -f "python main_agent_web.py")
   ```
   If above 90%, the server is overloaded.
4. Use fewer concurrent connections
   - Each WebSocket connection uses ~500MB RAM
   - Max ~20-30 concurrent on typical hardware
Symptoms: Robotic voice overlapping or stuttering audio.
Solutions:
1. Sequential TTS queue (prevents echo)
   - Already built in (`backend/sequential_tts_handler.py`)
   - If it still happens, check the browser console for errors
2. Reduce the microphone input level
   - System Settings → Sound → Input volume at 70-80%
3. Restart the server:
   ```bash
   # Stop: Ctrl+C
   python main_agent_web.py  # Restart
   ```
Symptoms: SQLite locked error or corrupt database.
Solutions:
```bash
# 1. Reset conversation history
rm data/web_interaction_history.db

# 2. Or check database integrity
sqlite3 data/web_interaction_history.db "PRAGMA integrity_check;"

# 3. Restart server (will auto-create a fresh database)
python main_agent_web.py
```

Symptoms: Error about "setfit_intent_classifier not found".
Solutions:
```bash
# 1. Verify SetFit is installed (quote the spec so '>' isn't treated as redirection)
pip install "setfit>=1.0.3"

# 2. Check model directory
ls -la models/intent_model/setfit_intent_classifier/
# Should contain: config.json, model.safetensors, etc.

# 3. Verify it's in requirements.txt
grep "setfit" requirements.txt
```

Symptoms: Empty carousel or "Failed to load model" in the console.
Solutions:
```bash
# 1. Check 3D assets directory
ls -la "assets/3d v2/"*.glb | head -5
# Should show .glb files

# 2. Test model loading from the server (URL-encode the space in the path)
curl -I "http://localhost:8000/3d%20v2/robot_dog_unitree_go2.glb"
# Should return 200 OK

# 3. Check browser console (F12)
# Look for 404 errors on /3d v2/ URLs
```

| Component | Model | Base License | Attribution | Notes |
|---|---|---|---|---|
| LLM | Llama 3.2 3B | Meta Community | Meta AI | Fine-tuned as drobotics_test |
| STT | Sherpa-ONNX Parakeet-TDT 0.6B | Apache 2.0 | NVIDIA NeMo | Quantized INT8 |
| TTS | Kokoro-EN | Apache 2.0 | LJSpeech | Sherpa-ONNX optimized |
| Intent Classification | SetFit | Apache 2.0 | Hugging Face | 9 robotics intents |
| Semantic Search | All-MiniLM-L6-v2 | Apache 2.0 | Sentence-Transformers | RAG embeddings |
| VAD | Silero VAD | MIT | Silero AI | Voice activity detection |
AXIOM Voice Agent is licensed under Apache 2.0.
Copyright 2024-2026 AXIOM Contributors
Licensed under the Apache License, Version 2.0
See LICENSE file for full terms
What This Means:
- Free for commercial use: Build products on top of AXIOM
- Open source: Source code available for modification
- Patent protection: Explicit patent grant included
- Attribution required: Must include LICENSE + acknowledge changes
- Derivatives allowed: Modifications can be kept private
- No warranty: Use at your own risk
1. Fork the repository
2. Create a branch: `git checkout -b feature/your-feature`
3. Make changes (follow the CONTRIBUTING.md style guide)
4. Test and document
5. Submit a pull request with a description
See CONTRIBUTING.md for detailed guidelines.
Do NOT open public issues. See SECURITY.md for responsible disclosure.
- Usage Questions: Check QUICK_START.md and OSS_DEPLOYMENT_GUIDE.md
- Technical Discussions: See docs/ARCHITECTURE.md
- GitHub Discussions: Ask in Issues with the `question` label
- QUICK_START.md - Try each feature with examples
- docs/ARCHITECTURE.md - Complete system design
- special_features/ - Innovation deep-dives
- OSS_DEPLOYMENT_GUIDE.md - Symlinks, licensing, Git LFS
- CONTRIBUTING.md - How to contribute code
- SECURITY.md - Report security vulnerabilities responsibly
- QUICK_REFERENCE_QA.md - FAQ (symlinks, SetFit, license)
- Check the docs first (linked above)
- Search existing issues on GitHub
- Ask in GitHub Discussions with clear context
- Report bugs with reproduction steps + OS details
This project demonstrates:
- Shubham Dev: Primary architect (AXIOM research paper, DOI 10.13140/RG.2.2.26858.17603)
- 4 breakthrough features: Glued Interactions, Zero-Copy Inference, 3D Holographic UI, Dual Corrector Pipeline
- Production architecture: Optimized for real-time voice processing
- Enterprise standards: Apache 2.0 licensing, security best practices, comprehensive documentation
- Open-source governance: Clear guidelines, Git LFS setup, modular design
- 2,116 template responses
- 1,806 knowledge facts
- 325 project ideas
- 50+ 3D equipment models
- <2s end-to-end latency
- 100+ concurrent users supported
- Apache 2.0 licensed
AXIOM integrates with complementary systems for enhanced functionality:
- WiredBrain RAG - Powers AXIOM's semantic retrieval layer with a high-performance RAG pipeline. Provides the knowledge base infrastructure for equipment specifications, technical documentation, and project recommendations.
AXIOM serves as the voice interface layer, while WiredBrain handles the underlying knowledge retrieval and semantic search operations.
Current model storage uses .pkl format for legacy compatibility with certain fine-tuned checkpoints. This introduces potential security risks when loading untrusted models.
Planned Migration (Q1 2026):
- Transition all model weights to the `.safetensors` format
- Eliminates arbitrary code execution vulnerabilities
- Maintains backward compatibility via conversion utilities
- Full implementation tracked in [Issue #XX]
Current Deployment Recommendation: Run AXIOM in isolated environments (containers, VMs, or dedicated hardware) until the migration is complete. Do not load external model files without verifying their source.
Built on the shoulders of open-source foundations:
- Sherpa-ONNX - Speech recognition engine
- SetFit - Intent classification framework
- Sentence-Transformers - Semantic similarity search
- Ollama - Local LLM inference
- FastAPI - Web framework
- Kokoro - Text-to-speech synthesis
Built with ❤️ for the robotics & AI community
For questions, contributions, or ideas, visit our GitHub repository
Contact: devcoder29cse@gmail.com | University Email: 251030181@juitsolan.in
Author: Shubham Dev, Department of Computer Science & Engineering, Jaypee University of Information Technology




