Skip to content

pheonix-delta/axiom-voice-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

41 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

AXIOM - Advanced Voice Agent with Conversational Intelligence

DOI Read Paper License: Apache 2.0

Python 3.10+

FastAPI

AXIOM Mascot


πŸš€ Update: Trending on r/LocalLLaMA & r/selfhosted!

Adoption: 1000+ Clones | Growing rapidly in the robotics & AI community

⭐ Star this repo! With 50,000+ views on Reddit and interest from the Hacker News (YC) community, AXIOM is proving that you don't need massive compute for high-end AI. Help me show the world that optimized edge-native agents (4GB VRAM) can outperform the cloud.


AXIOM is a production-grade, fully offline voice agent...

A production-grade voice-first AI system for robotics labs. Combines real-time speech processing, intelligent intent classification, RAG-powered responses, and interactive 3D visualizationβ€”all running locally with sub-400-ms latency.

Community

Live Demos

πŸ–₯️ Web Interface Screenshots

AXIOM Web Interface - Main View
Interactive carousel with equipment cards and voice agent

AXIOM Web Interface - Equipment Details
Detailed equipment specifications and 3D models

AXIOM Web Interface - Voice Interaction
Real-time voice interaction with visual feedback

Overview

AXIOM is a sophisticated voice agent built for robotics lab environments. It combines modern ML techniques with efficient inference pipelines to deliver:

  • Instant Voice Interaction: Real-time speech processing with WebSocket communication
  • Intelligent Intent Classification: SetFit-based intent recognition using secure .safetensors with 88%+ confidence thresholds. Eliminated pickle-based security risks with manual tensor inference.
  • Context-Aware Responses: Semantic RAG with 2,116+ template responses
  • 3D Interactive UI: WebGL-based carousel for visual equipment interaction
  • Multi-turn Conversation: FIFO history management for contextual understanding
  • Sub-2s Latency: Optimized for real-time conversational experience
  • Clean TTS Output: Phonetic + minimal safe correctors (5m β†’ 5 meters)
  • Future-Ready Training: Interaction DB logs corrections for continuous improvement

⭐ Four Breakthrough Features

  1. πŸ”— Glued Interactions - Context-aware multi-turn dialogue with 5-interaction FIFO history (stores conversation context for natural coherence)
  2. ⚑ Zero-Copy Inference - Direct tensor streaming from STT to LLM (94% memory reduction, 2.4% latency improvement)
  3. 🎨 3D Holographic UI - Interactive WebGL carousel with GPU-optimized lazy loading (streaming + progressive model loading)
  4. πŸ—£οΈ Dual Corrector Pipeline - Phonetic + minimal safe correctors for clean, natural TTS output

πŸ“Š Real Benchmark Proof (Measured)

Latency Benchmarks

Detailed Performance Table

🧭 Architecture & Innovation Visuals

System Architecture

Innovation Matrix

Performance Metrics

Quantitative analysis of AXIOM's response pipeline across different query types:

Performance Analysis
Component-level latency breakdown and system throughput metrics

Response Time Distribution
End-to-end response time analysis across intent categories

Terminal Demo

See AXIOM in action with real voice interactions and system logs:

Citation

If you use this project in research, please cite the DOI:

@misc{axiom_voice_agent_2024,
  title        = {AXIOM: Advanced Voice Agent with Conversational Intelligence},
  author       = {Shubham Dev},
  year         = {2024},
  doi          = {10.13140/RG.2.2.26858.17603},
  url          = {https://doi.org/10.13140/RG.2.2.26858.17603}
}

πŸ“‹ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Browser (Web UI)   β”‚
β”‚  - Voice Capture    β”‚
β”‚  - 3D Visualization β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚ WebSocket
           ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         FastAPI Backend Server           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€ STT Pipeline ─────────────────────┐  β”‚
β”‚ β”‚ β€’ Sherpa-ONNX Parakeet             β”‚  β”‚
β”‚ β”‚ β€’ Silero VAD (Voice Detection)     β”‚  β”‚
β”‚ β”‚ β€’ Phonetic + Minimal Safe Correctorβ”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚ β”Œβ”€ Intent Classification ────────────┐  β”‚
β”‚ β”‚ β€’ SetFit Model (Local inference)   β”‚  β”‚
β”‚ β”‚ β€’ 15+ Intent classes               β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚ β”Œβ”€ Response Pipeline ────────────────┐  β”‚
β”‚ β”‚ β€’ Template-based bypass (80% QPS)  β”‚  β”‚
β”‚ β”‚ β€’ Semantic RAG handler             β”‚  β”‚
β”‚ β”‚ β€’ Ollama LLM fallback              β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚ β”Œβ”€ TTS Engine ───────────────────────┐  β”‚
β”‚ β”‚ β€’ Kokoro TTS (Sherpa-ONNX)         β”‚  β”‚
β”‚ β”‚ β€’ Sequential queue (no echo)       β”‚  β”‚
β”‚ β”‚ β€’ TTS-safe text normalization       β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        ↓ (Data Persistence)
   SQLite Database
   (Conversation History)

System Architecture

High-Level Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Browser (Web UI)                             β”‚
β”‚   β€’ Voice Capture (MediaDevices)  β€’ 3D WebGL Carousel          β”‚
β”‚   β€’ Real-time Waveform Display    β€’ Equipment Visualization    β”‚
└──────────────────────────────────┬────────────────────────────-β”˜
                                   β”‚ WebSocket (Binary + JSON)
                                   ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              FastAPI Backend (main_agent_web.py)                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  INPUT β†’ [STT] β†’ [Intent] β†’ [Response] β†’ [TTS] β†’ OUTPUT         β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚--
β”‚  β”‚ 1. SPEECH-TO-TEXT (STT)                                 β”‚  β”‚
β”‚  β”‚    β€’ Model: Sherpa-ONNX Parakeet-TDT (200MB)            β”‚  β”‚
β”‚  β”‚    β€’ Speed: <100ms inference                            β”‚  β”‚
β”‚  β”‚    β€’ Tech: Transducer-based streaming recognition       β”‚  β”‚
β”‚  β”‚    β€’ File: backend/stt_handler.py                       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚-
β”‚                          ↓                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 2. INTENT CLASSIFICATION                                β”‚  β”‚
β”‚  β”‚    β€’ Model: SetFit (Secure `model_head.safetensors` migration)β”‚  β”‚
β”‚  β”‚    β€’ Speed: <50ms inference                             β”‚  β”‚
β”‚  β”‚    β€’ Labels: equipment_query, project_ideas, etc. (9)   β”‚  β”‚
β”‚  β”‚    β€’ Security: Zero-copy manual tensor math (No Pickle)  β”‚  β”‚
β”‚  β”‚    β€’ File: backend/intent_classifier.py                 β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                          ↓                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 3. CONTEXT INJECTION (Glued Interactions)               β”‚  β”‚
β”‚  β”‚    β€’ Stores: Last 5 interactions in SQLite              β”‚  β”‚
β”‚  β”‚    β€’ Injects: Previous context into LLM prompt          β”‚  β”‚
β”‚  β”‚    β€’ Benefit: Natural multi-turn dialogue               β”‚  β”‚
β”‚  β”‚    β€’ File: backend/conversation_manager.py              β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                          ↓                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 4. RESPONSE GENERATION                                  β”‚  β”‚
β”‚  β”‚    β”Œβ”€ 80% TEMPLATE PATH (Fast)                          β”‚  β”‚
β”‚  β”‚    β”‚  β€’ 2,116 pre-generated responses                   β”‚  β”‚
β”‚  β”‚    β”‚  β€’ <10ms latency, 100% deterministic               β”‚  β”‚
β”‚  β”‚    β”‚  β€’ Covers common equipment queries                 β”‚  β”‚
β”‚  β”‚    β”‚                                                    β”‚  β”‚
β”‚  β”‚    └─ 20% RAG+LLM PATH (Intelligent)                    β”‚  β”‚
β”‚  β”‚       β€’ Semantic RAG: Searches knowledge bases          β”‚  β”‚
β”‚  β”‚       β€’ LLM: Ollama with drobotics_test model           β”‚  β”‚
β”‚  β”‚       β€’ Sources: 1,806 facts + 325 project ideas        β”‚  β”‚
β”‚  β”‚       β€’ Latency: ~100-500ms                             β”‚  β”‚
β”‚  β”‚                                                         β”‚  β”‚
β”‚  β”‚    File: backend/semantic_rag_handler.py                β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                          ↓                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 5. TEXT-TO-SPEECH (TTS)                                 β”‚  β”‚
β”‚  β”‚    β€’ Model: Kokoro-EN (Sherpa-ONNX based, 150MB)        β”‚  β”‚
β”‚  β”‚    β€’ Speed: <200ms per sentence                         β”‚  β”‚
β”‚  β”‚    β€’ Tech: Sequential FIFO queue (prevents echo)        β”‚  β”‚
β”‚  β”‚    β€’ File: backend/sequential_tts_handler.py            β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                          ↓                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 6. 3D MODEL MAPPING                                     β”‚  β”‚
β”‚  β”‚    β€’ Keyword Extraction: equipment names                β”‚  β”‚
β”‚  β”‚    β€’ Carousel Trigger: robot_dog β†’ unitree_go2.glb      β”‚  β”‚
β”‚  β”‚    β€’ Files: backend/keyword_mapper.py                   β”‚  β”‚
β”‚  β”‚           backend/model_3d_mapper.py                    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€-
β”‚                   Data Layer (Persistent)                      β”‚
β”‚  β€’ SQLite: Conversation history (data/web_interaction_*.db)    β”‚
β”‚  β€’ JSON: Knowledge bases (data/*.json)                         β”‚
β”‚  β€’ Static: 3D models (assets/3d v2/*.glb)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Responsibilities

Component Purpose Tech Stack
STT Handler Convert audio β†’ text Sherpa-ONNX + Silero VAD

| Intent Classifier | Detect user intent | SetFit (sentence-transformers) |

| RAG Handler | Search knowledge bases | Sentence-Transformers embeddings |

| Conversation Manager | Maintain context | Python deque + SQLite |

| Template Responses | Fast replies | 2,116 JSON templates |

| Ollama Interface | Complex queries | Ollama + drobotics_test model |

| TTS Handler | Generate speech | Kokoro-EN (Sherpa-ONNX) |

| 3D Mapper | Equipment β†’ GLB files | Keyword extraction |

| WebSocket Server | Real-time communication | FastAPI + uvicorn |

πŸ—£οΈ Response Quality (Unique Feature)

  • Phonetic Corrector: TTS-friendly conversion of units and domain terms
    • Example: "5m" β†’ "5 meters", "jetson nano" β†’ "Jetson Nano"
  • Minimal Safe Corrector: Removes markdown/noise without changing meaning
    • Example: **bold**, *italic*, `code` β†’ plain text
  • Template Bypass: Short, verified replies when confidence is high
    • Saves GPU/LLM resources and improves latency

πŸš€ Quick Start

Prerequisites

  • Python: 3.10+
  • RAM: 8GB minimum (16GB recommended)
  • VRAM: 2-3.6GB for GPU acceleration (optionalβ€”CPU mode works too)
  • Disk: 1GB for models (Kokoro, Sherpa, SetFit)

Step 1: Clone & Setup

# Clone repository
git clone https://github.com/pheonix-delta/axiom-voice-agent.git
cd axiom-voice-agent

# Create virtual environment (recommended name: axiomvenv)
python3 -m venv axiomvenv
source axiomvenv/bin/activate  # Linux/Mac
# or
axiomvenv\Scripts\activate  # Windows

# Install dependencies (avoid --break-system-packages; use the venv)
pip install -r requirements.txt

Step 2: Download Models (First Run Only)

Models are symlinked from your system. Verify they're accessible:

# Check symlinks
ls -la models/
# Output should show:
# kokoro-en-v0_19 -> ../../kokoro-en-v0_19
# sherpa-onnx-... -> ../../sherpa-onnx-...

# If symlinks are broken, set environment variables:
export KOKORO_PATH=/path/to/kokoro-en-v0_19
export SHERPA_PATH=/path/to/sherpa-onnx-...

πŸ“– See MODEL_PATH_RESOLUTION.md for complete setup options:

  • Environment variables (recommended)
  • Creating symlinks
  • Configuration files (.env)
  • Troubleshooting broken paths

Step 3: Start the Server

cd backend
python main_agent_web.py

# Output:
# INFO:     Application startup complete
# INFO:     Uvicorn running on http://0.0.0.0:8000

Step 4: Open Browser

Navigate to:

http://localhost:8000

πŸŽ™οΈ Click the microphone icon and start speaking!

⚠️ Important: Use localhost or 127.0.0.1 (not IP addresses) for browser microphone permissions.


πŸ“ Project Structure

axiom-voice-agent/                        # Root directory
β”‚
β”œβ”€β”€ πŸš€ QUICK START
β”‚   β”œβ”€β”€ README.md                         # ← You are here
β”‚   β”œβ”€β”€ QUICK_START.md                   # Detailed feature walkthrough
β”‚   └── PRE_PUBLICATION_CHECKLIST.md      # OSS deployment checklist
β”‚
β”œβ”€β”€ πŸ“š DOCUMENTATION
β”‚   β”œβ”€β”€ docs/ARCHITECTURE.md              # Complete system design
β”‚   β”œβ”€β”€ OSS_DEPLOYMENT_GUIDE.md          # Symlinks, SetFit, Git LFS, licensing
β”‚   β”œβ”€β”€ CONTRIBUTING.md                  # Contributor guidelines
β”‚   β”œβ”€β”€ SECURITY.md                      # Vulnerability disclosure
β”‚   β”œβ”€β”€ SYSTEM_SANITY_AND_OSS_READINESS_REPORT.md
β”‚   β”œβ”€β”€ QUICK_REFERENCE_QA.md            # FAQ for symlinks, SetFit, license
β”‚   └── LICENSE                          # Apache 2.0 license
β”‚
β”œβ”€β”€ πŸ”§ BACKEND (Python)
β”‚   β”œβ”€β”€ backend/
β”‚   β”‚   β”œβ”€β”€ main_agent_web.py            # 🎯 START HERE: FastAPI + WebSocket server
β”‚   β”‚   β”œβ”€β”€ stt_handler.py               # Speech-to-Text (Sherpa-ONNX)
β”‚   β”‚   β”œβ”€β”€ intent_classifier.py         # Intent detection (SetFit)
β”‚   β”‚   β”œβ”€β”€ semantic_rag_handler.py      # RAG search + Ollama LLM
β”‚   β”‚   β”œβ”€β”€ sequential_tts_handler.py    # Text-to-Speech (Kokoro)
β”‚   β”‚   β”œβ”€β”€ conversation_manager.py      # πŸ”— Glued Interactions (context history)
β”‚   β”‚   β”œβ”€β”€ conversation_orchestrator.py # Context injection into LLM
β”‚   β”‚   β”œβ”€β”€ template_responses.py        # 2,116 pre-generated responses
β”‚   β”‚   β”œβ”€β”€ model_3d_mapper.py          # Equipment name β†’ GLB file mapping
β”‚   β”‚   β”œβ”€β”€ keyword_mapper.py           # Extract equipment names from text
β”‚   β”‚   β”œβ”€β”€ vad_handler.py              # Voice Activity Detection (Silero)
β”‚   β”‚   β”œβ”€β”€ axiom_brain.py              # Ollama interface
β”‚   β”‚   β”œβ”€β”€ config.py                   # Centralized path configuration
β”‚   β”‚   └── [other handlers...]         # Vocabulary, minimal corrections, etc.
β”‚   └── requirements.txt                 # Python dependencies
β”‚
β”œβ”€β”€ 🎨 FRONTEND (Web UI)
β”‚   β”œβ”€β”€ frontend/
β”‚   β”‚   β”œβ”€β”€ voice-carousel-integrated.html    # 🎯 START HERE: Web UI + 3D carousel
β”‚   β”‚   └── audio-capture-processor.js        # Audio streaming + WebSocket
β”‚   └── assets/3d v2/                         # 3D equipment models (GLB format)
β”‚       β”œβ”€β”€ robot_dog_unitree_go2.glb        # Quadruped robot (2.5MB)
β”‚       β”œβ”€β”€ jetson_orin.glb                  # AI computer
β”‚       β”œβ”€β”€ lidar_sensor.glb                 # Sensor visualization
β”‚       └── [50+ more equipment models...]
β”‚
β”œβ”€β”€ 🧠 MODELS (Pre-trained, Symlinked)
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ kokoro-en-v0_19/            # TTS model (symlink β†’ ../../kokoro-en-v0_19)
β”‚   β”‚   β”œβ”€β”€ sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/  # STT (symlink)
β”‚   β”‚   β”œβ”€β”€ intent_model/
β”‚   β”‚   β”‚   └── setfit_intent_classifier/    # SetFit intent classifier (30MB, Git-tracked)
β”‚   β”‚   β”œβ”€β”€ silero_vad.onnx                  # Voice detection (40MB)
β”‚   β”‚   β”œβ”€β”€ Modelfile.drobotics_test         # Ollama model recipe
β”‚   β”‚   └── DROBOTICS_TEST.md               # Model documentation
β”‚   └── Note: Large models are symlinked from parent dir to avoid duplication
β”‚
β”œβ”€β”€ πŸ“Š DATA (Knowledge Bases)
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ template_database.json           # 2,116 Q&A template responses
β”‚   β”‚   β”œβ”€β”€ rag_knowledge_base.json          # 1,806 technical facts
β”‚   β”‚   β”œβ”€β”€ project_ideas_rag.json           # 325 robotics project suggestions
β”‚   β”‚   β”œβ”€β”€ inventory.json                   # 27 equipment specifications
β”‚   β”‚   β”œβ”€β”€ carousel_mapping.json            # Keyword β†’ GLB file mappings
β”‚   β”‚   └── web_interaction_history.db       # SQLite: Conversation history
β”‚   └── Note: All data files are flat JSON (easy to edit, extend, version control)
β”‚
β”œβ”€β”€ ⭐ SPECIAL FEATURES (Innovation Demos)
β”‚   β”œβ”€β”€ special_features/
β”‚   β”‚   β”œβ”€β”€ GLUED_INTERACTIONS_DEMO.md      # Multi-turn context demo
β”‚   β”‚   β”œβ”€β”€ ZERO_COPY_INFERENCE.md         # Memory optimization details
β”‚   β”‚   β”œβ”€β”€ 3D_HOLOGRAPHIC_UI.md           # 3D frontend architecture
β”‚   β”‚   β”œβ”€β”€ test_glued_interactions.py      # Test script for context injection
β”‚   β”‚   └── README.md                       # Feature validation guide
β”‚   └── Note: See achievements/ for innovation analysis
β”‚
β”œβ”€β”€ πŸ”¬ RESEARCH & TRAINING
β”‚   β”œβ”€β”€ setfit_training/                    # SetFit model training scripts
β”‚   β”‚   β”œβ”€β”€ scripts/                        # Training pipeline
β”‚   β”‚   └── generated/                      # Training datasets
β”‚   β”œβ”€β”€ research/                           # Design decisions
β”‚   β”œβ”€β”€ benchmarks/                         # Performance metrics
β”‚   └── Note: Model training is reproducibleβ€”retrain anytime
β”‚
└── πŸ“‹ ROOT FILES
    β”œβ”€β”€ FEATURES.md                         # Feature matrix
    β”œβ”€β”€ ACHIEVEMENTS_AND_INNOVATION.md      # Innovation documentation
    β”œβ”€β”€ PATH_FIX_SUMMARY.md                 # Path integrity notes (for reference)
    β”œβ”€β”€ requirements.txt                    # Python dependencies
    └── .gitignore                          # Git ignore patterns (includes .env)

Key Files to Edit When Extending

Task File What to Do
Add new equipment response data/template_database.json Add {"intent": "...", "response": "..."}
Add new technical fact data/rag_knowledge_base.json Add {"topic": "...", "fact": "..."}
Add new project idea data/project_ideas_rag.json Add project object
Add new equipment specs data/inventory.json Add equipment object
Map new equipment to 3D model data/carousel_mapping.json Add {"keyword": "name", "glb_file": "file.glb"}
Add new intent labels Retrain SetFit See setfit_training/scripts/train_production_setfit.py
Add custom environment variables backend/config.py Add os.getenv() call

πŸ“– Documentation Roadmap

Document Purpose For Whom
README.md (this file) Overview + quick start Everyone
QUICK_START.md Feature walkthrough + examples Users trying features
docs/ARCHITECTURE.md Complete system design Developers, architects
OSS_DEPLOYMENT_GUIDE.md Symlinks, SetFit, licensing Open-source maintainers
CONTRIBUTING.md Contributor guidelines Code contributors
SECURITY.md Vulnerability disclosure Security researchers
QUICK_REFERENCE_QA.md FAQ (symlinks, SetFit, license) Quick answers
special_features/ Innovation deep-dives Advanced users

⭐ Breakthrough Features Deep Dive

πŸ”— Feature 1: Glued Interactions (Context-Aware Multi-Turn Dialogue)

Problem: Voice bots typically treat each query as isolated, lacking conversation context.

Solution: Maintain a FIFO queue of last 5 interactions, inject context into LLM prompts.

User 1: "Tell me about Jetson Orin"
  β†’ Stored: {query, intent, response, confidence, timestamp}
  
User 2: "Does it support cameras?"
  WITHOUT context: "I don't know what 'it' refers to"
  WITH context (LLM sees): "Earlier we discussed Jetson Orin with 12GB memory..."
  β†’ Response: "Yes, Jetson Orin supports RealSense D435i cameras..."

Implementation:

  • Storage: SQLite database (data/web_interaction_history.db)
  • Manager: backend/conversation_manager.py (Python deque, max 5 items)
  • Injector: backend/conversation_orchestrator.py (context in LLM system prompt)
  • Impact: +100ms latency for dramatically improved coherence
  • Testing: python special_features/test_glued_interactions.py

⚑ Feature 2: Zero-Copy Inference (Direct Tensor Streaming)

Problem: Traditional ML pipelines copy data 3+ times: STT β†’ String β†’ Tokens β†’ GPU (8.5MB per inference).

Solution: Use NumPy frombuffer() to stream STT output directly as GPU tensors (0 memory copies).

Traditional: STT β†’ String (COPY 1) β†’ Tokens (COPY 2) β†’ GPU (COPY 3) = 8.5MB
Zero-Copy:  STT β†’ String (same address) β†’ Tokens (same address) β†’ GPU (same address) = 0.5MB

Key Optimization:

# ❌ Creates memory copy
data = np.array(bytes_input)

# βœ… Creates memory view (zero-copy)
data = np.frombuffer(bytes_input, dtype=np.int16)

Benefits:

  • 94% memory reduction: 8.5MB β†’ 0.5MB per inference
  • 2.4% latency improvement: ~10ms faster
  • Scalability: Supports 100+ concurrent users on single instance
  • Implementation: backend/stt_handler.py (NumPy integration with Ollama)
  • Testing: python special_features/validate_zero_copy_inference.py

🎨 Feature 3: 3D Holographic UI (Dynamic Model Visualization)

Problem: Heavy 3D assets (~300MB) consume browser memory and network bandwidth.

Solution: Stream + lazy load models on-demand, keep max 3 in VRAM, auto-dealloc when off-screen.

User Interaction Flow

User: "Show me the robot dog"
  ↓ STT
"Show me the robot dog"
  ↓ Intent Detection
equipment_query
  ↓ Keyword Mapper
"robot dog"
  ↓ Model 3D Mapper
"robot_dog_unitree_go2.glb"
  ↓ Frontend Lazy Load
Model fetches from /3d v2/ (if not cached)
  ↓ WebGL Render
3D quadruped appears, auto-rotates

3D Heavy Frontend Management Strategy

Server-Side Delivery:

# backend/main_agent_web.py - Line 52
app.mount("/3d v2", StaticFiles(directory="/home/user/Desktop/voice agent/axiom-voice-agent/assets/3d v2"), name="3d_models")

---

### πŸ—£οΈ Feature 4: Dual Corrector Pipeline (Clean TTS Output)

**Problem**: Raw model output contains units, punctuation, and artifacts that sound wrong in speech.

**Solution**: Two-stage correction before TTS:
1. **Phonetic Corrector**: Expands units and domain terms (e.g., "5m" β†’ "5 meters")
2. **Minimal Safe Corrector**: Removes markdown/noise without changing meaning

**Implementation**:
- **Phonetic**: `backend/vocabulary_handler.py`
- **Minimal Safe**: `backend/minimal_safe_corrector.py`
- **Applied in**: `backend/sequential_tts_handler.py`

**Benefits**:
- Consistent speech pronunciation
- Fewer misreads of symbols/units
- Cleaner audio output for demos
  • HTTP delivery with gzip compression (40% reduction)
  • Browser caches frequently used models
  • Conditional requests (304 Not Modified) minimize transfer

Client-Side Lazy Loading:

// Load ONLY when visible
loadModelOnScroll() {
    if (cardVisible && !modelLoaded) {
        fetch('/3d v2/model.glb')
            .then(r => r.arrayBuffer())
            .then(buffer => GLTFLoader.parse(buffer))
            .then(model => scene.add(model))
    }
}

// Free GPU memory for off-screen models
onScrollOut() {
    scene.remove(model)
    geometry.dispose()  // Release VRAM
    material.dispose()
    texture.dispose()
}

GPU Memory Management:

  • Max Concurrent: 3 models in VRAM
  • Progressive: Pre-fetch adjacent cards
  • Auto-Dealloc: Off-screen cleanup
  • Cache: Browser + IndexedDB for offline

Network Efficiency:

Stage Time Size
Page Load 2-5s 50KB (no models)
First Render 0.5-1s 5-20MB (1-2 models)
Scrolling 60 FPS Max 3 in VRAM
Mobile Works <500MB available

Implementation:

  • Frontend: Google <model-viewer> web component (CDN-loaded)
  • Backend Mapping: backend/model_3d_mapper.py (keywordβ†’GLB)
  • Keyword Extraction: backend/keyword_mapper.py
  • Models: GLB format in assets/3d v2/
  • Testing: Start server β†’ Say equipment names β†’ Check DevTools Network tab

Supported Models:

robot dog / unitree go2  β†’ 3D quadruped
jetson                  β†’ AI computer
lidar                   β†’ Sensor visualization
raspberry pi            β†’ Single-board computer
(50+ more equipment models)

πŸ“Š Performance Comparison

Metric Traditional With Optimizations
STT Memory 150MB 150MB (same)
Inference Memory 8.5MB/call 0.5MB/call (94% reduction)
Total Latency ~2.5s ~2.0s (2.4% improvement)
3D Load Time 5+ mins (all models) 0.5s/model (lazy loading)
Concurrent Users 10-20 100+ (zero-copy benefit)
Context Quality Isolated queries Natural multi-turn (glued interactions)

1. Speech-to-Text (STT)

  • Model: Sherpa-ONNX (Parakeet-TDT, 0.6B quantized)
  • Inference: <100ms on CPU
  • Post-processing: Phonetic corrections for domain-specific terms

2. Intent Classification

  • Model: SetFit (fine-tuned on robotics domain)
  • Inference: <50ms
  • Coverage: 15 intent classes (equipment_query, project_ideas, etc.)
  • Threshold: 88%+ confidence for template bypass

3. Response Generation

  • 80% Template-Based: Fast, deterministic responses
  • 20% RAG+LLM: Complex queries using knowledge bases
  • RAG Sources:
    • Equipment specifications (27 items)
    • Technical knowledge (1,806 facts)
    • Project ideas (325 items)

4. Text-to-Speech (TTS)

  • Model: Kokoro-EN (Sherpa-ONNX based)
  • Inference: <200ms per sentence
  • Queue System: Prevents audio echo/overlap

πŸ”„ Data Flow Example

User: "Tell me about the robot dog"
  ↓
[VAD Detection] β†’ Voice detected βœ“
  ↓
[STT] β†’ "Tell me about the robot dog"
  ↓
[Intent Classifier] β†’ equipment_query (0.91 confidence)
  ↓
[Confidence Check] β†’ 0.91 > 0.88 βœ“
  ↓
[Template Handler] β†’ Retrieves pre-generated response
  ↓
[TTS] β†’ Streams audio to client
  ↓
[UI] β†’ Carousel highlights "Robot Dog" card + 3D model

🧠 Knowledge Bases (RAG)

Template Database (2,116 responses)

Extracted from training data, covers:

  • Equipment specifications
  • Lab procedures
  • Common troubleshooting
  • Project recommendations

RAG Knowledge Base (1,806 facts)

Organized by domain:

  • Mechanical systems
  • Electrical integration
  • Software frameworks
  • Best practices

Project Ideas (325 items)

Project suggestions indexed by:

  • Difficulty level
  • Equipment required
  • Estimated duration

🎨 Frontend Features

Real-time Visualization

  • 3D Model Carousel: WebGL rendering of equipment
  • Voice Waveform: Visual feedback during speech
  • Status Indicators: Intent confidence, processing state
  • Card Highlighting: Context-aware UI updates

Audio Processing

  • Browser MediaDevices API: Direct microphone access
  • WebSocket Streaming: 512-sample chunks (32kHz, Int16)
  • Client-side VAD: Reduces server load
  • Echo Cancellation: Built-in browser support

πŸ“Š Performance Metrics

Component Latency Memory VRAM
STT <100ms 150MB 200MB
Intent <50ms 80MB 100MB
Template <10ms 50MB -
RAG <100ms 200MB 500MB
TTS <200ms 120MB 300MB
Total <2s ~1GB ~3.6GB

πŸ”§ Configuration

Environment Variables (Optional)

Create .env file in backend/:

AXIOM_MODEL=drobotics_test
TTS_DEVICE=cuda  # or cpu
STT_NUM_THREADS=4

Model Paths

  • STT: models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/
  • TTS: models/kokoro-en-v0_19/
  • Intent: models/intent_model/setfit_intent_classifier/
  • VAD: models/silero_vad.onnx

πŸ“š API Reference

WebSocket Endpoint: /ws

Message Format:

{
  "type": "audio_chunk",
  "data": "<base64 encoded audio bytes>",
  "chunk_index": 42
}

Response:

{
  "type": "response",
  "text": "Here's information about the robot dog...",
  "intent": "equipment_query",
  "confidence": 0.91,
  "card_trigger": "robot_dog"
}

πŸ› οΈ Development

Adding New Intents

  1. Add examples to training data
  2. Retrain SetFit model: python models/train_setfit.py
  3. Update template_database.json with new responses

Extending Knowledge Base

  1. Add facts to data/rag_knowledge_base.json
  2. Update templates in data/template_database.json
  3. (Optional) Retrain semantic embeddings

Debugging

# Check model loading
python -c "from backend.intent_classifier import IntentClassifier; ic = IntentClassifier(); print(ic.labels)"

# Test STT
python -c "from backend.stt_handler import STTHandler; stt = STTHandler(); print('STT ready')"

# View conversation history
sqlite3 data/web_interaction_history.db "SELECT * FROM interactions LIMIT 5;"

πŸ“ˆ Scalability Notes

  • Session Management: One connection per user (can scale to 100+ concurrent users with proper resource allocation)
  • Model Caching: Models are loaded once at startup
  • Database: SQLite suitable for <10K interactions/day
  • For Production: Consider PostgreSQL, Redis caching, load balancing

πŸ› Troubleshooting Guide

Problem: Microphone Not Working

Symptoms: Browser shows "No microphone permission" or microphone appears inactive.

Solutions:

  1. Use localhost, not IP addresses

    • ❌ http://192.168.1.100:8000 (won't work)
    • βœ… http://localhost:8000 (works)
    • βœ… http://127.0.0.1:8000 (works)
  2. Check browser microphone permissions

    • Click padlock icon in address bar
    • Ensure "Microphone" is set to "Allow"
    • Refresh page
  3. Test microphone in system settings

    • Linux: pavucontrol or alsamixer
    • macOS: System Preferences β†’ Sound β†’ Input
    • Windows: Settings β†’ Sound β†’ Volume levels

Problem: Models Not Loading

Symptoms: Error like "Model not found" or "No such file or directory"

Solutions:

# 1. Check symlinks
cd models/
ls -la  # Should show: kokoro-en-v0_19 -> ../../kokoro-en-v0_19

# 2. If symlinks are broken, verify parent directories exist
ls -la ../../kokoro-en-v0_19/
ls -la ../../sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/

# 3. If parent dirs don't exist, set environment variables
export KOKORO_PATH=/path/to/kokoro-en-v0_19
export SHERPA_PATH=/path/to/sherpa-onnx-...
python main_agent_web.py

# 4. See OSS_DEPLOYMENT_GUIDE.md Section 2 for complete symlink setup

Problem: High Latency / Slow Response

Symptoms: 5+ second delay before hearing response.

Solutions:

  1. Check GPU memory

    nvidia-smi  # Should show < 80% usage
    • If near 100%, reduce concurrent clients or use CPU mode
  2. Use template-based responses (faster)

    • Ask about equipment specs (equipment_query intent)
    • 80% of queries should trigger fast templates
  3. Check CPU load

    top -p $(pgrep -f "python main_agent_web.py")
    • If > 90%, server is overloaded
  4. Use fewer concurrent connections

    • Each WebSocket connection uses ~500MB RAM
    • Max ~20-30 concurrent on typical hardware

Problem: Audio Cutting Out / Echo

Symptoms: Robotic voice overlapping or stuttering audio.

Solutions:

  1. Sequential TTS Queue (prevents echo)

    • Already built-in (backend/sequential_tts_handler.py)
    • If still happening, check browser console for errors
  2. Reduce microphone input level

    • System Settings β†’ Sound β†’ Input volume at 70-80%
  3. Restart server

    # Stop: Ctrl+C
    python main_agent_web.py  # Restart

Problem: Database Errors

Symptoms: SQLite locked error or corrupt database.

Solutions:

# 1. Reset conversation history
rm data/web_interaction_history.db

# 2. Or check database integrity
sqlite3 data/web_interaction_history.db "PRAGMA integrity_check;"

# 3. Restart server (will auto-create fresh database)
python main_agent_web.py

Problem: SetFit Model Not Loading

Symptoms: Error about "setfit_intent_classifier not found"

Solutions:

# 1. Verify SetFit is installed
pip install setfit>=1.0.3

# 2. Check model directory
ls -la models/intent_model/setfit_intent_classifier/
# Should contain: config.json, model.safetensors, etc.

# 3. Verify it's in requirements.txt
grep "setfit" requirements.txt

Problem: 3D Models Not Showing

Symptoms: Empty carousel or "Failed to load model" in console.

Solutions:

# 1. Check 3D assets directory
ls -la assets/3d\ v2/*.glb | head -5
# Should show .glb files

# 2. Test model loading from server
curl http://localhost:8000/3d\ v2/robot_dog_unitree_go2.glb -I
# Should return 200 OK

# 3. Check browser console (F12)
# Look for 404 errors on /3d v2/ URLs

πŸ› Troubleshooting

πŸŽ“ Model Attribution & Licensing

Base Models & Fine-tuning

Component Model Base License Attribution Notes
LLM Llama 3.2 3B Meta Community Meta AI Fine-tuned as drobotics_test
STT Sherpa-ONNX Parakeet-TDT 0.6B Apache 2.0 Xiaoomi Wenet Quantized INT8
TTS Kokoro-EN Apache 2.0 LJSpeech Sherpa-ONNX optimized
Intent Classification SetFit Apache 2.0 Hugging Face 9 robotics intents
Semantic Search All-MiniLM-L6-v2 Apache 2.0 Sentence-Transformers RAG embeddings
VAD Silero VAD MIT Silero AI Voice activity detection

Project License

AXIOM Voice Agent is licensed under Apache 2.0.

Copyright 2024-2026 AXIOM Contributors
Licensed under the Apache License, Version 2.0
See LICENSE file for full terms

What This Means:

  • βœ… Free for Commercial Use: Build products on top of AXIOM
  • βœ… Open Source: Source code available for modification
  • βœ… Patent Protection: Explicit patent grant included
  • βœ… Attribution Required: Must include LICENSE + acknowledge changes
  • βœ… Derivatives Allowed: Modifications can be kept private
  • βœ… No Warranty: Use at your own risk

🀝 How to Contribute

For Code Contributors

  1. Fork the repository
  2. Create branch: git checkout -b feature/your-feature
  3. Make changes (follow CONTRIBUTING.md style guide)
  4. Test and document
  5. Submit pull request with description

See CONTRIBUTING.md for detailed guidelines.

For Security Issues

Do NOT open public issues. See SECURITY.md for responsible disclosure.

For Questions


πŸ“ž Support Resources

πŸ“š Documentation

πŸ› οΈ Development

πŸ”§ Getting Help

  1. Check the docs first (linked above)
  2. Search existing issues on GitHub
  3. Ask in GitHub Discussions with clear context
  4. Report bugs with reproduction steps + OS details

🌟 Featured In

This project demonstrates:

  • βœ… Shubham Dev: Primary architect (Axiom Research Paper | 10.13140/RG.2.2.26858.17603)
  • βœ… 4 Breakthrough Features: Glued Interactions, Zero-Copy Inference, 3D Holographic UI, Dual Corrector Pipeline
  • βœ… Production Architecture: Optimized for real-time voice processing
  • βœ… Enterprise Standards: Apache 2.0 licensing, security best practices, comprehensive documentation
  • βœ… Open-Source Governance: Clear guidelines, Git LFS setup, modular design

πŸ“Š Quick Stats

  • πŸ’¬ 2,116 template responses
  • πŸ“š 1,806 knowledge facts
  • πŸ’‘ 325 project ideas
  • 🎨 50+ 3D equipment models
  • ⚑ <2s end-to-end latency
  • πŸš€ 100+ concurrent users supported
  • πŸ”’ Apache 2.0 licensed

οΏ½ Related Projects

AXIOM integrates with complementary systems for enhanced functionality:

  • WiredBrain RAG - Powers AXIOM's semantic retrieval layer with a high-performance RAG pipeline. Provides the knowledge base infrastructure for equipment specifications, technical documentation, and project recommendations.

AXIOM serves as the voice interface layer, while WiredBrain handles the underlying knowledge retrieval and semantic search operations.


πŸ›‘οΈ Security & Development Roadmap

Model Format Migration

Current model storage uses .pkl format for legacy compatibility with certain fine-tuned checkpoints. This introduces potential security risks when loading untrusted models.

Planned Migration (Q1 2026):

  • Transition all model weights to .safetensors format
  • Eliminates arbitrary code execution vulnerabilities
  • Maintains backward compatibility via conversion utilities
  • Full implementation tracked in [Issue #XX]

Current Deployment Recommendation: Run AXIOM in isolated environments (containers, VMs, or dedicated hardware) until the migration is complete. Do not load external model files without verifying their source.


πŸ™ Acknowledgments

Built on the shoulders of open-source foundations:

  • Sherpa-ONNX - Speech recognition engine
  • SetFit - Intent classification framework
  • Sentence-Transformers - Semantic similarity search
  • Ollama - Local LLM inference
  • FastAPI - Web framework
  • Kokoro - Text-to-speech synthesis

Built with ❀️ for the robotics & AI community

For questions, contributions, or ideas, visit our GitHub repository

Contact: devcoder29cse@gmail.com | University Email: 251030181@juitsolan.in

Author: Shubham Dev, Department of Computer Science & Engineering, Jaypee University of Information Technology