A production-ready multimodal AI system combining real-time computer vision, intelligent analytics, LLM-powered reasoning, and voice interaction. Built with YOLOv8, OpenRouter free models, Vosk, and pyttsx3 for natural human-computer interaction.
- Features
- Architecture
- Requirements
- Installation
- Configuration
- Usage
- Voice Commands
- Performance Metrics
- API Reference
- Multimodal Architecture
- Troubleshooting
- Use Cases
- Contributing
- License
- Real-time Object Detection: YOLOv8-powered detection with configurable confidence thresholds
- Intelligent Analytics: Priority scoring based on object size and position
- LLM Reasoning: OpenRouter API with free models (Llama-3.3-70b, Mistral-7b)
- Voice Input: Offline speech recognition using Vosk
- Voice Output: Text-to-speech synthesis with pyttsx3
- Conversational AI: Natural language scene understanding and question answering
- Multimodal I/O: Vision + Voice + LLM integration
- Non-blocking Architecture: Voice and LLM run parallel to vision processing
- Performance Monitoring: FPS and latency measurement
- Robust Error Handling: Graceful failure recovery and fallback modes
- Production-Ready: Modular design with proper separation of concerns
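The priority scoring feature above is described only at a high level; one plausible scoring rule, weighting detection confidence by on-screen size and closeness to the frame center, can be sketched as follows (the function name and weights are illustrative assumptions, not the project's actual formula):

```python
def priority_score(confidence, bbox, frame_w, frame_h):
    """Illustrative priority: confidence weighted by size and centrality."""
    x1, y1, x2, y2 = bbox
    # Fraction of the frame covered by the box
    area_ratio = ((x2 - x1) * (y2 - y1)) / (frame_w * frame_h)
    # Distance of the box center from the frame center, normalized to [0, 1]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx = abs(cx - frame_w / 2) / (frame_w / 2)
    dy = abs(cy - frame_h / 2) / (frame_h / 2)
    centrality = 1.0 - min(1.0, (dx + dy) / 2)
    return confidence * (0.5 * area_ratio + 0.5 * centrality)
```

Large, centered, high-confidence objects score highest, which matches the "most important object" behavior described in the voice commands section.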
```
Camera → YOLO (Perception) → Analytics (Meaning) → LLM (Reasoning) → Answer
                                                         ↑
Voice Input → STT (Vosk) → User Question ────────────────┘
Answer → TTS (pyttsx3) → Spoken Response     (reasoning served by OpenRouter)
```
Vision Pipeline (Real-time Loop - Main Thread):
- Camera capture → Preprocessing → YOLO inference → Post-processing → Analytics → Display
- Target: 30+ FPS
Voice Pipeline (Parallel Thread - Non-blocking):
- Microphone → Speech-to-Text (Vosk) → Question
- Scene Summary → LLM (OpenRouter) → Answer
- Answer → Text-to-Speech (pyttsx3) → Speaker
Integration:
- Vision continuously updates shared scene summary
- Voice queries latest scene state without blocking vision
- LLM provides intelligent, grounded responses
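The integration rules above (vision writes, voice reads, nobody blocks) can be sketched as a lock-protected shared snapshot; the class and method names here are illustrative, not the project's actual API:

```python
import threading

class SceneState:
    """Vision thread writes the latest summary; voice thread reads a snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._summary = {'total_objects': 0, 'detections': []}

    def update(self, summary):
        # Called by the vision loop every frame; holds the lock only briefly
        with self._lock:
            self._summary = summary

    def snapshot(self):
        # Called by the voice thread on demand; returns a shallow copy so
        # the slow LLM path never holds a reference the vision loop needs
        with self._lock:
            return dict(self._summary)
```

Because the lock is held only for a dictionary assignment or copy, the vision loop's frame rate is unaffected by slow voice or LLM activity.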
- `ObjectDetector`: YOLOv8 wrapper with configurable confidence thresholds
- `ObjectAnalytics`: Priority scoring and structured analytics generation (LLM-optimized)
- `LLMEngine`: OpenRouter client with prompt engineering for grounded reasoning
- `VoiceEngine`: Offline speech recognition and synthesis
- `VideoAnalyticsApp`: Main application orchestrating all components with threading
- Python 3.12+
- Webcam or video input device
- Microphone for voice input
- Speakers/headphones for voice output
- ~2GB RAM for YOLOv8n model
- GPU with CUDA support (optional, for faster inference)
- OpenRouter API key (free - see Configuration)
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd real-time-object-analytics-engine-for-live-video-streams
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download YOLOv8 model (already included): `yolov8n.pt` ships with the repository, so no download is needed.
- Download the Vosk speech recognition model:
  - Visit: https://alphacephei.com/vosk/models
  - Download: `vosk-model-small-en-us-0.15.zip`
  - Extract to: `object_analytics/models/vosk-model-small-en-us-0.15/`
  ```bash
  # After downloading and extracting
  ls object_analytics/models/
  # Should show: vosk-model-small-en-us-0.15/
  ```
- Get an OpenRouter API key (FREE):
  - Visit: https://openrouter.ai/keys
  - Sign up for a free account
  - Generate an API key
  - Copy `.env.example` to `.env`:
    ```bash
    cp .env.example .env
    ```
  - Edit `.env` and add your API key:
    ```
    OPENROUTER_API_KEY=your_actual_key_here
    ```
Edit `object_analytics/config.py` to customize:

```python
# Camera settings
CAMERA_INDEX = 0  # Default webcam

# Detection parameters
CONFIDENCE_THRESHOLD = 0.40  # Minimum confidence for detections
IMPORTANT_CLASSES = [0, 1, 2, 3, 5, 7]  # person, bicycle, car, motorcycle, bus, truck (COCO IDs)

# Display settings
DISPLAY_WINDOW_NAME = "JARVIS-lite: Real-Time Vision + LLM"
MAX_DISPLAY_DETECTIONS = 20
```

Required:
```bash
# OpenRouter API key for LLM reasoning
export OPENROUTER_API_KEY="your_key_here"    # Linux/Mac
set OPENROUTER_API_KEY=your_key_here         # Windows CMD
$env:OPENROUTER_API_KEY="your_key_here"      # Windows PowerShell
```

Optional:
```bash
# Override default model (must be free)
export OPENROUTER_MODEL="mistralai/mistral-7b-instruct:free"
```

Available free models:

- `meta-llama/llama-3.3-70b-instruct:free` (default, best quality)
- `mistralai/mistral-7b-instruct:free` (fast, good quality)
- `google/gemma-7b-it:free` (Google model)
- `nousresearch/nous-capybara-7b:free` (alternative)
Run the complete JARVIS-lite system:
```bash
# Set your API key first
export OPENROUTER_API_KEY="your_key_here"    # Linux/Mac
# OR
set OPENROUTER_API_KEY=your_key_here         # Windows CMD
# OR
$env:OPENROUTER_API_KEY="your_key_here"      # Windows PowerShell

# Run the system
python -m object_analytics.main
```

The system will:
- Initialize YOLOv8 object detector
- Connect to OpenRouter LLM (free model)
- Load Vosk voice recognition model
- Open your webcam
- Start vision processing loop (30+ FPS)
- Start voice interaction thread (parallel)
- Display annotated video with FPS metrics
- Listen for voice commands continuously
- Respond with intelligent, scene-grounded answers
Press `Ctrl+C` or `Q` to exit gracefully.
The system will display:
- Live video feed with bounding boxes and labels
- Real-time FPS and latency metrics
- Detection confidence scores
- Voice interaction status in console
Console Output:

```
INFO - Initializing JARVIS-lite Multimodal AI System...
INFO - LLM reasoning engine initialized successfully
INFO - Voice engine initialized successfully
INFO - JARVIS voice interaction loop started
INFO - Using LLM-powered reasoning via OpenRouter
INFO - Vision processing: 28.5 FPS
INFO - User: "What do you see?"
INFO - JARVIS: "I see 2 objects: a person with 85% confidence and a car with 92% confidence."
```
JARVIS-lite understands natural language questions about the current scene.
- "What do you see?" β Complete list of detected objects
- "How many objects?" β Total count
- "Describe the scene" β Detailed scene analysis
- "What is the most important object?" β Highest priority object
- "Tell me about the person" β Person-specific details
- "Is there a car?" β Presence detection
- "What's in the center?" β Center-focused analysis
- "Where is the main object?" β Position description
- "Is the path clear?" β Safety assessment
- "Should I pay attention to something?" β Priority advice
- "What should I focus on?" β Important object guidance
- "Is anything important happening?" β Event detection
The LLM provides intelligent, grounded answers based only on current detections.
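One common way to keep answers grounded like this is to inject the structured scene summary into the prompt and forbid the model from using anything else. A minimal sketch of such prompt construction (the function name and exact wording are assumptions, not the project's actual prompt):

```python
import json

def build_grounded_prompt(scene_summary, question):
    """Constrain the LLM to answer only from the current detections."""
    return (
        "You are a vision assistant. Answer using ONLY the detections below. "
        "If the answer is not in the detections, say you cannot see it.\n\n"
        f"Detections (JSON): {json.dumps(scene_summary)}\n\n"
        f"Question: {question}"
    )
```

Serializing the summary as JSON gives the model unambiguous structure, and the explicit refusal instruction discourages hallucinated objects.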
Typical Performance (YOLOv8n on CPU):
- Vision FPS: 25-30 fps (maintained during voice interaction)
- LLM Latency: 1-3 seconds (via OpenRouter free tier)
- Voice Recognition: ~500ms (Vosk offline)
- Memory Usage: ~2.5GB RAM (includes LLM client)
Performance Factors:
- Hardware: GPU acceleration significantly improves FPS
- Resolution: Lower input resolution = higher FPS
- Model Size: Larger YOLO models = higher accuracy, lower FPS
- LLM Model: Different free models have different latencies
- Network: LLM calls require internet (vision/voice work offline)
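The FPS figures above can be measured with a small rolling-window timer; a minimal sketch of such a monitor (the class name is illustrative, not the project's actual implementation):

```python
import time

class FPSMeter:
    """Rolling-window FPS estimate over the last `window` frames."""

    def __init__(self, window=30):
        self.times = []
        self.window = window

    def tick(self):
        # Record one frame timestamp, keeping only the most recent window
        self.times.append(time.perf_counter())
        if len(self.times) > self.window:
            self.times.pop(0)

    def fps(self):
        if len(self.times) < 2:
            return 0.0
        # Frames per second across the recorded window
        return (len(self.times) - 1) / (self.times[-1] - self.times[0])
```

Calling `tick()` once per frame and `fps()` when drawing the overlay gives a smoothed figure that is less jumpy than per-frame deltas.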
LLM-powered reasoning engine using OpenRouter API.
Parameters:
- `api_key` (str): OpenRouter API key (or set the `OPENROUTER_API_KEY` environment variable)
- `model` (str): Model identifier (default: `meta-llama/llama-3.3-70b-instruct:free`)
- `max_tokens` (int): Maximum response length (default: 150)
- `temperature` (float): Sampling temperature (default: 0.3 for factual responses)
Methods:
- `answer(scene_summary, question, timeout)` → str: Generate a grounded answer
- `test_connection()` → bool: Test OpenRouter API connectivity
Example:
```python
llm = LLMEngine()  # Uses OPENROUTER_API_KEY from environment
answer = llm.answer(summary, "What do you see?")
# Returns: "I see 2 objects: a person with 85% confidence and a car with 92% confidence."
```

Generates structured analytics summary optimized for LLM integration.
Parameters:
- `analytics_data` (list): List of analyzed detections
Returns:
`dict`: Summary with total objects and complete detection list:

```python
{
    'total_objects': int,
    'detections': [
        {
            'class': str,
            'confidence': float,
            'priority': float,
            'area_ratio': float,
            'position': str,
            'center': tuple,
            'bbox': list
        },
        ...
    ]
}
```
Example:
```python
summary = analytics_engine.summarize(analyzed_detections)
# Returns complete scene data for LLM reasoning
```

Offline speech recognition and text-to-speech engine.
Parameters:
- `model_path` (str): Path to the Vosk model directory
Methods:
- `listen()` → str: Captures and transcribes speech to text
- `speak(text)` → None: Converts text to speech output
Example:
```python
voice = VoiceEngine("models/vosk-model-small-en-us-0.15")
question = voice.listen()  # Waits for speech
voice.speak("Hello, I can see objects in the scene")
```

Main application class combining vision, analytics, LLM reasoning, and voice.
Methods:
- `run()` → int: Starts the complete multimodal system
Example:
```python
app = VideoAnalyticsApp()
exit_code = app.run()  # Full JARVIS-lite experience
```

JARVIS-lite combines three parallel processing pipelines:
- Real-time Detection: YOLOv8 processes video frames at 30+ FPS
- Object Analytics: Extracts confidence scores, positions, and priorities
- Scene Summary: Maintains latest detection state for LLM reasoning
- Speech Recognition: Vosk processes audio input offline
- Text-to-Speech: pyttsx3 generates natural voice responses
- Non-blocking Operation: Runs independently to preserve vision FPS
- LLM Integration: OpenRouter API with free models
- Question Processing: Natural language understanding of user queries
- Scene Analysis: LLM interprets structured scene data
- Grounded Responses: Factual answers based only on detections
- Timeout Protection: 5-second timeout prevents blocking
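The timeout protection above can be implemented by running the API call in a worker thread and bounding the wait; a minimal sketch using `concurrent.futures`, with an assumed fallback message (the function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

_executor = ThreadPoolExecutor(max_workers=1)

def answer_with_timeout(call_llm, scene_summary, question, timeout=5.0):
    """Return the LLM answer, or a basic fallback if the call exceeds timeout."""
    future = _executor.submit(call_llm, scene_summary, question)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeout:
        # Degrade gracefully: summarize from local detections instead
        n = scene_summary.get('total_objects', 0)
        return f"The model is slow right now; I currently detect {n} objects."
```

The voice thread blocks at most `timeout` seconds, and the vision thread is never involved, so detection FPS stays unaffected by LLM latency.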
```
Main Thread:  Vision processing (30+ FPS) → Scene Summary
                                                 ↓
Voice Thread: Speech I/O → LLM Reasoning → Answer (non-blocking)
                               ↓
LLM Call:     OpenRouter API (1-3 s latency, asynchronous)
```
Critical Design Principles:
- Vision never waits for LLM or voice
- Voice queries use latest scene snapshot
- LLM timeout prevents indefinite blocks
- Fallback reasoning when LLM unavailable
- Graceful degradation at all levels
This architecture ensures smooth multimodal interaction while maintaining real-time performance.
- "LLM functionality disabled": Set OPENROUTER_API_KEY environment variable
- LLM timeout errors: Network latency - LLM falls back to basic summaries
- "Model not found": Check model name format (must include ":free" suffix)
- Rate limiting: OpenRouter free tier has limits - wait and retry
- No microphone detected: Ensure microphone is connected and enabled in system settings
- Poor recognition accuracy: Speak clearly and closer to microphone
- Vosk model not found: Verify the model is extracted to `object_analytics/models/vosk-model-small-en-us-0.15/`
- Low FPS: Close other applications using camera/GPU
- LLM lag: Normal for free tier (1-3s) - doesn't affect vision FPS
- Audio lag: Voice processing runs in background thread - should not affect vision
- Memory usage: YOLOv8 + LLM client requires ~2.5GB RAM
- Missing dependencies: Run `pip install -r requirements.txt`
- Python version: Requires Python 3.12+ for optimal performance
- OpenCV errors: Reinstall with `pip install opencv-python`
- openai library errors: Update to the latest version with `pip install --upgrade openai`
- "Model not found": Download and extract Vosk model to correct directory
- "No audio device": Check microphone permissions and connections
- "CUDA out of memory": Use CPU mode or reduce model size
- "OpenRouter connection failed": Check internet connection and API key
This project is licensed under the MIT License - see the LICENSE file for details.
- Security Systems: Real-time monitoring and alerting
- Traffic Analytics: Vehicle counting and flow analysis
- Retail Analytics: Customer behavior tracking
- Industrial Monitoring: Equipment and personnel tracking
- Drone Applications: Aerial object detection and tracking
The system includes built-in error handling and will gracefully handle:
- Camera disconnection
- Model loading failures
- Detection inference errors
- Memory constraints
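For example, camera disconnection can be survived with bounded retries instead of an unhandled exception; a minimal sketch, assuming a `cv2.VideoCapture`-style object whose `read()` returns `(ok, frame)` (the function name is illustrative):

```python
def read_frame(cap, max_retries=3):
    """Return a frame, or None after bounded retries (camera likely gone)."""
    for _ in range(max_retries):
        ok, frame = cap.read()
        if ok:
            return frame
    # Caller logs the failure and shuts the system down gracefully
    return None
```

The main loop then treats `None` as a signal to release resources and exit cleanly rather than crash mid-session.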