A production-ready multimodal AI system combining real-time computer vision, intelligent analytics, LLM-powered reasoning, and voice interaction. Built with YOLOv8, OpenRouter free models, Vosk, and pyttsx3 for natural human-computer interaction.
- Features
- Architecture
- Requirements
- Installation
- Configuration
- Usage
- Voice Commands
- Performance Metrics
- API Reference
- Multimodal Architecture
- Troubleshooting
- Use Cases
- Contributing
- License
- Real-time Object Detection: YOLOv8-powered detection with configurable confidence thresholds
- Intelligent Analytics: Priority scoring based on object size and position
- LLM Reasoning: OpenRouter API with free models (Llama-3.3-70b, Mistral-7b)
- Voice Input: Offline speech recognition using Vosk
- Voice Output: Text-to-speech synthesis with pyttsx3
- Conversational AI: Natural language scene understanding and question answering
- Multimodal I/O: Vision + Voice + LLM integration
- Non-blocking Architecture: Voice and LLM run parallel to vision processing
- Performance Monitoring: FPS and latency measurement
- Robust Error Handling: Graceful failure recovery and fallback modes
- Production-Ready: Modular design with proper separation of concerns
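The priority scoring feature above is described only at a high level; one plausible scoring rule, weighting detection confidence by on-screen size and closeness to the frame center, can be sketched as follows (the function name and weights are illustrative assumptions, not the project's actual formula):

```python
def priority_score(confidence, bbox, frame_w, frame_h):
    """Illustrative priority: confidence weighted by size and centrality."""
    x1, y1, x2, y2 = bbox
    # Fraction of the frame covered by the box
    area_ratio = ((x2 - x1) * (y2 - y1)) / (frame_w * frame_h)
    # Distance of the box center from the frame center, normalized to [0, 1]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx = abs(cx - frame_w / 2) / (frame_w / 2)
    dy = abs(cy - frame_h / 2) / (frame_h / 2)
    centrality = 1.0 - min(1.0, (dx + dy) / 2)
    return confidence * (0.5 * area_ratio + 0.5 * centrality)
```

Large, centered, high-confidence objects score highest, which matches the "most important object" behavior described in the voice commands section.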
```
Camera → YOLO (Perception) → Analytics (Meaning) → LLM (Reasoning) → Answer
                                                         ↑
Voice Input → STT (Vosk) → User Question ────────────────┘
Answer → TTS (pyttsx3) → Spoken Response     (reasoning served by OpenRouter)
```
Vision Pipeline (Real-time Loop - Main Thread):
- Camera capture → Preprocessing → YOLO inference → Post-processing → Analytics → Display
- Target: 30+ FPS
Voice Pipeline (Parallel Thread - Non-blocking):
- Microphone → Speech-to-Text (Vosk) → Question
- Scene Summary → LLM (OpenRouter) → Answer
- Answer → Text-to-Speech (pyttsx3) → Speaker
Integration:
- Vision continuously updates shared scene summary
- Voice queries latest scene state without blocking vision
- LLM provides intelligent, grounded responses
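The integration rules above (vision writes, voice reads, nobody blocks) can be sketched as a lock-protected shared snapshot; the class and method names here are illustrative, not the project's actual API:

```python
import threading

class SceneState:
    """Vision thread writes the latest summary; voice thread reads a snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._summary = {'total_objects': 0, 'detections': []}

    def update(self, summary):
        # Called by the vision loop every frame; holds the lock only briefly
        with self._lock:
            self._summary = summary

    def snapshot(self):
        # Called by the voice thread on demand; returns a shallow copy so
        # the slow LLM path never holds a reference the vision loop needs
        with self._lock:
            return dict(self._summary)
```

Because the lock is held only for a dictionary assignment or copy, the vision loop's frame rate is unaffected by slow voice or LLM activity.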
- `ObjectDetector`: YOLOv8 wrapper with configurable confidence thresholds
- `ObjectAnalytics`: Priority scoring and structured analytics generation (LLM-optimized)
- `LLMEngine`: OpenRouter client with prompt engineering for grounded reasoning
- `VoiceEngine`: Offline speech recognition and synthesis
- `VideoAnalyticsApp`: Main application orchestrating all components with threading
- Python 3.12+
- Webcam or video input device
- Microphone for voice input
- Speakers/headphones for voice output
- ~2GB RAM for YOLOv8n model
- GPU with CUDA support (optional, for faster inference)
- OpenRouter API key (free - see Configuration)
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd real-time-object-analytics-engine-for-live-video-streams
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download YOLOv8 model (already included): `yolov8n.pt` ships with the repository, so no download is needed.
- Download the Vosk speech recognition model:
  - Visit: https://alphacephei.com/vosk/models
  - Download: `vosk-model-small-en-us-0.15.zip`
  - Extract to: `object_analytics/models/vosk-model-small-en-us-0.15/`
  ```bash
  # After downloading and extracting
  ls object_analytics/models/
  # Should show: vosk-model-small-en-us-0.15/
  ```
- Get an OpenRouter API key (FREE):
  - Visit: https://openrouter.ai/keys
  - Sign up for a free account
  - Generate an API key
  - Copy `.env.example` to `.env`:
    ```bash
    cp .env.example .env
    ```
  - Edit `.env` and add your API key:
    ```
    OPENROUTER_API_KEY=your_actual_key_here
    ```
Edit `object_analytics/config.py` to customize:

```python
# Camera settings
CAMERA_INDEX = 0  # Default webcam

# Detection parameters
CONFIDENCE_THRESHOLD = 0.40  # Minimum confidence for detections
IMPORTANT_CLASSES = [0, 1, 2, 3, 5, 7]  # person, bicycle, car, motorcycle, bus, truck (COCO IDs)

# Display settings
DISPLAY_WINDOW_NAME = "JARVIS-lite: Real-Time Vision + LLM"
MAX_DISPLAY_DETECTIONS = 20
```

Required:
```bash
# OpenRouter API key for LLM reasoning
export OPENROUTER_API_KEY="your_key_here"    # Linux/Mac
set OPENROUTER_API_KEY=your_key_here         # Windows CMD
$env:OPENROUTER_API_KEY="your_key_here"      # Windows PowerShell
```

Optional:
```bash
# Override default model (must be free)
export OPENROUTER_MODEL="mistralai/mistral-7b-instruct:free"
```

Available free models:

- `meta-llama/llama-3.3-70b-instruct:free` (default, best quality)
- `mistralai/mistral-7b-instruct:free` (fast, good quality)
- `google/gemma-7b-it:free` (Google model)
- `nousresearch/nous-capybara-7b:free` (alternative)
Run the complete JARVIS-lite system:
```bash
# Set your API key first
export OPENROUTER_API_KEY="your_key_here"    # Linux/Mac
# OR
set OPENROUTER_API_KEY=your_key_here         # Windows CMD
# OR
$env:OPENROUTER_API_KEY="your_key_here"      # Windows PowerShell

# Run the system
python -m object_analytics.main
```

The system will:
- Initialize YOLOv8 object detector
- Connect to OpenRouter LLM (free model)
- Load Vosk voice recognition model
- Open your webcam
- Start vision processing loop (30+ FPS)
- Start voice interaction thread (parallel)
- Display annotated video with FPS metrics
- Listen for voice commands continuously
- Respond with intelligent, scene-grounded answers
Press `Ctrl+C` or `Q` to exit gracefully.
The system will display:
- Live video feed with bounding boxes and labels
- Real-time FPS and latency metrics
- Detection confidence scores
- Voice interaction status in console
Console Output:

```
INFO - Initializing JARVIS-lite Multimodal AI System...
INFO - LLM reasoning engine initialized successfully
INFO - Voice engine initialized successfully
INFO - JARVIS voice interaction loop started
INFO - Using LLM-powered reasoning via OpenRouter
INFO - Vision processing: 28.5 FPS
INFO - User: "What do you see?"
INFO - JARVIS: "I see 2 objects: a person with 85% confidence and a car with 92% confidence."
```
JARVIS-lite understands natural language questions about the current scene.
- "What do you see?" β Complete list of detected objects
- "How many objects?" β Total count
- "Describe the scene" β Detailed scene analysis
- "What is the most important object?" β Highest priority object
- "Tell me about the person" β Person-specific details
- "Is there a car?" β Presence detection
- "What's in the center?" β Center-focused analysis
- "Where is the main object?" β Position description
- "Is the path clear?" β Safety assessment
- "Should I pay attention to something?" β Priority advice
- "What should I focus on?" β Important object guidance
- "Is anything important happening?" β Event detection
The LLM provides intelligent, grounded answers based only on current detections.
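One common way to keep answers grounded like this is to inject the structured scene summary into the prompt and forbid the model from using anything else. A minimal sketch of such prompt construction (the function name and exact wording are assumptions, not the project's actual prompt):

```python
import json

def build_grounded_prompt(scene_summary, question):
    """Constrain the LLM to answer only from the current detections."""
    return (
        "You are a vision assistant. Answer using ONLY the detections below. "
        "If the answer is not in the detections, say you cannot see it.\n\n"
        f"Detections (JSON): {json.dumps(scene_summary)}\n\n"
        f"Question: {question}"
    )
```

Serializing the summary as JSON gives the model unambiguous structure, and the explicit refusal instruction discourages hallucinated objects.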
Typical Performance (YOLOv8n on CPU):
- Vision FPS: 25-30 fps (maintained during voice interaction)
- LLM Latency: 1-3 seconds (via OpenRouter free tier)
- Voice Recognition: ~500ms (Vosk offline)
- Memory Usage: ~2.5GB RAM (includes LLM client)
Performance Factors:
- Hardware: GPU acceleration significantly improves FPS
- Resolution: Lower input resolution = higher FPS
- Model Size: Larger YOLO models = higher accuracy, lower FPS
- LLM Model: Different free models have different latencies
- Network: LLM calls require internet (vision/voice work offline)
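The FPS figures above can be measured with a small rolling-window timer; a minimal sketch of such a monitor (the class name is illustrative, not the project's actual implementation):

```python
import time

class FPSMeter:
    """Rolling-window FPS estimate over the last `window` frames."""

    def __init__(self, window=30):
        self.times = []
        self.window = window

    def tick(self):
        # Record one frame timestamp, keeping only the most recent window
        self.times.append(time.perf_counter())
        if len(self.times) > self.window:
            self.times.pop(0)

    def fps(self):
        if len(self.times) < 2:
            return 0.0
        # Frames per second across the recorded window
        return (len(self.times) - 1) / (self.times[-1] - self.times[0])
```

Calling `tick()` once per frame and `fps()` when drawing the overlay gives a smoothed figure that is less jumpy than per-frame deltas.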
LLM-powered reasoning engine using OpenRouter API.
Parameters:
- `api_key` (str): OpenRouter API key (or set the `OPENROUTER_API_KEY` environment variable)
- `model` (str): Model identifier (default: `meta-llama/llama-3.3-70b-instruct:free`)
- `max_tokens` (int): Maximum response length (default: 150)
- `temperature` (float): Sampling temperature (default: 0.3 for factual responses)
Methods:
- `answer(scene_summary, question, timeout)` → str: Generate a grounded answer
- `test_connection()` → bool: Test OpenRouter API connectivity
Example:
```python
llm = LLMEngine()  # Uses OPENROUTER_API_KEY from environment
answer = llm.answer(summary, "What do you see?")
# Returns: "I see 2 objects: a person with 85% confidence and a car with 92% confidence."
```

Generates structured analytics summary optimized for LLM integration.
Parameters:
- `analytics_data` (list): List of analyzed detections
Returns:
`dict`: Summary with total objects and complete detection list:

```python
{
    'total_objects': int,
    'detections': [
        {
            'class': str,
            'confidence': float,
            'priority': float,
            'area_ratio': float,
            'position': str,
            'center': tuple,
            'bbox': list
        },
        ...
    ]
}
```
Example:
```python
summary = analytics_engine.summarize(analyzed_detections)
# Returns complete scene data for LLM reasoning
```

Offline speech recognition and text-to-speech engine.
Parameters:
- `model_path` (str): Path to the Vosk model directory
Methods:
- `listen()` → str: Captures and transcribes speech to text
- `speak(text)` → None: Converts text to speech output
Example:
```python
voice = VoiceEngine("models/vosk-model-small-en-us-0.15")
question = voice.listen()  # Waits for speech
voice.speak("Hello, I can see objects in the scene")
```

Main application class combining vision, analytics, LLM reasoning, and voice.
Methods:
- `run()` → int: Starts the complete multimodal system
Example:
```python
app = VideoAnalyticsApp()
exit_code = app.run()  # Full JARVIS-lite experience
```

JARVIS-lite combines three parallel processing pipelines:
- Real-time Detection: YOLOv8 processes video frames at 30+ FPS
- Object Analytics: Extracts confidence scores, positions, and priorities
- Scene Summary: Maintains latest detection state for LLM reasoning
- Speech Recognition: Vosk processes audio input offline
- Text-to-Speech: pyttsx3 generates natural voice responses
- Non-blocking Operation: Runs independently to preserve vision FPS
- LLM Integration: OpenRouter API with free models
- Question Processing: Natural language understanding of user queries
- Scene Analysis: LLM interprets structured scene data
- Grounded Responses: Factual answers based only on detections
- Timeout Protection: 5-second timeout prevents blocking
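The timeout protection above can be implemented by running the API call in a worker thread and bounding the wait; a minimal sketch using `concurrent.futures`, with an assumed fallback message (the function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

_executor = ThreadPoolExecutor(max_workers=1)

def answer_with_timeout(call_llm, scene_summary, question, timeout=5.0):
    """Return the LLM answer, or a basic fallback if the call exceeds timeout."""
    future = _executor.submit(call_llm, scene_summary, question)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeout:
        # Degrade gracefully: summarize from local detections instead
        n = scene_summary.get('total_objects', 0)
        return f"The model is slow right now; I currently detect {n} objects."
```

The voice thread blocks at most `timeout` seconds, and the vision thread is never involved, so detection FPS stays unaffected by LLM latency.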
```
Main Thread:  Vision processing (30+ FPS) → Scene Summary
                                                 ↓
Voice Thread: Speech I/O → LLM Reasoning → Answer (non-blocking)
                               ↓
LLM Call:     OpenRouter API (1-3 s latency, asynchronous)
```
Critical Design Principles:
- Vision never waits for LLM or voice
- Voice queries use latest scene snapshot
- LLM timeout prevents indefinite blocks
- Fallback reasoning when LLM unavailable
- Graceful degradation at all levels
This architecture ensures smooth multimodal interaction while maintaining real-time performance.
- "LLM functionality disabled": Set OPENROUTER_API_KEY environment variable
- LLM timeout errors: Network latency - LLM falls back to basic summaries
- "Model not found": Check model name format (must include ":free" suffix)
- Rate limiting: OpenRouter free tier has limits - wait and retry
- No microphone detected: Ensure microphone is connected and enabled in system settings
- Poor recognition accuracy: Speak clearly and closer to microphone
- Vosk model not found: Verify the model is extracted to `object_analytics/models/vosk-model-small-en-us-0.15/`
- Low FPS: Close other applications using camera/GPU
- LLM lag: Normal for free tier (1-3s) - doesn't affect vision FPS
- Audio lag: Voice processing runs in background thread - should not affect vision
- Memory usage: YOLOv8 + LLM client requires ~2.5GB RAM
- Missing dependencies: Run `pip install -r requirements.txt`
- Python version: Requires Python 3.12+ for optimal performance
- OpenCV errors: Reinstall with `pip install opencv-python`
- openai library errors: Update to the latest version with `pip install --upgrade openai`
- "Model not found": Download and extract Vosk model to correct directory
- "No audio device": Check microphone permissions and connections
- "CUDA out of memory": Use CPU mode or reduce model size
- "OpenRouter connection failed": Check internet connection and API key
This project is licensed under the MIT License - see the LICENSE file for details.
- Security Systems: Real-time monitoring and alerting
- Traffic Analytics: Vehicle counting and flow analysis
- Retail Analytics: Customer behavior tracking
- Industrial Monitoring: Equipment and personnel tracking
- Drone Applications: Aerial object detection and tracking
The system includes built-in error handling and will gracefully handle:
- Camera disconnection
- Model loading failures
- Detection inference errors
- Memory constraints
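For example, camera disconnection can be survived with bounded retries instead of an unhandled exception; a minimal sketch, assuming a `cv2.VideoCapture`-style object whose `read()` returns `(ok, frame)` (the function name is illustrative):

```python
def read_frame(cap, max_retries=3):
    """Return a frame, or None after bounded retries (camera likely gone)."""
    for _ in range(max_retries):
        ok, frame = cap.read()
        if ok:
            return frame
    # Caller logs the failure and shuts the system down gracefully
    return None
```

The main loop then treats `None` as a signal to release resources and exit cleanly rather than crash mid-session.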