An advanced voice-controlled AI assistant with Gemini-powered conversational abilities, secure commands, short-term memory, and computer vision capabilities such as OCR, object detection (YOLOv8), and scene description (BLIP).
## Table of Contents

- Overview
- Features
- Architecture
- Project Structure
- Installation
- Configuration
- Usage
- Example Commands
- Security
- Computer Vision Models
- Troubleshooting
- License
## Overview

This assistant can:
- Listen for a wake word ("yo").
- Understand and execute voice commands.
- Answer questions using Google's Gemini AI API.
- Perform computer vision tasks like reading text from images, detecting objects, and describing scenes.
- Remember and recall facts during the session.
- Open applications, search the web, find files, and read schedules.
- Secure sensitive commands with password authentication.
## Features

- Wake Word Detection – "yo" triggers listening mode.
- Conversational AI – Gemini API for natural responses.
- Voice Command Execution – Runs utilities and vision tasks.
- Secure Commands – Protects sensitive actions with passwords.
- Memory System – Stores and recalls short-term facts.
- Computer Vision – OCR, YOLOv8 object detection, BLIP scene description.
- Utilities – Open apps, search the web, search local files, tell date/schedule.
- Fallback Chatbot – Default to Gemini for unmatched commands.
- Graceful Exit – "good bye", "exit", or Ctrl+C.
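Wake-word matching can be as simple as a substring check on each transcription. The sketch below is a simplified stand-in for the logic in `wakeword.py` (the helper name `contains_wake_word` is illustrative, and the real module presumably captures audio with a speech-recognition library first):

```python
# Simplified wake-word check, assuming transcriptions arrive as plain strings.
WAKE_WORD = "yo"

def contains_wake_word(transcript: str, wake_word: str = WAKE_WORD) -> bool:
    """Return True if the wake word appears as a whole word in the transcript."""
    words = transcript.lower().split()
    return wake_word.lower() in words

print(contains_wake_word("yo assistant"))  # True
print(contains_wake_word("yoga class"))    # False: "yoga" is not the word "yo"
```

Matching on whole words rather than substrings avoids false triggers from words like "yoga" or "yogurt".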
## Architecture

```
main.py
│
├── Wait for wake word → wakeword.py
│
├── Listen to voice → voice.py
│   ├── Secure commands → security.py
│   ├── Utility commands → utils.py
│   ├── Vision commands → vision.py
│   └── Chat fallback → chat.py
│
└── Speak output → shared.py
```
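The routing above can be sketched as a keyword dispatcher. This is an illustrative sketch only — the function name `handle_command` and the keyword lists are assumptions, and the real handlers live in `security.py`, `utils.py`, `vision.py`, and `chat.py`:

```python
# Illustrative command router mirroring the flow in the diagram above.
def handle_command(command: str) -> str:
    """Return which module would handle a given voice command."""
    command = command.lower()
    if any(word in command for word in ("open", "delete", "shut down")):
        return "security"  # password-protected path
    if any(word in command for word in ("search", "date", "schedule")):
        return "utils"
    if any(word in command for word in ("read text", "identify", "describe")):
        return "vision"
    return "chat"  # Gemini fallback for everything else

print(handle_command("open notepad"))      # security
print(handle_command("tell me the date"))  # utils
print(handle_command("how are you?"))      # chat
```

Checking the security keywords first ensures a sensitive command can never slip through an unprotected path.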
## Project Structure

```
project/
├── core/
│   ├── chat.py        # Gemini API integration
│   ├── security.py    # Password-protected commands
│   ├── shared.py      # Shared TTS engine & memory
│   ├── utils.py       # Utility functions
│   ├── vision.py      # OCR, object detection, scene description
│   ├── voice.py       # Speech recognition & command handling
│   └── wakeword.py    # Wake word detection
├── main.py            # Entry point
├── .env               # Environment variables
├── requirements.txt   # Dependencies
├── yolov8n.pt         # YOLOv8 object detection model
└── screenshot.png     # Sample image for OCR
```
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/ai-voice-assistant.git
   cd ai-voice-assistant
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_gemini_api_key_here
   ```

4. Download YOLOv8 model

   Ensure `yolov8n.pt` is in the root directory.

5. Install Tesseract OCR

   - Download: https://github.com/tesseract-ocr/tesseract
   - Update the path in `vision.py` if needed.
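After creating `.env`, the key must be readable at runtime. The project may use a package such as `python-dotenv` for this; the stdlib-only helper below (`load_dotenv_minimal` is a name invented here) is a minimal sketch of the idea:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse KEY=value lines from a .env file into os.environ.
    No quoting or escaping rules -- a deliberate simplification."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# After load_dotenv_minimal(), the key is available as
# os.environ["GEMINI_API_KEY"]
```

`setdefault` means a key already exported in the shell takes precedence over the file, which is the usual dotenv convention.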
## Configuration

- Wake Word – Change in `wakeword.py`:

  ```python
  WAKE_WORD = "yo"
  ```

- Password – Change in `security.py`:

  ```python
  PASSWORD = "1234"
  ```

## Usage

Run the assistant:

```bash
python main.py
```

## Example Commands

| Command | Action |
|---|---|
| "Hello" | Greets the user |
| "Open notepad" | Opens Notepad |
| "Search in browser artificial intelligence" | Google search |
| "Search file report" | Search local files |
| "Tell me the date" | Speaks today's date |
| "Tell me my schedule" | Reads schedule.txt |
| "Remember I have a meeting at 4 PM" | Stores memory |
| "Recall my memories" | Reads back memories |
| "Capture screen" | Saves a screenshot |
| "Screenshot and explain" | OCR + Gemini explanation |
| "Read text" | Reads text live from camera |
| "Identify objects" | YOLOv8 object detection |
| "Describe scene" | BLIP scene captioning |
| "Goodbye" / "Exit" | Exit program |
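The "Remember …" / "Recall my memories" commands suggest a simple session-scoped list. A minimal sketch (the real store lives in `shared.py`; the function names here are assumptions):

```python
# Session-only memory: cleared when the program exits.
memories: list[str] = []

def remember(fact: str) -> None:
    """Store a fact for the current session."""
    memories.append(fact)

def recall() -> str:
    """Read back all stored facts, or a fallback message if there are none."""
    if not memories:
        return "No memories stored yet."
    return "; ".join(memories)

remember("I have a meeting at 4 PM")
print(recall())  # I have a meeting at 4 PM
```

Because the list lives only in process memory, everything is forgotten on exit — matching the "short-term" behavior described above.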
## Security

- Certain commands ("open", "delete", "shut down") require a password.
- Default password: `1234`.
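A password gate for sensitive commands can be as small as the sketch below. This is illustrative only — `security.py` may differ, and a plaintext constant is acceptable for a demo but not for real secrets:

```python
PASSWORD = "1234"  # default, per security.py; change before real use

def authorize(command: str, supplied_password: str) -> bool:
    """Allow sensitive commands only when the correct password is supplied."""
    sensitive = ("open", "delete", "shut down")
    if any(word in command.lower() for word in sensitive):
        return supplied_password == PASSWORD
    return True  # non-sensitive commands need no password

print(authorize("delete file", "1234"))  # True
print(authorize("delete file", "0000"))  # False
```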
## Computer Vision Models

- YOLOv8 (`yolov8n.pt`) – Object detection.
- BLIP – Scene description.
- Tesseract OCR – Text recognition.
## Troubleshooting

- Speech not detected → Check your microphone and the `speechrecognition` installation.
- Gemini API errors → Verify `GEMINI_API_KEY` in `.env`.
- OCR not working → Ensure Tesseract is installed and its path is set in `vision.py`.
- Object detection slow → Use the smallest YOLO model (`yolov8n.pt`).
## License

This project is open-source. Modify and use freely.