An advanced voice-controlled AI assistant with Gemini-powered conversational abilities, secure commands, short-term memory, and computer vision capabilities such as OCR, object detection (YOLOv8), and scene description (BLIP).
## Table of Contents

- Overview
- Features
- Architecture
- Project Structure
- Installation
- Configuration
- Usage
- Example Commands
- Security
- Computer Vision Models
- Troubleshooting
- License
## Overview

This assistant can:
- Listen for a wake word ("yo").
- Understand and execute voice commands.
- Answer questions using Google's Gemini AI API.
- Perform computer vision tasks like reading text from images, detecting objects, and describing scenes.
- Remember and recall facts during the session.
- Open applications, search the web, find files, and read schedules.
- Secure sensitive commands with password authentication.
## Features

- Wake Word Detection – "yo" triggers listening mode.
- Conversational AI – Gemini API for natural responses.
- Voice Command Execution – Runs utilities and vision tasks.
- Secure Commands – Protects sensitive actions with passwords.
- Memory System – Stores and recalls short-term facts.
- Computer Vision – OCR, YOLOv8 object detection, BLIP scene description.
- Utilities – Open apps, search the web, search local files, tell date/schedule.
- Fallback Chatbot – Default to Gemini for unmatched commands.
- Graceful Exit – "good bye", "exit", or Ctrl+C.
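Wake-word matching can be as simple as a substring check on each transcription. The sketch below is a simplified stand-in for the logic in `wakeword.py` (the helper name `contains_wake_word` is illustrative, and the real module presumably captures audio with a speech-recognition library first):

```python
# Simplified wake-word check, assuming transcriptions arrive as plain strings.
WAKE_WORD = "yo"

def contains_wake_word(transcript: str, wake_word: str = WAKE_WORD) -> bool:
    """Return True if the wake word appears as a whole word in the transcript."""
    words = transcript.lower().split()
    return wake_word.lower() in words

print(contains_wake_word("yo assistant"))  # True
print(contains_wake_word("yoga class"))    # False: "yoga" is not the word "yo"
```

Matching on whole words rather than substrings avoids false triggers from words like "yoga" or "yogurt".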
## Architecture

```
main.py
│
├── Wait for wake word → wakeword.py
│
├── Listen to voice → voice.py
│   ├── Secure commands → security.py
│   ├── Utility commands → utils.py
│   ├── Vision commands → vision.py
│   └── Chat fallback → chat.py
│
└── Speak output → shared.py
```
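The routing above can be sketched as a keyword dispatcher. This is an illustrative sketch only — the function name `handle_command` and the keyword lists are assumptions, and the real handlers live in `security.py`, `utils.py`, `vision.py`, and `chat.py`:

```python
# Illustrative command router mirroring the flow in the diagram above.
def handle_command(command: str) -> str:
    """Return which module would handle a given voice command."""
    command = command.lower()
    if any(word in command for word in ("open", "delete", "shut down")):
        return "security"  # password-protected path
    if any(word in command for word in ("search", "date", "schedule")):
        return "utils"
    if any(word in command for word in ("read text", "identify", "describe")):
        return "vision"
    return "chat"  # Gemini fallback for everything else

print(handle_command("open notepad"))      # security
print(handle_command("tell me the date"))  # utils
print(handle_command("how are you?"))      # chat
```

Checking the security keywords first ensures a sensitive command can never slip through an unprotected path.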
## Project Structure

```
project/
├── core/
│   ├── chat.py        # Gemini API integration
│   ├── security.py    # Password-protected commands
│   ├── shared.py      # Shared TTS engine & memory
│   ├── utils.py       # Utility functions
│   ├── vision.py      # OCR, object detection, scene description
│   ├── voice.py       # Speech recognition & command handling
│   └── wakeword.py    # Wake word detection
├── main.py            # Entry point
├── .env               # Environment variables
├── requirements.txt   # Dependencies
├── yolov8n.pt         # YOLOv8 object detection model
└── screenshot.png     # Sample image for OCR
```
## Installation

1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/ai-voice-assistant.git
   cd ai-voice-assistant
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_gemini_api_key_here
   ```

4. Download YOLOv8 model

   Ensure `yolov8n.pt` is in the root directory.

5. Install Tesseract OCR

   - Download: https://github.com/tesseract-ocr/tesseract
   - Update the path in `vision.py` if needed.
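After creating `.env`, the key must be readable at runtime. The project may use a package such as `python-dotenv` for this; the stdlib-only helper below (`load_dotenv_minimal` is a name invented here) is a minimal sketch of the idea:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse KEY=value lines from a .env file into os.environ.
    No quoting or escaping rules -- a deliberate simplification."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# After load_dotenv_minimal(), the key is available as
# os.environ["GEMINI_API_KEY"]
```

`setdefault` means a key already exported in the shell takes precedence over the file, which is the usual dotenv convention.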
## Configuration

- Wake Word – Change in `wakeword.py`:

  ```python
  WAKE_WORD = "yo"
  ```

- Password – Change in `security.py`:

  ```python
  PASSWORD = "1234"
  ```

## Usage

Run the assistant:

```bash
python main.py
```

## Example Commands

| Command | Action |
|---|---|
| "Hello" | Greets the user |
| "Open notepad" | Opens Notepad |
| "Search in browser artificial intelligence" | Google search |
| "Search file report" | Search local files |
| "Tell me the date" | Speaks today's date |
| "Tell me my schedule" | Reads schedule.txt |
| "Remember I have a meeting at 4 PM" | Stores memory |
| "Recall my memories" | Reads back memories |
| "Capture screen" | Saves a screenshot |
| "Screenshot and explain" | OCR + Gemini explanation |
| "Read text" | Reads text live from camera |
| "Identify objects" | YOLOv8 object detection |
| "Describe scene" | BLIP scene captioning |
| "Goodbye" / "Exit" | Exit program |
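The "Remember …" / "Recall my memories" commands suggest a simple session-scoped list. A minimal sketch (the real store lives in `shared.py`; the function names here are assumptions):

```python
# Session-only memory: cleared when the program exits.
memories: list[str] = []

def remember(fact: str) -> None:
    """Store a fact for the current session."""
    memories.append(fact)

def recall() -> str:
    """Read back all stored facts, or a fallback message if there are none."""
    if not memories:
        return "No memories stored yet."
    return "; ".join(memories)

remember("I have a meeting at 4 PM")
print(recall())  # I have a meeting at 4 PM
```

Because the list lives only in process memory, everything is forgotten on exit — matching the "short-term" behavior described above.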
## Security

- Certain commands ("open", "delete", "shut down") require a password.
- Default password: `1234`.
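A password gate for sensitive commands can be as small as the sketch below. This is illustrative only — `security.py` may differ, and a plaintext constant is acceptable for a demo but not for real secrets:

```python
PASSWORD = "1234"  # default, per security.py; change before real use

def authorize(command: str, supplied_password: str) -> bool:
    """Allow sensitive commands only when the correct password is supplied."""
    sensitive = ("open", "delete", "shut down")
    if any(word in command.lower() for word in sensitive):
        return supplied_password == PASSWORD
    return True  # non-sensitive commands need no password

print(authorize("delete file", "1234"))  # True
print(authorize("delete file", "0000"))  # False
```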
## Computer Vision Models

- YOLOv8 (`yolov8n.pt`) – Object detection.
- BLIP – Scene description.
- Tesseract OCR – Text recognition.
## Troubleshooting

- Speech not detected → Check your microphone and the `speechrecognition` installation.
- Gemini API errors → Verify `GEMINI_API_KEY` in `.env`.
- OCR not working → Ensure Tesseract is installed and its path is set in `vision.py`.
- Object detection slow → Use the smallest YOLO model (`yolov8n.pt`).
## License

This project is open-source. Modify and use freely.