A powerful, multimodal AI assistant designed to empower individuals with vision and hearing impairments. Built with Gradio, Modal, Google Gemini, ElevenLabs, and Hugging Face models.
- Scene Description: Detects objects and describes the scene using OwlViT (see the detection sketch after this list).
- Smart OCR: Reads text from documents, signs, and screens using Google Gemini 2.5 Flash.
- Text Simplification: Summarizes complex text into easy-to-understand language.
- Text-to-Speech (TTS): Reads out the simplified text using ElevenLabs (Bella voice).
- Haptic Feedback: Maps detected sounds (e.g., "car horn", "siren") to vibration patterns (simulated).
- Live Captioning: Real-time transcription of speech using Distil-Whisper (see the transcription sketch after this list).
- Emotion Detection: Identifies the speaker's emotion (e.g., "Happy", "Sad", "Angry") using DistilHuBERT.
- Speaker Diarization: Identifies who is speaking (e.g., "SPEAKER_00", "SPEAKER_01") using pyannote.audio.
- Low Latency: Optimized for real-time interaction with parallel processing.
- MCP (Model Context Protocol): Connects to external tools like Calendar, Email, and Maps (Mock implementation).
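As a pointer to how the scene-description feature could be wired up, here is a minimal sketch using the Hugging Face `transformers` implementation of OwlViT; the candidate labels, score threshold, and `describe_scene` helper are illustrative, not the app's actual code:

```python
# Sketch: zero-shot object detection with OwlViT (labels and threshold are illustrative).
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def describe_scene(image: Image.Image, labels: list[str]) -> list[str]:
    """Return human-readable detections for the given candidate labels."""
    inputs = processor(text=[labels], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Rescale boxes to the original image size and keep confident detections.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes
    )[0]
    return [
        f"{labels[label]} ({score:.0%})"
        for score, label in zip(results["scores"].tolist(), results["labels"].tolist())
    ]

# Example: describe_scene(Image.open("street.jpg"), ["a car", "a person", "a traffic light"])
```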
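Similarly, a minimal sketch of the live-captioning step with the `transformers` ASR pipeline; the chunk length and `caption` helper are illustrative:

```python
# Sketch: speech-to-text with Distil-Whisper via the transformers ASR pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,  # process audio in short chunks to keep latency low
)

def caption(audio_path: str) -> str:
    """Transcribe an audio file (ffmpeg must be installed for decoding)."""
    return asr(audio_path)["text"]

# Example: print(caption("meeting.wav"))
```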
- Frontend: Gradio (Python)
- Backend: Modal (Serverless GPU inference)
- AI Models:
  - Vision: `google/gemini-2.5-flash-image`, `google/owlvit-base-patch32`
  - Hearing: `distil-whisper/distil-large-v2`
  - Emotion: `BilalHasan/distilhubert-finetuned-ravdess` (ONNX)
  - Diarization: `pyannote/speaker-diarization-3.1` (see the diarization sketch below)
  - TTS: ElevenLabs API (see the TTS sketch below)
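For the diarization model listed above, a minimal sketch with `pyannote.audio`; the model is gated on Hugging Face, so the `HF_TOKEN` configured during setup must have been granted access, and the `who_spoke_when` helper is illustrative:

```python
# Sketch: speaker diarization with pyannote.audio (gated model, HF_TOKEN required).
import os
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)

def who_spoke_when(audio_path: str) -> list[str]:
    """Return '[start-end] SPEAKER_XX' lines for each speech turn."""
    diarization = pipeline(audio_path)
    return [
        f"[{turn.start:.1f}s-{turn.end:.1f}s] {speaker}"
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
```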
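And for the TTS entry, a sketch against the public ElevenLabs REST endpoint; `BELLA_VOICE_ID` is a placeholder for the Bella voice ID from your ElevenLabs account, and the `speak` helper is illustrative:

```python
# Sketch: text-to-speech via the ElevenLabs REST API.
# BELLA_VOICE_ID is a placeholder: look up the actual voice ID in your account.
import os
import requests

BELLA_VOICE_ID = "your_voice_id_here"

def speak(text: str, out_path: str = "speech.mp3") -> str:
    """Synthesize `text` and save the MP3 audio returned by the API."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{BELLA_VOICE_ID}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```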
```bash
git clone https://github.com/yourusername/accessibility-companion.git
cd accessibility-companion
```

It is recommended to use a virtual environment (Conda or venv).

```bash
pip install -r requirements.txt
```

Note: you also need `ffmpeg` installed on your system.
Create a `.env` file in the root directory:

```
GEMINI_API_KEY=your_gemini_key
ELEVENLABS_API_KEY=your_elevenlabs_key
HF_TOKEN=your_huggingface_token
```
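The app can then read these keys at startup; a minimal sketch, assuming `python-dotenv` is among the dependencies:

```python
# Sketch: loading the API keys at startup (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

keys = {name: os.getenv(name)
        for name in ("GEMINI_API_KEY", "ELEVENLABS_API_KEY", "HF_TOKEN")}
missing = [name for name, value in keys.items() if not value]
if missing:
    raise RuntimeError(f"Missing keys in .env: {', '.join(missing)}")
```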
You need a Modal account. Authenticate first:

```bash
modal setup
```

Then deploy the backend functions:

```bash
modal deploy modal_app.py
```

Finally, start the Gradio frontend:

```bash
python app.py
```

Open your browser at http://localhost:7860.
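For orientation, a GPU function in `modal_app.py` might look roughly like the sketch below; the function name, GPU type, and image contents are assumptions, not the repo's actual definitions:

```python
# Sketch: a Modal GPU function, roughly the shape modal_app.py might take.
# The function name, GPU type, and image contents are illustrative.
import modal

app = modal.App("accessibility-companion")

image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("transformers", "torch", "accelerate")
)

@app.function(gpu="T4", image=image)
def transcribe(audio_bytes: bytes) -> str:
    """Run Distil-Whisper on the GPU and return the transcript."""
    import tempfile
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition",
                   model="distil-whisper/distil-large-v2", device=0)
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(audio_bytes)
        f.flush()
        return asr(f.name)["text"]

# The Gradio frontend can then call this with transcribe.remote(audio_bytes).
```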
- `app.py`: Main Gradio application (frontend & orchestration).
- `modal_app.py`: Modal backend definitions (GPU inference).
- `utils.py`: Helper functions for TTS and text processing.
- `requirements.txt`: Python dependencies.
Pull requests are welcome! Please open an issue first to discuss changes.
MIT License