A production-ready, multimodal video retrieval system that matches videos to natural language queries using a vision-language model (BLIP) and a vector database (ChromaDB).
This version includes a clean UI, speed optimizations, and robust error handling.
- Natural Language Search: "Find the clip where a dog is running on grass."
- Optimized Performance: Adjustable frame sampling processes only a subset of frames instead of every frame, making processing 5-10x faster.
- Video Summary Generation: Automatically captions video content.
- Clean UI: Tabbed interface for Search, Upload, and Library management.
- Caching: Models load once, preventing slow reloads.
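
The frame sampling idea from the feature list can be illustrated with a minimal sketch. This is not the project's actual code: the function name `sampled_frame_indices` and the `step` parameter are hypothetical, chosen only to show how keeping every N-th frame reduces the work.

```python
def sampled_frame_indices(total_frames: int, step: int) -> list[int]:
    """Return the indices of the frames to process, keeping every `step`-th frame.

    Hypothetical helper for illustration; the real project may sample differently
    (for example by time interval instead of frame count).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(range(0, total_frames, step))


# Example: a 60-second clip at 30 fps has 1800 frames.
total = 1800
indices = sampled_frame_indices(total, step=8)

# Fraction of work saved: only len(indices) frames reach the vision model.
speedup = total / len(indices)
print(f"{len(indices)} of {total} frames processed, about {speedup:.0f}x fewer")
```

With OpenCV, the selected indices could then be read by seeking with `cv2.VideoCapture.set(cv2.CAP_PROP_POS_FRAMES, i)` before each `read()`. A larger `step` is faster but risks skipping short events, so the value is a trade-off between speed and recall.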
| Component | Technology |
|---|---|
| Language | Python |
| Vision Model | BLIP (Salesforce) |
| Vector DB | ChromaDB |
| Audio Processing | ffmpeg, pydub |
| Frame Extraction | OpenCV |
| UI / Deployment | Streamlit |
- Clone the repository (or download the files).
- Create a virtual environment (recommended):

      python -m venv venv
      # Windows
      venv\Scripts\activate
      # Mac/Linux
      source venv/bin/activate

- Install dependencies and start the demo:

      pip install -r requirements.txt
      # Start the demo
      streamlit run run.py