AI-powered semantic search engine for video content: find any moment, instantly.
ChronoView transforms any video (a lecture, meeting, tutorial, or conference talk) into a fully searchable knowledge base using multimodal AI. Type a natural language query and jump to the exact timestamp where that moment occurs. No scrubbing. No guessing. No re-watching hours of content.
- 500+ hours of video are uploaded every minute globally
- There is no "Ctrl+F" for video content
- Students waste hours scrubbing lecture recordings
- Enterprises lose $37B/year to unsearchable meeting recordings
- Existing tools match keywords, not meaning
ChronoView processes three parallel data streams from every video:
| Stream | Model | Output |
|---|---|---|
| Speech | OpenAI Whisper | Timestamped transcripts |
| Visual | Vision Transformer (ViT) | Scene embeddings |
| On-screen text | Tesseract OCR | Slide & code text |
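As a concrete sketch of the speech stream, the snippet below shows the segment shape that the open-source `whisper` package returns from `transcribe()` and how segments become timestamped index entries (the sample segments here are hand-written, not real model output):

```python
# Hand-written sample of Whisper's output shape; the real pipeline would run:
#   result = whisper.load_model("base").transcribe("lecture.mp4")
#   segments = result["segments"]
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to lecture four."},
    {"start": 4.2, "end": 9.8, "text": " Today we cover gradient descent."},
]

def to_timestamp(seconds: float) -> str:
    """Render seconds in the MM:SS form used by search results."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

# Each transcript segment becomes one searchable, timestamped entry
index_entries = [
    {"timestamp": to_timestamp(seg["start"]), "text": seg["text"].strip()}
    for seg in segments
]
print(index_entries[1])  # {'timestamp': '00:04', 'text': 'Today we cover gradient descent.'}
```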
All three streams are fused by a CLIP-inspired contrastive learning model into a unified semantic embedding, which is stored in a FAISS vector database for millisecond-speed retrieval.
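To make the retrieval step concrete, here is a toy stand-in for that search using plain NumPy in place of FAISS (an `IndexFlatIP` over unit-normalized vectors performs the same inner-product ranking at scale; the vectors below are random placeholders, not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """Scale vectors to unit length so dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# 1000 indexed video moments in a 512-dim embedding space (random stand-ins)
segment_vecs = normalize(rng.normal(size=(1000, 512)))

# A query embedding that happens to lie near moment 42
query_vec = normalize(segment_vecs[42] + 0.1 * rng.normal(size=512))

scores = segment_vecs @ query_vec      # cosine similarities
top_k = np.argsort(-scores)[:5]        # indices of the best-matching moments
print(top_k[0])  # 42
```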
- Semantic Search: natural language query → exact timestamp
- Direct Q&A: AI-generated answers extracted from the video
- Auto-Chapters: AI-generated, titled navigation segments
- Multilingual Search: query in any language
- Shareable Timestamp Links: share exact video moments
- Engagement Heatmaps: analytics on which segments are searched most
- Highlight Reel Export: compile relevant segments into a short clip
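As an illustration, a shareable timestamp link can be as simple as encoding the moment as a query parameter; a minimal sketch (the `chronoview.app` domain and `?v=...&t=...` format are hypothetical):

```python
from urllib.parse import urlencode

def share_link(video_id: str, seconds: int,
               base: str = "https://chronoview.app/watch") -> str:
    """Build a link that opens the video at an exact offset in seconds."""
    return f"{base}?{urlencode({'v': video_id, 't': seconds})}"

print(share_link("cs229_lecture4", 872))
# https://chronoview.app/watch?v=cs229_lecture4&t=872
```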
```
Video Input
     │
     ▼
FFmpeg (segment splitting · keyframe extraction · audio strip)
     │
     ├──────────────────────┬──────────────────────┐
     ▼                      ▼                      ▼
Whisper ASR           Tesseract OCR            ViT Model
(speech → text)    (slide & code text)     (scene encoding)
     │                      │                      │
     └──────────────────────┴──────────────────────┘
     │
     ▼
CLIP Fusion Model (PyTorch)
Unified semantic embedding space
     │
     ▼
FAISS / ChromaDB
Vector similarity index
     │
     ▼
FastAPI Backend (REST API)
     │
     ▼
React Dashboard + Video Player
```
| Layer | Technology |
|---|---|
| Video Processing | FFmpeg |
| Speech Recognition | OpenAI Whisper |
| Scene Understanding | Vision Transformer (ViT) |
| OCR | Tesseract |
| Semantic Fusion | CLIP (PyTorch + HuggingFace) |
| Vector Search | FAISS / ChromaDB |
| Backend API | FastAPI |
| Frontend | React / Streamlit |
| Analytics | Plotly |
| Storage | AWS S3 / Google Cloud Storage |
| Containerization | Docker |
- Python 3.10+
- NVIDIA GPU (RTX 3060 or higher recommended)
- CUDA 11.8+
- Node.js 18+ (for the React frontend)
- FFmpeg installed on the system

```bash
# Clone the repository
git clone https://github.com/yourusername/chronoview.git
cd chronoview

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install frontend dependencies
cd frontend
npm install
cd ..
```

```bash
# Copy the environment template
cp .env.example .env
```

Add your keys in `.env`:

```
OPENAI_WHISPER_MODEL=base
HUGGINGFACE_TOKEN=your_token_here
AWS_ACCESS_KEY=your_key_here
AWS_SECRET_KEY=your_secret_here
```

```bash
# Step 1: index a video
python pipeline/index_video.py --input your_video.mp4

# Step 2: start the backend
uvicorn app.main:app --reload --port 8000

# Step 3: start the frontend
cd frontend && npm run dev
```

Open http://localhost:3000 in your browser.
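With the backend up, the search endpoint can also be queried programmatically. A stdlib-only sketch (the helper names are hypothetical; the request and response shapes follow the API example later in this README):

```python
import json
import urllib.request

def build_search_request(query: str, video_id: str, top_k: int = 5,
                         base: str = "http://localhost:8000") -> urllib.request.Request:
    """Assemble the POST /api/search request."""
    payload = json.dumps({"query": query, "video_id": video_id, "top_k": top_k})
    return urllib.request.Request(
        f"{base}/api/search",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def best_timestamp(response: dict) -> str:
    """Pick the timestamp of the highest-confidence result."""
    return max(response["results"], key=lambda r: r["confidence"])["timestamp"]

# To actually send it (requires the backend from step 2 to be running):
#   with urllib.request.urlopen(build_search_request(
#           "explain gradient descent", "cs229_lecture4")) as resp:
#       print(best_timestamp(json.load(resp)))
```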
```
chronoview/
│
├── pipeline/
│   ├── ingest.py        # FFmpeg video segmentation
│   ├── transcribe.py    # Whisper ASR
│   ├── ocr.py           # Tesseract OCR
│   ├── vision.py        # ViT scene encoding
│   ├── fuse.py          # CLIP fusion model
│   └── index.py         # FAISS vector indexing
│
├── app/
│   ├── main.py          # FastAPI entry point
│   ├── search.py        # Query embedding + retrieval
│   ├── qa.py            # Direct Q&A generation
│   └── analytics.py     # Heatmap + usage analytics
│
├── frontend/
│   ├── src/
│   │   ├── pages/       # Home, Results, Library, Analytics
│   │   └── components/  # SearchBar, ResultCard, VideoPlayer
│   └── package.json
│
├── models/              # Saved model checkpoints
├── tests/               # Unit and integration tests
├── docker-compose.yml
├── requirements.txt
└── README.md
```
```http
POST /api/search
Content-Type: application/json

{
  "query": "explain gradient descent",
  "video_id": "cs229_lecture4",
  "top_k": 5
}
```

Response:

```json
{
  "query": "explain gradient descent",
  "ai_answer": "Gradient descent minimizes loss by...",
  "results": [
    {
      "timestamp": "14:32",
      "title": "Gradient descent intuition",
      "snippet": "...learning rate alpha controls step size...",
      "confidence": 0.97,
      "sources": ["audio", "slide"]
    }
  ]
}
```

| # | Paper | Venue |
|---|---|---|
| 1 | Radford et al. β CLIP (2021) | ICML 2021 |
| 2 | Radford et al. β Whisper (2022) | arXiv:2212.04356 |
| 3 | Dosovitskiy et al. β ViT (2020) | ICLR 2021 |
| 4 | Johnson et al. β FAISS (2019) | IEEE Trans. Big Data |
| 5 | Liu et al. β Video Moment Localization (2023) | ACM Computing Surveys |
| Sector | Use Case |
|---|---|
| Education | Students search lecture recordings by concept |
| Enterprise | Teams retrieve decisions from meeting archives |
| Research | Scientists index conference talks and webinars |
| Developers | Search coding tutorials for exact implementations |
| Accessibility | Semantic index for hearing-impaired users |
- Multimodal pipeline (Whisper + ViT + OCR)
- CLIP-based semantic fusion
- FAISS vector indexing
- FastAPI search endpoint
- React dashboard
- Cross-video search across entire libraries
- Highlight reel export
- Mobile app
- Enterprise SSO integration
- Fine-tuned domain-specific embedding model
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
```bash
# Run tests
pytest tests/

# Format code
black pipeline/ app/
```

MIT License: see LICENSE for details.
Built by Tanmay for a hackathon: "Skip to the Good Part."
ChronoView does for video what Google did for the web.