Local-first AI video transcription with speaker diarization, semantic search, and RAG-powered chat.
- Local AI Transcription - Faster Whisper runs on your machine, no API costs
- Speaker Diarization - Automatically identifies and labels different speakers
- Multi-format Support - MP4, MP3, WAV, WebM, MKV, and more
- Multi-language - Auto-detection and translation using MarianMT
- Semantic Search - Find content by meaning with vector embeddings
- Visual Search - CLIP-powered search by describing what you see
- Audio Analysis - Detect laughter, applause, music, and emotions
- RAG Chat - Ask questions about your video with context-aware answers
- Background Jobs - Queue large files for async processing
- Real-time Updates - Live progress via Supabase
- Share Links - Generate public links to share results
- Subtitle Export - WebVTT and SRT with translation support
- Multiple LLMs - Ollama (local), Groq, OpenAI, Anthropic, Grok
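To illustrate the Subtitle Export bullet: rendering timed transcript segments as SRT mostly comes down to timestamp formatting. A minimal, dependency-free sketch (the segment shape and the `to_srt` helper are hypothetical, not this project's actual API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render [{'start': float, 'end': float, 'text': str}] as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello."}]))
# → 1
#   00:00:00,000 --> 00:00:02,500
#   Hello.
```

WebVTT differs mainly in its `WEBVTT` header and `.` instead of `,` as the millisecond separator.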
| Layer | Technologies |
|---|---|
| Frontend | React 19, TypeScript, Vite, TailwindCSS, React Query |
| Backend | FastAPI, Faster Whisper, PyTorch, Pyannote, ChromaDB |
| Infrastructure | Supabase, Google Cloud (Run, Storage, Firestore), Netlify |
- Node.js 18+, Python 3.9+, FFmpeg
- HuggingFace token (for speaker diarization)
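A quick way to confirm the required tools are on your PATH before installing anything (a generic check, not a script shipped with this repo):

```shell
# Report whether each prerequisite is installed
for cmd in node python3 ffmpeg; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```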
Backend:
```bash
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # Edit with your settings
uvicorn main:app --reload --port 8000
```

Frontend:
```bash
cd frontend
npm install
npm run dev
```

```mermaid
flowchart TB
    subgraph Frontend["Frontend (React)"]
        UI[UI] --> API[API Client]
        API --> RT[Supabase Realtime]
    end
    subgraph Cloud["Cloud Services"]
        GCS[(GCS)]
        SB[(Supabase)]
        FS[(Firestore)]
    end
    subgraph Backend["Backend (FastAPI)"]
        TR[Transcription] --> WH[Whisper]
        SR[Speaker] --> PY[Pyannote]
        CR[Chat] --> VDB[(ChromaDB)]
        CR --> LLM[LLM Providers]
    end
    API --> TR & SR & CR
    RT <--> SB
    TR --> FS
    TR --> GCS
```
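The Chat → ChromaDB edge in the diagram is a standard embed-and-retrieve step: embed the question, rank stored transcript chunks by cosine similarity, and pass the top hits to the LLM as context. A dependency-free sketch with toy 3-dimensional vectors (in the real system an embedding model produces the vectors and ChromaDB maintains the index; the names and numbers here are illustrative only):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy index: three transcript chunks with made-up embeddings
index = [
    ("Speaker A greets the audience", [0.9, 0.1, 0.0]),
    ("Discussion of GPU pricing",     [0.0, 0.8, 0.6]),
    ("Closing remarks and thanks",    [0.7, 0.2, 0.1]),
]
context = top_k([1.0, 0.1, 0.0], index, k=2)
# context holds the two chunks closest in meaning to the query
```

The retrieved `context` strings would then be prepended to the LLM prompt so answers stay grounded in the video.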
| Issue | Solution |
|---|---|
| `No module named 'torch'` | Activate venv: `source venv/bin/activate` |
| FFmpeg not found | Install: `brew install ffmpeg` (macOS) or `apt install ffmpeg` |
| Speaker diarization fails | Check `HUGGINGFACE_TOKEN` and accept pyannote terms |
| Ollama connection error | Start Ollama: `ollama serve` |
| Large file upload fails | Enable GCS: `ENABLE_GCS_UPLOADS=true` |
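For the first two rows, you can confirm the diagnosis from Python without triggering the error again (a generic diagnostic sketch, not part of this repo):

```python
import importlib.util
import shutil

# find_spec returns None when the package is not importable in this environment
torch_ok = importlib.util.find_spec("torch") is not None
# shutil.which returns None when ffmpeg is not on PATH
ffmpeg_path = shutil.which("ffmpeg")

print(f"torch installed: {torch_ok}")
print(f"ffmpeg on PATH: {ffmpeg_path or 'NOT FOUND'}")
```

If either line reports a miss, apply the matching fix from the table above.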
- Configuration Guide - All environment variables
- API Reference - Complete endpoint documentation
- Architecture - Detailed system diagrams
- Speaker Diarization Setup
- Production Deployment
```
ai-subs/
├── frontend/          # React + TypeScript
│   ├── src/
│   │   ├── components/
│   │   ├── hooks/
│   │   ├── services/
│   │   └── types/
│   └── package.json
├── backend/           # FastAPI + ML
│   ├── routers/       # API endpoints
│   ├── services/      # Business logic
│   ├── models/        # Pydantic schemas
│   └── main.py
└── docs/              # Documentation
```
Contributions welcome! Please open issues or submit pull requests.
Faster Whisper | Pyannote | ChromaDB | Ollama | CLIP | PANNs