# Snip.AI

> Transcribe, summarize, and chat with any YouTube video — powered by AI.

Snip.AI is a cross-platform Flutter application that lets you paste a YouTube link and instantly get a full transcription, an AI-generated summary, and an interactive chat interface for asking questions about the video's content. Under the hood, it combines faster-whisper for speech recognition, Google Gemini for summarization and chat, and a vector database (Qdrant) for Retrieval-Augmented Generation (RAG).
## Table of Contents

- Features
- Architecture
- Tech Stack
- Prerequisites
- Installation
- Running the App
- API Reference
- Project Structure
- How It Works
- Configuration
- Known Limitations
- Contributing
## Features

| Feature | Description |
|---|---|
| 🎙️ Transcription | Downloads audio via yt-dlp and transcribes it with faster-whisper (Tiny model, INT8 quantized for speed) |
| 📝 AI Summarization | Sends the transcript through Google Gemini 2.0 Flash to produce structured, detailed summaries with headings and bullet points |
| 💬 RAG Chat | Ask any question about the video; the system retrieves relevant transcript chunks from Qdrant and feeds them to Gemini for grounded answers |
| 📌 Source Citations | Chat responses include timestamped source segments from the original transcript so you can verify answers |
| 🖥️ Cross-Platform UI | Built with Flutter — runs on Android, iOS, macOS, Linux, and Windows |
| ⚡ Async Processing | Vector DB storage happens in background tasks so the UI is never blocked |
## Architecture

```
┌──────────────────────────────────────────────┐
│               Flutter Frontend               │
│  ┌──────────┐          ┌───────────────────┐ │
│  │ Home Page│          │   Results Page    │ │
│  │  (URL    │─────────▶│  ┌─────────────┐  │ │
│  │  Input)  │          │  │   Summary   │  │ │
│  └──────────┘          │  ├─────────────┤  │ │
│                        │  │  RAG Chat   │  │ │
│                        │  └─────────────┘  │ │
│                        └───────────────────┘ │
└────────────────────┬─────────────────────────┘
                     │ HTTP (REST)
                     ▼
┌──────────────────────────────────────────────┐
│          FastAPI Backend (Python)            │
│                                              │
│  POST /transcribe        POST /chat          │
│        │                     │               │
│        ▼                     ▼               │
│  yt-dlp → MP3         Query Embedding        │
│        │                     │               │
│        ▼                     ▼               │
│  faster-whisper       Qdrant Vector DB       │
│  (Whisper Tiny)              │               │
│        │                     ▼               │
│        ▼                Gemini API           │
│   Gemini API            (Answer)             │
│   (Summary)                                  │
│        │                                     │
│        ▼                                     │
│   Qdrant (Store chunks + summary)            │
└──────────────────────────────────────────────┘
```
### Data Flow

1. User pastes a YouTube URL → Flutter sends `POST /transcribe`
2. Backend downloads audio with `yt-dlp` and transcribes it with `faster-whisper`
3. The transcript is chunked into ~30-second segments
4. Long transcripts are split and summarized in parallel chunks via Gemini
5. Chunks and summary are embedded with `sentence-transformers` and stored in Qdrant
6. Flutter renders the summary using `flutter_markdown`
7. User asks a question → Flutter sends `POST /chat`
8. Backend embeds the query, searches Qdrant for relevant chunks, and calls Gemini with the retrieved context
9. Gemini returns a grounded answer with source references
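The chunking in step 3 can be sketched as a small helper. This is an illustrative sketch, not the actual code in `server/main.py`; it assumes Whisper segments arrive as `(start, end, text)` tuples:

```python
def chunk_transcript(segments, window=30.0):
    """Group timed Whisper segments into ~window-second chunks for RAG."""
    chunks, texts, chunk_start, chunk_end = [], [], None, None
    for start, end, text in segments:
        if chunk_start is None:
            chunk_start = start
        texts.append(text)
        chunk_end = end
        # Close the chunk once it spans at least `window` seconds
        if chunk_end - chunk_start >= window:
            chunks.append({"text": " ".join(texts),
                           "start_time": chunk_start,
                           "end_time": chunk_end})
            texts, chunk_start, chunk_end = [], None, None
    if texts:  # flush any trailing partial chunk
        chunks.append({"text": " ".join(texts),
                       "start_time": chunk_start,
                       "end_time": chunk_end})
    return chunks

# Seven 10-second segments produce three chunks: 0-30 s, 30-60 s, 60-70 s
segments = [(i * 10.0, (i + 1) * 10.0, f"segment {i}") for i in range(7)]
print(len(chunk_transcript(segments)))  # → 3
```

Keeping `start_time`/`end_time` on each chunk is what later lets chat answers cite timestamped source segments.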
## Tech Stack

### Frontend (Flutter)

| Library | Version | Purpose |
|---|---|---|
| Flutter | SDK | Cross-platform UI framework |
| `google_fonts` | `^6.2.1` | Montserrat & Poppins typography |
| `flutter_markdown` | `^0.6.18` | Rendering AI-generated markdown summaries |
| `http` | `^1.4.0` | HTTP client for API calls |
| `cupertino_icons` | `^1.0.8` | iOS-style icons |

### Backend (Python)

| Library | Purpose |
|---|---|
| FastAPI | Async REST API framework |
| `faster-whisper` | Optimized Whisper transcription (INT8) |
| `yt-dlp` | YouTube audio/video downloading |
| `sentence-transformers` | Text embedding (all-MiniLM-L6-v2, 384-dim) |
| `qdrant-client` | Vector database client (in-memory mode) |
| `httpx` | Async HTTP client for Gemini API calls |
| Google Gemini 2.0 Flash | Summarization and RAG-powered chat |
| `numpy` | Fallback random embeddings |
## Prerequisites

- Flutter SDK ≥ 3.27.0
- Python 3.8+ with `pip`
- Node.js (optional, for tooling)
- yt-dlp — installed separately or via pip: `pip install yt-dlp`
- ffmpeg — required by `yt-dlp` for audio conversion:

  ```bash
  # macOS
  brew install ffmpeg
  # Ubuntu/Debian
  sudo apt install ffmpeg
  # Windows
  winget install ffmpeg
  ```

- A valid Google Gemini API key (free tier available at Google AI Studio)
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/shivansh00011/Snip.AI.git
   cd Snip.AI
   ```

2. Install Flutter dependencies:

   ```bash
   flutter pub get
   ```

3. Install Python dependencies:

   ```bash
   pip install -r server/requirements.txt
   ```

   > Tip: It is recommended to use a virtual environment:
   >
   > ```bash
   > python -m venv venv
   > source venv/bin/activate   # Windows: venv\Scripts\activate
   > pip install -r server/requirements.txt
   > ```

4. Add your Gemini API key. Open `server/main.py` and replace the placeholder with your key:

   ```python
   GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"
   ```

   > Security note: For production use, set this as an environment variable and load it with `os.environ.get("GEMINI_API_KEY")` instead of hardcoding it.
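The security note above could be implemented with a small helper along these lines. This is a sketch, not the current code in `server/main.py`, and the function name is illustrative:

```python
import os

def load_gemini_key() -> str:
    """Read the Gemini API key from the environment instead of source code."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before starting the server"
        )
    return key
```

With this in place, the key never lands in version control; you would start the server with `GEMINI_API_KEY` exported in the shell.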
## Running the App

### 1. Start the backend

```bash
uvicorn server.main:app --reload --host 127.0.0.1 --port 8000
```

You should see startup logs confirming that the embedding model, Whisper model, and Qdrant are initialized:

```
INFO: Loading embedding model...
INFO: Embedding model loaded successfully
INFO: Loading Whisper model...
INFO: Whisper model loaded successfully
INFO: Connecting to Qdrant...
INFO: Qdrant initialized successfully
```

Visit http://127.0.0.1:8000 to verify the server is running. The health endpoint returns service status.

### 2. Start the Flutter app

```bash
flutter run
```

Select your target platform when prompted (Chrome, macOS, Android emulator, etc.). For a specific platform:

```bash
flutter run -d macos     # macOS desktop
flutter run -d chrome    # Web browser
flutter run -d linux     # Linux desktop
flutter run -d windows   # Windows desktop
```

## API Reference

### `GET /`

Health check — returns service availability status.
Response:

```json
{
  "message": "Server is up and running 🔥",
  "services": {
    "embedding_model": "Available",
    "whisper_model": "Available",
    "vector_db": "Available"
  }
}
```

### `POST /transcribe`

Downloads audio from a YouTube URL, transcribes it, generates a summary, and stores everything in the vector DB.
Request body:

```json
{
  "youtube_url": "https://www.youtube.com/watch?v=..."
}
```

Response:

```json
{
  "transcript": "Full transcript text...",
  "summary": "## AI Summary\n\n...",
  "session_id": "uuid-v4",
  "chunks_count": 24,
  "metadata": {
    "title": "Video Title",
    "duration": 3600.0,
    "video_url": "https://...",
    "transcript_id": "uuid-v4"
  }
}
```

> Note: The `session_id`/`transcript_id` must be saved on the client to enable chat for this video.
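A client carries the `session_id` from this response into subsequent chat calls. A minimal stdlib sketch of that contract (the helper names are illustrative, not part of the app):

```python
import json
import urllib.request

def build_chat_payload(transcribe_response: dict, query: str) -> bytes:
    """Link a chat query to its stored transcript via session_id."""
    body = {"query": query, "session_id": transcribe_response["session_id"]}
    return json.dumps(body).encode("utf-8")

def ask(base_url: str, payload: bytes) -> dict:
    """POST the payload to /chat and decode the JSON answer."""
    req = urllib.request.Request(
        f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `ask("http://127.0.0.1:8000", build_chat_payload(resp_json, "What is the video about?"))`, where `resp_json` is the parsed `/transcribe` response.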
### `POST /chat`

Queries the transcribed content using RAG (Retrieval-Augmented Generation).
Request body:
{
"query": "What did the speaker say about climate change?",
"session_id": "uuid-v4"
}Response:
{
"answer": "According to the transcript, the speaker discussed...",
"context": {
"chunks": [
{
"text": "...relevant segment...",
"start_time": 142.5,
"end_time": 172.5,
"score": 0.89,
"chunk_index": 4,
"type": "chunk"
}
],
"summary": {
"text": "## Summary...",
"type": "summary",
"score": 1.0
}
}
}Check whether transcript chunks have been stored in the vector DB for a given session.
Response:

```json
{
  "status": "completed",
  "message": "Transcript processing complete",
  "chunks_available": true,
  "chunks_count": 48
}
```

## Project Structure

```
Snip.AI/
├── lib/
│   ├── main.dart             # App entry point & routing
│   └── pages and logic/
│       ├── home.dart         # URL input & feature showcase UI
│       └── results.dart      # Summary view + RAG chat UI
│
├── server/
│   └── main.py               # FastAPI backend (all logic)
│
├── android/                  # Android platform files
├── ios/                      # iOS platform files
├── macos/                    # macOS platform files
├── linux/                    # Linux platform files
├── windows/                  # Windows platform files
├── web/                      # Web platform files
│
├── pubspec.yaml              # Flutter dependencies
└── README.md
```
## How It Works

### 1. Transcription

- `yt-dlp` downloads the YouTube video audio as MP3 (max 5-minute timeout)
- `faster-whisper` (Tiny model with INT8 quantization) transcribes the audio into timed segments
- Segments are chunked into ~30-second windows to preserve temporal context

### 2. Summarization

- Transcripts under ~4,000 words are sent to Gemini in a single request
- Longer transcripts are split into 3,000-word chunks, each summarized independently, then a final synthesis pass combines them
- The system retries up to 3 times per chunk on failure, with a fallback using the first and last segments

### 3. RAG Chat

- User queries are embedded using `sentence-transformers` (all-MiniLM-L6-v2)
- Qdrant performs cosine similarity search filtered by `session_id` to retrieve the top 5 relevant chunks
- The video summary (stored as a special point) is always prepended to the context
- The assembled context + user query is sent to Gemini with an instruction prompt to provide grounded, citation-backed answers

### 4. Vector Storage

- Each ~30-second transcript chunk becomes a Qdrant point with metadata: `session_id`, `start_time`, `end_time`, `chunk_index`
- The full summary is stored as a separate point with `type: "summary"` for fast retrieval
- Storage happens asynchronously via FastAPI `BackgroundTasks` so the API response is not delayed
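The similarity search in the RAG step boils down to cosine scoring over stored embedding vectors. Qdrant does this internally; the following pure-Python sketch (with an illustrative `top_k` helper, not the app's code) shows the underlying idea:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, points, k=5):
    """Return the k stored points most similar to the query embedding."""
    scored = [(cosine(query_vec, vec), payload) for vec, payload in points]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy 2-dimensional example (the real embeddings are 384-dimensional)
points = [([1.0, 0.0], "chunk A"), ([0.0, 1.0], "chunk B"), ([0.7, 0.7], "chunk C")]
best = top_k([1.0, 0.0], points, k=2)
print([payload for _, payload in best])  # → ['chunk A', 'chunk C']
```

In the actual pipeline the candidate points are additionally filtered by `session_id` before scoring, so chat answers only draw on the current video's transcript.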
## Configuration

| Setting | Location | Default | Description |
|---|---|---|---|
| Gemini API key | `server/main.py` | `""` | **Required** — get from Google AI Studio |
| Whisper model size | `server/main.py` | `"tiny"` | Options: `tiny`, `base`, `small`, `medium`, `large` |
| Whisper compute type | `server/main.py` | `"int8"` | Options: `int8`, `float16`, `float32` |
| Chunk size (seconds) | `server/main.py` | `30.0` | Length of transcript segments for RAG |
| Embedding model | `server/main.py` | `all-MiniLM-L6-v2` | 384-dim sentence transformer |
| Embedding dimensions | `server/main.py` | `384` | Must match the embedding model's output |
| Qdrant mode | `server/main.py` | In-memory | Change to `QdrantClient(host="localhost", port=6333)` for persistence |
| Backend URL | `lib/pages and logic/home.dart` | `http://127.0.0.1:8000` | Update for production deployments |
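Switching Qdrant from the default in-memory mode to a persistent instance changes only the client construction in `server/main.py`, along these lines (a config sketch; assumes a Qdrant server is running on `localhost:6333`):

```python
from qdrant_client import QdrantClient

# Default: in-memory mode, all stored transcripts lost on restart
client = QdrantClient(":memory:")

# Persistent: connect to a running Qdrant server instead
client = QdrantClient(host="localhost", port=6333)
```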
## Known Limitations

- **In-memory vector DB**: Qdrant runs in-memory by default, meaning all stored transcripts are lost when the server restarts. Switch to a persistent Qdrant instance for production use.
- **Whisper Tiny accuracy**: The `tiny` model is fast but may struggle with heavy accents, technical jargon, or low-quality audio. Use `small` or `medium` for better accuracy at the cost of speed.
- **Gemini API rate limits**: Free-tier Gemini accounts have rate and quota limits that may cause failures on very long videos.
- **API key exposure**: The Gemini API key is currently hardcoded in `server/main.py`. Move it to an environment variable before deploying.
- **No authentication**: The backend has no user authentication. Any client with network access can submit URLs and use your Gemini quota.
- **Backend URL hardcoded**: The Flutter app points to `http://127.0.0.1:8000`. This needs to be configurable for multi-device or production use.
- **Private/restricted videos**: `yt-dlp` cannot download private, age-restricted, or DRM-protected content.
## Contributing

Contributions are welcome! Here's how to get started:

1. Fork this repository
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Make your changes and test them
4. Commit with a clear message: `git commit -m "Add: my feature description"`
5. Push to your fork: `git push origin feature/my-feature`
6. Open a Pull Request against `main`
Ideas for contributions:

- Persistent Qdrant storage with session management
- Support for non-YouTube video sources (direct MP4 URLs, Vimeo, etc.)
- Export transcript/summary as PDF or Markdown
- Dark/light theme toggle in the Flutter UI
- Environment variable support for backend configuration
- Docker Compose setup for one-command deployment
- User authentication and per-user transcript history