A production-oriented Retrieval-Augmented Generation (RAG) system that ingests YouTube videos, stores semantically searchable transcript chunks in a FAISS vector database, and answers user questions using an LLM with optional video-level filtering.
This project focuses on correct RAG architecture, including lazy lifecycle handling, metadata-aware retrieval, and clean separation of ingestion, retrieval, and generation layers.
Features:

- 📥 Ingest YouTube transcripts via API
- ✂️ Robust text chunking with overlap
- 🧠 Semantic embeddings using Sentence Transformers
- 🗂️ Persistent FAISS vector storage
- 🏷️ Chunk-level metadata (video_id, timestamps)
- 🔍 Filtered retrieval per video or across all videos
- 🤖 LLM-powered answers (Groq / LLaMA)
- ⚙️ Lazy loading (safe startup with empty index)
- 🔎 Vector database inspection & debugging utilities
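The overlapping chunking idea can be sketched in plain Python. This is a minimal illustration of the technique, not the project's actual `splitter.py` (which may use a LangChain text splitter with different defaults):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks that overlap, so content spanning
    a chunk boundary is fully contained in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already covers the end of the text
    return chunks
```

The overlap trades a little storage for retrieval quality: a sentence cut in half at a boundary still appears intact in the neighboring chunk.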
Ingestion pipeline:

```
YouTube Video ID
        ↓
YouTube Transcript API (text + start + duration)
        ↓
Transcript Segments
        ↓
Text Chunking
        ↓
LangChain Documents (page_content + metadata)
        ↓
Embedding Model (Sentence Transformers)
        ↓
FAISS Vector Store (vectors + metadata)
```
Query pipeline:

```
User Question (+ optional video_id)
        ↓
Query Embedding
        ↓
FAISS Similarity Search (global or filtered)
        ↓
Relevant Chunks
        ↓
RAG Chain (context + question)
        ↓
LLM
        ↓
Final Answer
```
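The "RAG Chain (context + question)" step can be illustrated with a minimal prompt-assembly function. This is a sketch only; the real `chains/rag_chain.py` and `prompts.py` may format the context and instructions differently:

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Join retrieved chunks into a context block and wrap it, together
    with the user question, in an instruction-style prompt for the LLM.
    Each chunk is a dict with 'video_id', 'start', and 'text' keys,
    mirroring the stored chunk metadata."""
    context = "\n\n".join(
        f"[{c['video_id']} @ {c['start']:.1f}s] {c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Prefixing each chunk with its `video_id` and timestamp is what makes future source citations cheap: the attribution data is already in the prompt.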
Each stored chunk includes structured metadata:
```json
{
  "video_id": "abc123",
  "start": 120.5,
  "end": 134.8
}
```

This enables:
- Video-specific querying
- Source attribution (future-ready)
- Timestamp-based answers
- Clean deletion or re-indexing per video
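FAISS itself has no native metadata filter, so video-specific querying is typically done by filtering on the stored metadata around the similarity search. A dependency-free sketch of that idea (the project presumably uses LangChain's FAISS wrapper rather than anything like this directly):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, store, k=2, video_id=None):
    """store: list of (vector, metadata) pairs. Return the metadata of the
    top-k most similar chunks, optionally restricted to one video via the
    video_id metadata field."""
    candidates = [
        (vec, meta) for vec, meta in store
        if video_id is None or meta["video_id"] == video_id
    ]
    ranked = sorted(candidates, key=lambda p: cosine(query_vec, p[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]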
Project structure:

```
youtube-rag/
├── app/
│   ├── ingestion/
│   │   ├── youtube_loader.py
│   │   ├── splitter.py
│   │   └── embed_store.py
│   ├── retrieval/
│   │   └── langchain_retriever.py
│   ├── chains/
│   │   ├── rag_chain.py
│   │   └── prompts.py
│   ├── schemas/
│   │   ├── ingest.py
│   │   └── query.py
│   └── main.py
├── vectorstore/
│   └── faiss_index/
│       ├── index.faiss
│       └── index.pkl
├── inspect_faiss.py
├── config.py
├── requirements.txt
└── README.md
```
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export GROQ_API_KEY=your_groq_api_key
uvicorn app.main:app --reload
```

API documentation is available at:

http://127.0.0.1:8000/docs
```
POST /ingest
{
  "video_id": "aMARZGTbULc"
}
```

Ingestion steps:
- Fetch transcript
- Chunk text
- Generate embeddings
- Persist vectors with metadata in FAISS
Ask across all videos:

```
POST /ask
{
  "question": "How does HTTPS work?"
}
```

Ask within a single video by adding the optional `video_id`:

```
POST /ask
{
  "question": "Explain the TLS handshake",
  "video_id": "aMARZGTbULc"
}
```

Use the inspection utility:
```bash
python inspect_faiss.py
```

This allows you to:
- Verify stored chunks
- Inspect metadata
- Debug retrieval quality
- Understand what context the LLM receives
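The spirit of such an inspection tool can be shown with a small dump function over an in-memory docstore. This is a hypothetical stand-in; the actual output format of `inspect_faiss.py` is not shown here:

```python
def dump_index(docstore: dict, limit: int = 5) -> list[str]:
    """Render the first few stored chunks as one-line summaries:
    doc id, metadata, and a text preview. docstore maps doc id to a
    dict with 'metadata' and 'text' keys."""
    lines = []
    for i, (doc_id, doc) in enumerate(docstore.items()):
        if i >= limit:
            break
        lines.append(f"{doc_id}: {doc['metadata']} | {doc['text'][:60]}")
    return lines
```

Eyeballing a dump like this before debugging retrieval quality often catches the real problem early: wrong chunk sizes, missing metadata, or an empty index.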
Design highlights:

- Lazy initialization of vector store and RAG chain
- Stateless application startup
- Clear separation of concerns
- Metadata-first retrieval design
- Production-safe lifecycle handling
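The lazy-initialization pattern above can be sketched with a small handle class. A minimal sketch, assuming a `_load` step that stands in for loading the FAISS index from disk (the real app would call the vector store's load routine there):

```python
class VectorStoreHandle:
    """Defer loading the index until first use, so the app can start
    cleanly even when vectorstore/faiss_index/ does not exist yet."""

    def __init__(self, path: str):
        self.path = path
        self._store = None  # nothing loaded at startup

    @property
    def store(self):
        if self._store is None:
            self._store = self._load()  # expensive; runs at most once
        return self._store

    def _load(self):
        # Placeholder for the real index-loading call; returns a stub
        # so the pattern itself is testable without FAISS installed.
        return {"loaded_from": self.path}
```

Because nothing touches disk until the first query, startup is stateless: a fresh deployment with an empty `vectorstore/` directory boots without error.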
Current limitations:

- No source citations in responses
- No conversational memory
- No per-video deletion endpoint
- API-only (no frontend)
Roadmap:

- 📌 Source citations with timestamps
- 🧹 Delete or reindex individual videos
- 💬 Conversational RAG
- 📊 Video-level relevance ranking
- 🖥️ Frontend interface
A metadata-aware YouTube RAG system that ingests transcripts, stores semantically searchable chunks in FAISS, and answers questions using filtered retrieval and an LLM.