soorajaryan007/youtube-rag

🎥 YouTube RAG System (Metadata-Aware)

A production-oriented Retrieval-Augmented Generation (RAG) system that ingests YouTube videos, stores semantically searchable transcript chunks in a FAISS vector database, and answers user questions using an LLM with optional video-level filtering.

The project emphasizes sound RAG architecture: lazy initialization of the vector store and RAG chain, metadata-aware retrieval, and a clean separation of the ingestion, retrieval, and generation layers.


🚀 Key Capabilities

  • 📥 Ingest YouTube transcripts via API
  • ✂️ Robust text chunking with overlap
  • 🧠 Semantic embeddings using Sentence Transformers
  • 🗂️ Persistent FAISS vector storage
  • 🏷️ Chunk-level metadata (video_id, timestamps)
  • 🔍 Filtered retrieval per video or across all videos
  • 🤖 LLM-powered answers (Groq / LLaMA)
  • ⚙️ Lazy loading (safe startup with empty index)
  • 🔎 Vector database inspection & debugging utilities

🏗️ Architecture Overview

Ingestion Pipeline (Write Path)

YouTube Video ID
        ↓
YouTube Transcript API
(text + start + duration)
        ↓
Transcript Segments
        ↓
Text Chunking
        ↓
LangChain Documents
(page_content + metadata)
        ↓
Embedding Model
(Sentence Transformers)
        ↓
FAISS Vector Store
(vectors + metadata)
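
The chunking step of the write path can be sketched in a few lines. This is a minimal illustration, not the code in splitter.py: `Chunk`, `chunk_segments`, `max_chars`, and `overlap_segments` are hypothetical names, and applying the overlap at the segment level is an assumption. The segment keys (text, start, duration) follow the fields listed in the diagram above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    video_id: str
    start: float  # seconds, start of the first segment in the chunk
    end: float    # seconds, end of the last segment in the chunk


def _to_chunk(segments, video_id):
    """Merge a run of transcript segments into one chunk with metadata."""
    last = segments[-1]
    return Chunk(
        text=" ".join(s["text"].strip() for s in segments),
        video_id=video_id,
        start=segments[0]["start"],
        end=last["start"] + last["duration"],
    )


def chunk_segments(segments, video_id, max_chars=500, overlap_segments=1):
    """Group segments into chunks of roughly max_chars characters.

    The last `overlap_segments` segments of each chunk are repeated at the
    start of the next one, so context is not cut at a chunk boundary.
    """
    chunks, current, fresh = [], [], 0
    for seg in segments:
        current.append(seg)
        fresh += 1
        if sum(len(s["text"]) for s in current) >= max_chars:
            chunks.append(_to_chunk(current, video_id))
            current = current[-overlap_segments:] if overlap_segments > 0 else []
            fresh = 0
    if fresh:  # flush the tail only if it holds segments not yet emitted
        chunks.append(_to_chunk(current, video_id))
    return chunks
```

Carrying `start` and `end` on every chunk is what makes the timestamp metadata described later possible.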

Query Pipeline (Read Path)

User Question (+ optional video_id)
        ↓
Query Embedding
        ↓
FAISS Similarity Search
(global or filtered)
        ↓
Relevant Chunks
        ↓
RAG Chain
(context + question)
        ↓
LLM
        ↓
Final Answer
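
To make the "global or filtered" step concrete, here is a small brute-force sketch in pure Python. It stands in for FAISS rather than reproducing it: the `search` function, the entry layout (vector, metadata, text), and the toy vectors are assumptions for illustration. If the retriever is built on LangChain's FAISS wrapper, the `filter` argument of its `similarity_search` method plays the role of the `video_id` check below.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, k=3, video_id=None):
    """Return the k entries most similar to query_vec.

    When video_id is given, only chunks from that video are candidates;
    otherwise the search runs across all ingested videos.
    """
    candidates = [
        entry for entry in index
        if video_id is None or entry["metadata"]["video_id"] == video_id
    ]
    ranked = sorted(
        candidates,
        key=lambda entry: cosine(query_vec, entry["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy index with two videos (hypothetical data).
index = [
    {"vector": [1.0, 0.0], "metadata": {"video_id": "a"}, "text": "TLS handshake"},
    {"vector": [0.9, 0.1], "metadata": {"video_id": "b"}, "text": "HTTPS basics"},
    {"vector": [0.0, 1.0], "metadata": {"video_id": "b"}, "text": "Unrelated intro"},
]

top_global = search(index, [1.0, 0.0], k=1)
top_filtered = search(index, [1.0, 0.0], k=1, video_id="b")
```

Here `top_global` comes from video "a", while `top_filtered` returns the best match within video "b" only.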

🧠 Metadata Model

Each stored chunk includes structured metadata:

{
  "video_id": "abc123",
  "start": 120.5,
  "end": 134.8
}

This enables video-specific querying today and lays the groundwork for features that are not yet implemented:

  • Source attribution
  • Timestamp-based answers
  • Clean deletion or re-indexing per video
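
As an illustration of what the timestamp fields make possible, the sketch below turns a chunk's metadata into a deep link and a readable time span. Source citations are not part of the current system, so `timestamp_url` and `format_span` are hypothetical helpers; only the metadata keys come from the model above.

```python
def timestamp_url(metadata):
    """Build a YouTube link that starts playback at the chunk's start time."""
    seconds = int(metadata["start"])
    return f"https://www.youtube.com/watch?v={metadata['video_id']}&t={seconds}s"


def format_span(metadata):
    """Render the chunk's time span as MM:SS-MM:SS."""
    def mmss(t):
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes:02d}:{seconds:02d}"

    return f"{mmss(metadata['start'])}-{mmss(metadata['end'])}"


meta = {"video_id": "abc123", "start": 120.5, "end": 134.8}
print(timestamp_url(meta))  # https://www.youtube.com/watch?v=abc123&t=120s
print(format_span(meta))    # 02:00-02:14
```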

📂 Project Structure

youtube-rag/
├── app/
│   ├── ingestion/
│   │   ├── youtube_loader.py
│   │   ├── splitter.py
│   │   └── embed_store.py
│   ├── retrieval/
│   │   └── langchain_retriever.py
│   ├── chains/
│   │   ├── rag_chain.py
│   │   └── prompts.py
│   ├── schemas/
│   │   ├── ingest.py
│   │   └── query.py
│   └── main.py
├── vectorstore/
│   └── faiss_index/
│       ├── index.faiss
│       └── index.pkl
├── inspect_faiss.py
├── config.py
├── requirements.txt
└── README.md

⚙️ Setup & Installation

1️⃣ Create and activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

2️⃣ Install dependencies

pip install -r requirements.txt

3️⃣ Configure environment variables

export GROQ_API_KEY=your_groq_api_key
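
A missing key is easier to diagnose at startup than in the middle of a request. The following is a minimal sketch of such a check, assuming the application reads the variable from the process environment; `get_groq_api_key` is a hypothetical helper and is not taken from config.py.

```python
import os


def get_groq_api_key(env=None):
    """Return GROQ_API_KEY, or raise a clear error when it is missing."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; export it before starting the service."
        )
    return key
```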

▶️ Running the Service

uvicorn app.main:app --reload

API documentation is available at:

http://127.0.0.1:8000/docs

📥 Ingest a YouTube Video

POST /ingest

{
  "video_id": "aMARZGTbULc"
}

Ingestion steps:

  • Fetch transcript
  • Chunk text
  • Generate embeddings
  • Persist vectors with metadata in FAISS

❓ Query the System

Search across all ingested videos

POST /ask

{
  "question": "How does HTTPS work?"
}

Restrict search to a specific video

POST /ask

{
  "question": "Explain the TLS handshake",
  "video_id": "aMARZGTbULc"
}
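
The two request shapes above differ only in the optional video_id field. The following standard-library client is a sketch of how they might be sent. It assumes the service is running at the local address shown earlier; the response format is not documented here, so `ask` simply returns the parsed JSON. `build_ask_request` and `ask` are hypothetical names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address from this README


def build_ask_request(question, video_id=None):
    """Build a POST /ask request; video_id is included only when given."""
    payload = {"question": question}
    if video_id is not None:
        payload["video_id"] = video_id
    return urllib.request.Request(
        f"{BASE_URL}/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, video_id=None):
    """Send the question and return the decoded JSON response."""
    with urllib.request.urlopen(build_ask_request(question, video_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `ask("How does HTTPS work?")` searches every ingested video, while `ask("Explain the TLS handshake", video_id="aMARZGTbULc")` restricts retrieval to that video.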

🔎 Inspecting the Vector Database

Use the inspection utility:

python inspect_faiss.py

This allows you to:

  • Verify stored chunks
  • Inspect metadata
  • Debug retrieval quality
  • Understand what context the LLM receives
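
The kind of check such a utility performs can be sketched without touching FAISS. The example below summarizes chunk metadata per video. It does not reproduce inspect_faiss.py: `summarize_chunks` is a hypothetical helper, and the input is assumed to be a list of dicts carrying the metadata fields described earlier.

```python
def summarize_chunks(chunks):
    """Return {video_id: {"chunks": n, "start": earliest, "end": latest}}."""
    summary = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        entry = summary.setdefault(
            meta["video_id"],
            {"chunks": 0, "start": meta["start"], "end": meta["end"]},
        )
        entry["chunks"] += 1
        entry["start"] = min(entry["start"], meta["start"])
        entry["end"] = max(entry["end"], meta["end"])
    return summary
```

A video with far fewer chunks than expected, or a time span that stops early, usually points at a truncated transcript or a chunking problem rather than a retrieval problem.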

🧠 Design Principles

  • Lazy initialization of the vector store and RAG chain, so the service starts safely with an empty index
  • Stateless application startup
  • Clear separation of ingestion, retrieval, and generation
  • Metadata-first retrieval design
  • Production-safe lifecycle handling
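
Lazy initialization can be sketched as a small wrapper that builds a resource on first use and retries while nothing is available. This is an assumption about the general pattern, not the code in this repository: `LazyResource` is a hypothetical class, and the loader is expected to return None when no index exists on disk yet.

```python
import threading


class LazyResource:
    """Create a resource on first access instead of at application startup."""

    def __init__(self, loader):
        self._loader = loader  # callable returning the resource, or None
        self._value = None
        self._lock = threading.Lock()

    def get(self):
        # A None result is not cached, so an empty index is retried on the
        # next call, for example after the first video has been ingested.
        if self._value is None:
            with self._lock:
                if self._value is None:
                    self._value = self._loader()
        return self._value

    def reset(self):
        """Drop the cached resource so the next get() reloads it."""
        with self._lock:
            self._value = None
```

With this shape, startup never touches the index, a query against an empty store can return a clear "nothing ingested yet" response, and ingestion can call `reset()` so the next query picks up the new vectors.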

🚧 Current Limitations

  • No source citations in responses
  • No conversational memory
  • No per-video deletion endpoint
  • API-only (no frontend)

🔮 Planned Enhancements

  • 📌 Source citations with timestamps
  • 🧹 Delete or reindex individual videos
  • 💬 Conversational RAG
  • 📊 Video-level relevance ranking
  • 🖥️ Frontend interface

📌 One-Line Summary

A metadata-aware YouTube RAG system that ingests transcripts, stores semantically searchable chunks in FAISS, and answers questions using filtered retrieval and an LLM.

