TubeAtlas

A platform for large-scale YouTube content analysis. It turns unstructured video transcripts into structured knowledge graphs and supports retrieval-augmented querying over them.

📖 Overview

TubeAtlas automates the pipeline from content ingestion to insight generation, so researchers, creators, and analysts can explore and query video content at scale.

The core of TubeAtlas is a Retrieval-Augmented Generation (RAG) pipeline that chunks and embeds transcripts, extracts knowledge graphs from them, and supports natural-language questions over the content of entire YouTube channels, including channels whose transcripts run to millions of tokens.

🏗️ Architecture

The system separates concerns into distinct layers: input, asynchronous processing, data persistence, and an API and query layer. Long-running, resource-intensive tasks run asynchronously in Celery workers (with Redis as the broker) rather than inside API requests.
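
To illustrate the asynchronous pattern described above, here is a minimal sketch that queues a job, returns a task id at once, and lets the caller poll for the result. It is not the project's code: a standard-library `ThreadPoolExecutor` stands in for Celery and Redis, and `process_transcript`, `submit`, `status`, and `result` are hypothetical names chosen for the example.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for a Celery worker pool; the real project uses Celery with Redis.
_executor = ThreadPoolExecutor(max_workers=2)
_tasks: dict[str, Future] = {}


def process_transcript(video_url: str) -> str:
    """Hypothetical long-running job (download and process a transcript)."""
    time.sleep(0.1)  # simulate slow work
    return f"processed {video_url}"


def submit(video_url: str) -> str:
    """Queue the job and return a task id immediately, without waiting."""
    task_id = str(uuid.uuid4())
    _tasks[task_id] = _executor.submit(process_transcript, video_url)
    return task_id


def status(task_id: str) -> str:
    """Report whether a queued job is still running or has finished."""
    return "done" if _tasks[task_id].done() else "pending"


def result(task_id: str) -> str:
    """Block until the job finishes and return its result."""
    return _tasks[task_id].result()
```

An API handler built this way would call `submit` and respond with the task id, which is why the request returns quickly even when the job itself takes minutes.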

graph TD
    subgraph "Input Layer"
        A["YouTube Channels/Videos"] --> B("YouTube Service")
    end

    subgraph "Processing Layer (Async)"
        B --> C{"Celery + Redis"}
        C --> D["Transcript Service"]
        D --> E["RAG Pipeline"]
    end

    subgraph "RAG Pipeline"
        direction LR
        E_Chunk["Chunking<br/>(Semantic, Fixed)"]
        E_Embed["Embedding<br/>(OpenAI)"]
        E_KG["Graph Extraction<br/>(LLM)"]
        E_Store["Vector Store<br/>(FAISS)"]
        E_Chunk --> E_Embed --> E_Store
        E_Chunk --> E_KG
    end

    subgraph "Data Persistence"
        E --> F["SQLite Database<br/>(Metadata, KGs)"]
        E --> G["FAISS Vector Store<br/>(Embeddings)"]
    end

    subgraph "API & Query Layer"
        H["User"] --> I("FastAPI")
        I --> J["Chat & KG Services"]
        J --> E
        J --> F
        J --> G
        J --> H
    end

    linkStyle 8 stroke-width:2px,stroke:green,stroke-dasharray: 5 5;
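
The chunk, embed, store, and query flow in the diagram can be sketched with the standard library alone. This is an illustration under loud assumptions: word counts stand in for OpenAI embeddings, a brute-force list stands in for FAISS, and `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, not part of TubeAtlas.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for an OpenAI embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Brute-force in-memory index (stand-in for FAISS)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        """Embed a chunk and keep it alongside its vector."""
        self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
for chunk in [
    "FAISS stores embeddings for similarity search",
    "Celery runs background tasks with Redis as the broker",
    "FastAPI serves the HTTP endpoints",
]:
    store.add(chunk)

top = store.search("which component runs background tasks", k=1)
```

In a RAG setup, the chunks returned by `search` would be placed in the LLM prompt as context for answering the user's question.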

✨ Core Features

Automated Content Ingestion: Seamlessly fetches video metadata and transcripts for individual videos or entire YouTube channels.
Advanced RAG Pipeline: Combines semantic search (via embeddings) with knowledge graph traversal to provide context-aware answers. Full KG-based retrieval is still listed on the roadmap.
Intelligent Chunking: Splits transcripts with fixed-size and semantic chunking strategies so that each piece fits the context limits of Large Language Models (LLMs).
High-Performance Vector Storage: Utilizes FAISS (Facebook AI Similarity Search) for efficient storage and retrieval of text embeddings, forming the backbone of the semantic search capability.
Knowledge Graph Generation: Leverages LLMs to extract structured entities and their relationships from unstructured text, building a comprehensive knowledge graph of the content.
Asynchronous Task Processing: Uses Celery with Redis to run long-running tasks such as transcript downloading and knowledge graph creation in the background, so the API remains responsive.
Robust & Modern Backend: Built with FastAPI for high-performance, asynchronous API endpoints with automatic OpenAPI and Swagger documentation.
Clean Architecture: Follows a repository pattern to separate business logic from data access, enhancing maintainability and testability. The data layer is powered by SQLAlchemy ORM.
Comprehensive Tooling: Fully containerized with Docker and Docker Compose for easy setup and deployment. Includes a suite of code quality tools (black, flake8, mypy) and a CI/CD pipeline orchestrated with GitHub Actions.
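
As an illustration of the fixed-size chunking mentioned in the feature list, the following sketch splits a transcript into overlapping word windows. The function name `chunk_fixed` and the default `size` and `overlap` values are assumptions made for the example, not the project's actual settings, and semantic chunking is not covered here.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding some words twice.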

🛠️ Technology Stack

| Category | Technology |
| --- | --- |
| Backend Framework | FastAPI |
| Database | SQLite with SQLAlchemy (Async) |
| Async Tasks | Celery with Redis Broker |
| LLM Integration | LangChain, OpenAI API |
| Vector Store | FAISS (Facebook AI Similarity Search) |
| Dependency Mgmt | Poetry |
| Containerization | Docker, Docker Compose |
| Testing | pytest |
| Code Quality | black, flake8, mypy |
| CI/CD | GitHub Actions |

🚀 Getting Started

Prerequisites

Docker and Docker Compose
Poetry
An OpenAI API key

Installation & Setup

Clone the repository:

git clone https://github.com/your-username/TubeAtlas.git
cd TubeAtlas

Set up environment variables: Create a .env file in the project root by copying the example:
```
cp .env.example .env
```
Now, edit the .env file and add your OPENAI_API_KEY.
Install dependencies: Use Poetry to install the project dependencies.
```
poetry install
```
Launch the application: Use Docker Compose to build and run all the services (API, Celery workers, Redis).
```
docker-compose up --build
```

The API will be available at http://localhost:8000.

🔌 API Usage

Once the application is running, you can access the interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Example: Process a video

You can submit a YouTube video for transcript processing using a curl command:

curl -X 'POST' \
  'http://localhost:8000/api/v1/transcripts/video' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "video_url": "https://www.youtube.com/watch?v=your_video_id"
  }'

This will trigger an asynchronous background task to download and process the transcript.
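
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command above; `build_video_request` and `submit_video` are hypothetical helper names, and the shape of the JSON reply is an assumption because it is not documented here.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default address from the setup section


def build_video_request(video_url: str) -> urllib.request.Request:
    """Build the POST request for the transcript-processing endpoint."""
    body = json.dumps({"video_url": video_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/api/v1/transcripts/video",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def submit_video(video_url: str) -> dict:
    """Send the request and decode the JSON reply (assumed to be a JSON object)."""
    with urllib.request.urlopen(build_video_request(video_url), timeout=30) as resp:
        return json.load(resp)
```

Calling `submit_video("https://www.youtube.com/watch?v=your_video_id")` requires the services to be running locally; `build_video_request` can be inspected without a server.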

🔬 Development & Testing

This project is equipped with a full suite of development tools to ensure code quality and correctness.

Run tests: Execute the test suite using pytest.
```
poetry run pytest
```

Check code formatting and linting:

poetry run black . --check
poetry run flake8 src tests

Run static type checking:
```
poetry run mypy src
```

🗺️ Roadmap

This project is under active development. The future roadmap includes:

Advanced RAG Strategies: Implementing hierarchical summarization and hybrid retrieval methods as outlined in the PRD.
Knowledge Graph Enhancements: Full implementation of KG-based retrieval, graph merging, and interactive visualizations.
Chat Interface: Building out the conversational chat endpoints for querying channels and knowledge graphs.
Scalability Improvements: Migrating to PostgreSQL for enhanced database performance and implementing more robust caching layers.
Frontend: Building a modern and responsive frontend for the application.
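
To make the graph merging item concrete, here is one possible shape for it: per-video graphs are held as (subject, relation, object) triples, merged by set union after normalizing entity names, and then queried one hop at a time. This is a sketch of the idea only. The triple representation and the names `normalize`, `merge_graphs`, and `neighbours` are assumptions, not the planned implementation.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, relation, object)


def normalize(name: str) -> str:
    """Treat names that differ only in case or spacing as one entity."""
    return " ".join(name.lower().split())


def merge_graphs(*graphs: list[Triple]) -> set[Triple]:
    """Union per-video triples into one graph, dropping duplicates."""
    return {(normalize(s), r, normalize(o)) for g in graphs for s, r, o in g}


def neighbours(graph: set[Triple], entity: str) -> dict[str, set[str]]:
    """Group the entities reachable from `entity` in one hop by relation."""
    out: dict[str, set[str]] = defaultdict(set)
    for s, r, o in graph:
        if s == normalize(entity):
            out[r].add(o)
    return dict(out)
```

Real entity resolution usually needs more than case folding (aliases, abbreviations), so `normalize` marks the place where a stronger matching step would go.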

📄 License

TBD

Repository: Nicolas2912/TubeAtlas
