An advanced platform for large-scale YouTube content analysis, transforming unstructured video transcripts into structured knowledge graphs and enabling powerful, retrieval-augmented querying.
TubeAtlas is engineered to unlock the vast knowledge repository within YouTube. It automates the entire pipeline from content ingestion to insight generation, allowing researchers, creators, and analysts to deeply understand and interact with video content at scale.
The core of TubeAtlas is a sophisticated Retrieval-Augmented Generation (RAG) pipeline that intelligently processes transcripts, builds comprehensive knowledge graphs, and facilitates natural language conversations with the content of entire YouTube channels, even those containing millions of tokens.
The system is designed with a clean, scalable architecture that separates concerns into distinct layers. It leverages asynchronous processing to handle long-running, resource-intensive tasks efficiently.
graph TD
subgraph "Input Layer"
A["YouTube Channels/Videos"] --> B("YouTube Service")
end
subgraph "Processing Layer (Async)"
B --> C{"Celery + Redis"}
C --> D["Transcript Service"]
D --> E["RAG Pipeline"]
end
subgraph "RAG Pipeline"
direction LR
E_Chunk["Chunking<br/>(Semantic, Fixed)"]
E_Embed["Embedding<br/>(OpenAI)"]
E_KG["Graph Extraction<br/>(LLM)"]
E_Store["Vector Store<br/>(FAISS)"]
E_Chunk --> E_Embed --> E_Store
E_Chunk --> E_KG
end
subgraph "Data Persistence"
E --> F["SQLite Database<br/>(Metadata, KGs)"]
E --> G["FAISS Vector Store<br/>(Embeddings)"]
end
subgraph "API & Query Layer"
H["User"] --> I("FastAPI")
I --> J["Chat & KG Services"]
J --> E
J --> F
J --> G
J --> H
end
linkStyle 8 stroke-width:2px,stroke:green,stroke-dasharray: 5 5;
- Automated Content Ingestion: Seamlessly fetches video metadata and transcripts for individual videos or entire YouTube channels.
- Advanced RAG Pipeline: Implements a multi-faceted retrieval strategy combining semantic search (via embeddings) and knowledge graph traversal to provide accurate, context-aware answers.
- Intelligent Chunking: Employs sophisticated text chunking strategies, including fixed-size and semantic chunking, to prepare transcripts for efficient processing by Large Language Models (LLMs).
- High-Performance Vector Storage: Utilizes FAISS (Facebook AI Similarity Search) for efficient storage and retrieval of text embeddings, forming the backbone of the semantic search capability.
- Knowledge Graph Generation: Leverages LLMs to extract structured entities and their relationships from unstructured text, building a comprehensive knowledge graph of the content.
- Asynchronous Task Processing: Uses a powerful combination of Celery and Redis to manage long-running tasks like transcript downloading and knowledge graph creation in the background, ensuring the API remains responsive.
- Robust & Modern Backend: Built with FastAPI for high-performance, asynchronous API endpoints with automatic OpenAPI and Swagger documentation.
- Clean Architecture: Follows a repository pattern to separate business logic from data access, enhancing maintainability and testability. The data layer is powered by SQLAlchemy ORM.
- Comprehensive Tooling: Fully containerized with Docker and Docker Compose for easy setup and deployment. Includes a suite of code quality tools (
black
,flake8
,mypy
) and a CI/CD pipeline orchestrated with GitHub Actions.
Category | Technology |
---|---|
Backend Framework | FastAPI |
Database | SQLite with SQLAlchemy (Async) |
Async Tasks | Celery with Redis Broker |
LLM Integration | LangChain, OpenAI API |
Vector Store | FAISS (Facebook AI Similarity Search) |
Dependency Mgmt | Poetry |
Containerization | Docker, Docker Compose |
Testing | pytest |
Code Quality | black, flake8, mypy |
CI/CD | GitHub Actions |
- Docker and Docker Compose
- Poetry
- An OpenAI API key
-
Clone the repository:
git clone https://github.com/your-username/TubeAtlas.git cd TubeAtlas
-
Set up environment variables: Create a
.env
file in the project root by copying the example:cp .env.example .env
Now, edit the
.env
file and add yourOPENAI_API_KEY
. -
Install dependencies: Use Poetry to install the project dependencies.
poetry install
-
Launch the application: Use Docker Compose to build and run all the services (API, Celery workers, Redis).
docker-compose up --build
The API will be available at http://localhost:8000
.
Once the application is running, you can access the interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
You can submit a YouTube video for transcript processing using a curl
command:
curl -X 'POST' \
'http://localhost:8000/api/v1/transcripts/video' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"video_url": "https://www.youtube.com/watch?v=your_video_id"
}'
This will trigger an asynchronous background task to download and process the transcript.
This project is equipped with a full suite of development tools to ensure code quality and correctness.
-
Run tests: Execute the test suite using
pytest
.poetry run pytest
-
Check code formatting and linting:
poetry run black . --check poetry run flake8 src tests
-
Run static type checking:
poetry run mypy src
This project is under active development. The future roadmap includes:
- Advanced RAG Strategies: Implementing hierarchical summarization and hybrid retrieval methods as outlined in the PRD.
- Knowledge Graph Enhancements: Full implementation of KG-based retrieval, graph merging, and interactive visualizations.
- Chat Interface: Building out the conversational chat endpoints for querying channels and knowledge graphs.
- Scalability Improvements: Migrating to PostgreSQL for enhanced database performance and implementing more robust caching layers.
- Frontend: Building a modern and responsive frontend for the application.
TBD