Textbook RAG System

This repository provides a Retrieval-Augmented Generation (RAG) system designed to answer questions based on the content of a collection of PDF documents. The system processes, stores, and retrieves information from PDFs to generate context-aware responses to user queries.

System Architecture

The project is structured around three main stages: data ingestion, retrieval, and generation.

1. Data Ingestion

The ingestion pipeline processes PDF files and prepares them for retrieval.

Document Processing: PDF files placed in the textbooks directory are processed using the Docling library. This step handles document conversion and employs a hybrid chunking strategy to break the content into smaller text segments.
Embedding Generation: Each text chunk is converted into a vector embedding using OpenAI's text-embedding-3-small model.
Storage: The processed data is stored in a PostgreSQL database with the PGVector extension for vector db capabilities. The ingestion logic is managed by scripts/ingest_pdf.py. It checks the file hash of each document to avoid re-ingesting content that has not changed.

2. Retrieval

The retrieval stage finds the most relevant document chunks based on a user's query.

Query Expansion: The initial user query is expanded into multiple variations using Gemini 2.5 Flash to improve the breadth of the search.
Hybrid Search: A combination of semantic search (using pgvector for cosine similarity on embeddings) and keyword-based search is used to fetch an initial set of relevant chunks.
Reranking: The retrieved chunks are reranked using the BAAI/bge-reranker-v2-m3 model. This step scores the chunks against the original query and selects the top 5 most relevant results to use as context.

3. Generation

The final answer is generated by a large language model.

LLM Inference: The top-ranked chunks and the original query are passed to a language model (tested with Google's Gemini 2.5 Flash) via the OpenRouter.ai API to synthesize a final answer.

User Interface

The system provides both command-line and web-based interfaces for querying the RAG system:

Command Line Interface

Run queries directly from the terminal using: python src/scripts/query_rag.py

Streamlit Web Interface

For a more interactive experience, the system includes a Streamlit web application that provides:

Interactive Chat Interface: A conversational UI where users can ask questions and receive answers with proper formatting and markdown support.
Real-time Hyperparameter Control: Adjustable sliders in the sidebar allow users to fine-tune retrieval parameters on-the-fly:
- Retrieval Top K: Control how many document chunks to retrieve via semantic and keyword search.
- Reranker Top K: Limit the number of chunks added to context via reranker model sorting and truncating.
- Hybrid Retriever Alpha: Adjust the alpha used to control the weighting for scores from semantic and keyword searches.
- Number of Expanded Queries: Number of additional queries to construct for increase retrieval quality. Total number of retrieved chunks will be the value of retrieval top k multiplied by the number of expanded queries. However, only the reranker top k number of chunks will be added to the context window.
Source Transparency: The interface displays the relevant document chunks used to generate each answer, showing:
- The source document name and page numbers
- The actual text content that informed the response
- Similarity scores for each chunk
Session Management: Maintains conversation history within the session for better context.

Launch the Streamlit interface with: streamlit run src/scripts/streamlit_query_rag.py

The web interface makes it easy to experiment with different parameter settings and understand how the RAG system arrives at its answers by showing the underlying source material.

Database Schema

The system requires a PostgreSQL database with the pgvector extension enabled. The schema consists of two primary tables, which are created by the scripts/init_sql_tables.py script.

documents table: Stores metadata for each processed PDF.
- file_hash (Primary Key): A SHA256 hash of the document's content.
- doc_name, source_path, file_size, mime, page_count.
- ingested_at: A timestamp that is populated after the document's chunks are successfully ingested.
chunks table: Stores the individual text chunks and their embeddings.
- Contains the text, heading, page_start, page_end, and embedding for each chunk.
- file_hash (Foreign Key): Links each chunk back to its source document in the documents table.

Getting Started

Prerequisites

Python 3.x (I'm using 3.13)
PostgreSQL with the pgvector extension
An OpenAI API key
An OpenRouter.ai API key
Streamlit (included in environment.yml)

Installation and Setup

Clone the repository:
git clone \<repository-url\>
Install dependencies:
conda env create -f environment.yml
Configure environment variables:
Create a .env file in the project's root directory. See src/settings.py for the required variables, which include database connection details and API keys.

Usage

Run from the textbook-rag root directory

Initialize the database tables:
python -m src/scripts/init_sql_tables.py
Ingest documents:
Add your PDF files to the textbooks/ directory and run the ingestion script:
python -m src/scripts/ingest_pdf.py
Run queries:

Command line interface (not recommended): python -m src/scripts/query_rag.py

Web interface (recommended): python -m streamlit run src/scripts/streamlit_query_rag.py

The Streamlit interface will open in your browser (typically at http://localhost:8501) where you can interact with the RAG system through an intuitive web interface with real-time parameter adjustment and source document transparency.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.env-example		.env-example
.gitignore		.gitignore
README.md		README.md
environment-no-builds.yml		environment-no-builds.yml
environment.yml		environment.yml
textbook-rag.mp4		textbook-rag.mp4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Textbook RAG System

System Architecture

1. Data Ingestion

2. Retrieval

3. Generation

User Interface

Command Line Interface

Streamlit Web Interface

Database Schema

Getting Started

Prerequisites

Installation and Setup

Usage

Run from the textbook-rag root directory

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Textbook RAG System

System Architecture

1. Data Ingestion

2. Retrieval

3. Generation

User Interface

Command Line Interface

Streamlit Web Interface

Database Schema

Getting Started

Prerequisites

Installation and Setup

Usage

Run from the textbook-rag root directory

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages