A web-based chatbot that lets you ask questions about your PDF documents. It uses retrieval-augmented generation (RAG): relevant chunks are retrieved from your PDFs using OpenAI embeddings, and an LLM answers based on that context. Built with Streamlit for the frontend and FastAPI + LangChain for the backend.
Example of the chatbot UI showing uploaded PDFs, user queries, and retrieved sources with page numbers.
- Upload one or multiple PDFs to build a knowledge base.
- Ask questions about the PDFs and get context-aware answers.
- See source documents and page numbers for transparency.
- Maintains per-session chat history.
- Supports Chroma vector stores.
- Modern RAG pipeline using LangChain.
- Backend: Python, FastAPI, LangChain, OpenAI Embeddings
- Vector Store: Chroma
- Frontend: Streamlit
- PDF parsing: `PyPDFLoader` from `langchain_community`
- Environment management: `.env` file for API keys
- Clone the repo:

```bash
git clone https://github.com/yourusername/pdf-rag-chatbot.git
cd pdf-rag-chatbot
```

- Create a virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Add environment variables in `.env`:

```bash
OPENAI_API_KEY=your_openai_api_key
```

- Start the backend:

```bash
uvicorn backend.main:app --reload
```

The backend will run at `http://localhost:8000`.

- Start the frontend:

```bash
streamlit run frontend/app.py
```

The frontend will run at `http://localhost:8501`.
```
pdf-rag-chatbot/
├─ backend/
│  ├─ app/
│  │  ├─ services/
│  │  │  └─ rag_pipeline.py   # RAG pipeline logic
│  │  └─ main.py              # FastAPI endpoints
├─ frontend/
│  └─ app.py                  # Streamlit UI
├─ requirements.txt
├─ .env.example
└─ README.md
```
- Upload PDFs using the sidebar.
- Ask questions in the chat input.
- Toggle “Show sources” to see filenames and page numbers of retrieved chunks.
- Clear chat with the Clear Chat button.
- PDF Ingestion: PDFs are loaded and split into chunks using `RecursiveCharacterTextSplitter`.
- Embeddings: Chunks are embedded using OpenAI embeddings.
- Vector Store: Chroma stores the embeddings for retrieval.
- Querying:
  - The user question plus chat history is rewritten by the LLM into a standalone query.
  - The retriever fetches the most relevant chunks.
  - The LLM answers based on the retrieved chunks.
- Chat History: Conversation context is maintained per session.
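The retrieval steps above can be sketched without any framework. This toy example is illustrative only: a bag-of-words counter stands in for OpenAI embeddings, an in-memory list stands in for Chroma, and a fixed-size slicer stands in for `RecursiveCharacterTextSplitter` — none of these names come from the project's actual code.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for OpenAI embeddings: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_chunks(text: str, size: int = 40) -> list[str]:
    # Toy stand-in for RecursiveCharacterTextSplitter: fixed-size slices.
    return [text[i:i + size] for i in range(0, len(text), size)]

class ToyVectorStore:
    # In-memory stand-in for Chroma: stores (chunk, vector) pairs.
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.items += [(c, embed(c)) for c in chunks]

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank chunks by similarity to the query and return the top k.
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

store = ToyVectorStore()
store.add(split_chunks("RAG retrieves relevant chunks and an LLM answers from them."))
top = store.retrieve("which chunks does RAG retrieve?", k=1)
```

In the real pipeline, the retrieved chunks (plus the rewritten standalone question) are passed to the LLM as context; here the sketch stops at retrieval.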
- Duplicate chunks from the same page are automatically deduplicated when displaying sources.
- Page numbers, total pages, and source filenames are available for transparency.
- Supports multiple sessions using unique session IDs.
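The deduplication and per-session history described above can be sketched as follows. The helper names (`add_turn`, `dedupe_sources`) and the source-dict shape are hypothetical, not the project's actual API:

```python
from collections import defaultdict

# Per-session chat history: session_id -> list of (role, message) pairs.
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)

def add_turn(session_id: str, role: str, message: str) -> None:
    # Append one conversation turn to the given session's history.
    histories[session_id].append((role, message))

def dedupe_sources(sources: list[dict]) -> list[dict]:
    # Keep only the first chunk per (filename, page), preserving order,
    # so the same page is not listed twice in the "Show sources" panel.
    seen: set[tuple[str, int]] = set()
    unique = []
    for src in sources:
        key = (src["source"], src["page"])
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```

Keying history on a unique session ID is what keeps concurrent users' conversations isolated from one another.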
- Add multi-file search across all uploaded PDFs.
- Highlight the exact text in PDF where the answer was found.
- Support multimodal PDFs (images + text) using Google Gemini.
MIT License © Towha Elahi