A web-based chatbot that lets you ask questions about your PDF documents. It uses retrieval-augmented generation (RAG): relevant chunks are retrieved from your PDFs using OpenAI embeddings, and an LLM answers based on that context. Built with Streamlit for the frontend and FastAPI + LangChain for the backend.
Example of the chatbot UI showing uploaded PDFs, user queries, and retrieved sources with page numbers.
- Upload one or multiple PDFs to build a knowledge base.
- Ask questions about the PDFs and get context-aware answers.
- See source documents and page numbers for transparency.
- Maintains per-session chat history.
- Supports Chroma vector stores.
- Modern RAG pipeline using LangChain.
- Backend: Python, FastAPI, LangChain, OpenAI Embeddings
- Vector Store: Chroma
- Frontend: Streamlit
- PDF parsing: `PyPDFLoader` from `langchain_community`
- Environment management: `.env` file for API keys
- Clone the repo:

```bash
git clone https://github.com/yourusername/pdf-rag-chatbot.git
cd pdf-rag-chatbot
```

- Create a virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Add environment variables in `.env`:

```bash
OPENAI_API_KEY=your_openai_api_key
```

- Start the backend:

```bash
uvicorn backend.main:app --reload
```

The backend will run at `http://localhost:8000`.

- Start the frontend:

```bash
streamlit run frontend/app.py
```

The frontend will run at `http://localhost:8501`.
```
pdf-rag-chatbot/
├─ backend/
│  ├─ app/
│  │  ├─ services/
│  │  │  └─ rag_pipeline.py   # RAG pipeline logic
│  │  └─ main.py              # FastAPI endpoints
├─ frontend/
│  └─ app.py                  # Streamlit UI
├─ requirements.txt
├─ .env.example
└─ README.md
```
- Upload PDFs using the sidebar.
- Ask questions in the chat input.
- Toggle “Show sources” to see filenames and page numbers of retrieved chunks.
- Clear chat with the Clear Chat button.
- PDF Ingestion: PDFs are loaded and split into chunks using `RecursiveCharacterTextSplitter`.
- Embeddings: Chunks are embedded using OpenAI embeddings.
- Vector Store: Chroma stores the embeddings for retrieval.
- Querying:
  - The user question plus chat history is rewritten by the LLM into a standalone query.
  - The retriever fetches the most relevant chunks.
  - The LLM answers based on the retrieved chunks.
- Chat History: Conversation context is maintained per session.
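The retrieval steps above can be sketched without any framework. This toy example is illustrative only: a bag-of-words counter stands in for OpenAI embeddings, an in-memory list stands in for Chroma, and a fixed-size slicer stands in for `RecursiveCharacterTextSplitter` — none of these names come from the project's actual code.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for OpenAI embeddings: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_chunks(text: str, size: int = 40) -> list[str]:
    # Toy stand-in for RecursiveCharacterTextSplitter: fixed-size slices.
    return [text[i:i + size] for i in range(0, len(text), size)]

class ToyVectorStore:
    # In-memory stand-in for Chroma: stores (chunk, vector) pairs.
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.items += [(c, embed(c)) for c in chunks]

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank chunks by similarity to the query and return the top k.
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

store = ToyVectorStore()
store.add(split_chunks("RAG retrieves relevant chunks and an LLM answers from them."))
top = store.retrieve("which chunks does RAG retrieve?", k=1)
```

In the real pipeline, the retrieved chunks (plus the rewritten standalone question) are passed to the LLM as context; here the sketch stops at retrieval.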
- Duplicate chunks from the same page are automatically deduplicated when displaying sources.
- Page numbers, total pages, and source filenames are available for transparency.
- Supports multiple sessions using unique session IDs.
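The deduplication and per-session history described above can be sketched as follows. The helper names (`add_turn`, `dedupe_sources`) and the source-dict shape are hypothetical, not the project's actual API:

```python
from collections import defaultdict

# Per-session chat history: session_id -> list of (role, message) pairs.
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)

def add_turn(session_id: str, role: str, message: str) -> None:
    # Append one conversation turn to the given session's history.
    histories[session_id].append((role, message))

def dedupe_sources(sources: list[dict]) -> list[dict]:
    # Keep only the first chunk per (filename, page), preserving order,
    # so the same page is not listed twice in the "Show sources" panel.
    seen: set[tuple[str, int]] = set()
    unique = []
    for src in sources:
        key = (src["source"], src["page"])
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```

Keying history on a unique session ID is what keeps concurrent users' conversations isolated from one another.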
- Add multi-file search across all uploaded PDFs.
- Highlight the exact text in PDF where the answer was found.
- Support multimodal PDFs (images + text) using Google Gemini.
MIT License © Towha Elahi