A Streamlit-based chatbot powered by Retrieval-Augmented Generation (RAG) and OpenAI. Upload your PDFs and chat with them! This app leverages LangChain, FAISS, and OpenAI’s GPT models to extract and query document content with metadata-aware answers.
- 🔍 Upload multiple PDFs and query across all of them
- 📄 Metadata-rich answers with filename and page references
- 🧠 Uses LangChain + FAISS for semantic search
- 🤖 Streamlit Chat UI for natural conversation
- 💾 OpenAI API support with streaming responses
```text
.
├── .gitignore
├── LICENSE
├── README.md            # ← You're reading it
├── app.py               # Main Streamlit app
├── brain.py             # PDF parsing and vector index logic
├── compare medium.gif   # Optional UI illustration
├── requirements.txt     # Python dependencies
└── thumbnail.webp       # Preview image
```
```bash
git clone https://github.com/shamazhooda/chatbot-rag-langchain.git
cd chatbot-rag-langchain
pip install -r requirements.txt
```

Create a `.streamlit/secrets.toml` file with:

```toml
OPENAI_API_KEY = "your-openai-key"
```

Or export it via environment variable:

```bash
export OPENAI_API_KEY="your-openai-key"
```

Then start the app:

```bash
streamlit run app.py
```

- Upload PDFs via the UI
- Each PDF is parsed using `PyPDF2` and chunked via LangChain's `RecursiveCharacterTextSplitter`
- Chunks are embedded using OpenAI embeddings
- Embedded chunks are stored in a FAISS vector store for semantic similarity search
- Queries are matched to the top PDF chunks, which are passed to ChatGPT as context
- Answers include file name and page number metadata for citation (a sketch of this pipeline follows the list)
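As a rough, end-to-end sketch of that pipeline (illustrative only: `build_index`, the chunk sizes, and the `pypdf` import are assumptions, not the repo's exact API):

```python
# Illustrative sketch (not the repo's exact API): parse -> chunk -> embed -> index
from io import BytesIO

from pypdf import PdfReader  # successor to PyPDF2; the repo may import PyPDF2 instead
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def build_index(pdf_bytes: bytes, filename: str, api_key: str) -> FAISS:
    reader = PdfReader(BytesIO(pdf_bytes))
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        for i, chunk in enumerate(splitter.split_text(text)):
            docs.append(Document(
                page_content=chunk,
                # filename/page metadata is what enables the cited answers described above
                metadata={"filename": filename, "page": page_num, "chunk": i},
            ))
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)
    return FAISS.from_documents(docs, embeddings)
```

Attaching `filename` and `page` to each chunk is what lets the model cite its sources later.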
- Streamlit – UI framework
- LangChain – PDF chunking and retrieval
- FAISS – Vector search backend
- OpenAI GPT – LLM-based answer generation
- PyPDF2 – PDF parsing
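Given that stack, `requirements.txt` plausibly lists something like the following (an unpinned guess; consult the actual file in the repo):

```text
streamlit
langchain
langchain-community
langchain-openai
faiss-cpu
openai
PyPDF2
```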
"What are the main points from the introduction?"
Answer: The introduction highlights... (example.pdf, page 1)
- UI: `app.py` (Streamlit)
- RAG pipeline: `brain.py` (parse → chunk → embed → index)
- Model: OpenAI Chat Completions API (streaming)
- Retriever: FAISS similarity search
- PDFs: in-memory during the session (not persisted by default)
- Chunks/metadata: in-memory `Document` objects
- Vector store: FAISS, created in-memory and cached via `st.cache_resource`; lives for the server process lifetime and is cleared on restart or cache clear
- Chat history: `st.session_state` (per browser session)
- To keep vectors across restarts, persist FAISS:
```python
# brain.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def save_index_local(vectordb: FAISS, path: str):
    # Write the FAISS index and its docstore/metadata to disk
    vectordb.save_local(path)

def load_index_local(path: str, openai_api_key: str) -> FAISS:
    # Rebuild the embeddings object so queries embed the same way as at index time
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    return FAISS.load_local(path, embeddings, allow_dangerous_deserialization=True)
```

```python
# app.py (around create_vectordb)
import os

DB_DIR = "data/faiss_index"
if os.path.isdir(DB_DIR):
    vectordb = load_index_local(DB_DIR, OPENAI_API_KEY)
else:
    vectordb = get_index_for_pdf([f.getvalue() for f in files], filenames, OPENAI_API_KEY)
    save_index_local(vectordb, DB_DIR)
```
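Note that `FAISS.load_local` requires `allow_dangerous_deserialization=True` because it unpickles the stored docstore; only enable it for index files you created yourself.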
- `app.py`
  - Loads `OPENAI_API_KEY` from `st.secrets` or the environment.
  - Uploads PDFs with `st.file_uploader`.
  - Builds or retrieves the FAISS index using `@st.cache_resource` in `create_vectordb(...)`.
  - On a question:
    - Retrieves top-k chunks with `vectordb.similarity_search(...)`.
    - Injects those chunks into a system prompt.
    - Streams model output using OpenAI's v1 Chat Completions API (see the sketch after this list).
- `brain.py`
  - `parse_pdf(...)`: uses `pypdf` to extract text from each page and normalizes whitespace/hyphenation.
  - `text_to_docs(...)`: splits text into chunks using `RecursiveCharacterTextSplitter` and attaches `filename`, `page`, and `chunk` metadata.
  - `docs_to_index(...)`: creates a FAISS index via `FAISS.from_documents(...)` with `OpenAIEmbeddings`.
  - `get_index_for_pdf(...)`: orchestrates PDF parse → chunk → embed → FAISS index.
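A hedged sketch of the retrieval-plus-streaming step mentioned above (assumes the OpenAI v1 Python SDK; the `answer` helper and the model name are illustrative, not necessarily what `app.py` uses):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def answer(question: str, vectordb, k: int = 3):
    # Retrieve the k most similar chunks and fold them into a system prompt
    chunks = vectordb.similarity_search(question, k=k)
    context = "\n\n".join(
        f"[{d.metadata['filename']}, page {d.metadata['page']}] {d.page_content}"
        for d in chunks
    )
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whichever model the app configures
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content
        if delta:
            yield delta  # tokens arrive incrementally for the UI to render
```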
- Upload PDFs → `parse_pdf` extracts text per page.
- Text → `text_to_docs` creates chunked `Document` objects with metadata.
- Docs → `docs_to_index` embeds with `OpenAIEmbeddings` and builds a FAISS index.
- On a user question → `similarity_search(k=3)` returns the most relevant chunks.
- The app forms a system prompt with those chunks and streams a response from the model.
- The UI displays tokens as they arrive (sketched below).
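The token-by-token display can be wired up with Streamlit's chat primitives, roughly like this (illustrative; reuses the hypothetical `answer` generator from the sketch above):

```python
import streamlit as st

if "history" not in st.session_state:
    st.session_state.history = []  # chat history lives per browser session

# Replay earlier turns
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if question := st.chat_input("Ask about your PDFs"):
    st.chat_message("user").write(question)
    with st.chat_message("assistant"):
        # st.write_stream consumes the generator and returns the full reply
        reply = st.write_stream(answer(question, vectordb))
    st.session_state.history += [("user", question), ("assistant", reply)]
```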
- Your OpenAI key is required; store it in `.streamlit/secrets.toml` or the environment.
- Uploaded files stay in memory and are not written to disk unless you add persistence.
- The FAISS index is in-memory unless you add the optional save/load shown above.
<code_block_to_apply_changes_from>- Set
OPENAI_API_KEYin.streamlit/secrets.tomlor as an env var. pip install -r requirements.txtstreamlit run app.py