This project implements a Retrieval-Augmented Generation (RAG) chatbot using LangChain and Streamlit. It allows users to upload files (e.g., PDF, TXT, CSV, DOCX) and query their content, or chat generally with an AI assistant without uploading files. The chatbot supports conversational history, robust document loading, and error handling for issues like missing Poppler dependencies for PDFs.
- General Chat Mode: Chat with an OpenAI `gpt-4o-mini` model immediately after entering an API key, without needing to upload files.
- RAG Chat Mode: Upload files, index them, and query their content using a retrieval-augmented approach.
- Supported File Formats: PDF, TXT, CSV, DOCX, Markdown, HTML, JSON, with fallbacks for unsupported formats.
- Robust Error Handling: Handles Poppler-related issues for PDFs and provides clear error messages.
- Debugging Tools: Displays document/chunk counts and retrieved context for troubleshooting.
- Conversational Memory: Maintains chat history across general and RAG modes for seamless interactions.
- Browser-Based UI: Built with Streamlit for an intuitive, web-based interface.
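The retrieve-then-generate flow behind RAG Chat Mode can be sketched in a few lines. This stdlib-only toy uses word overlap in place of real embedding similarity; the app itself relies on LangChain, FAISS, and OpenAI embeddings, so treat this purely as an illustration of the control flow:

```python
import re

# Toy retrieve-then-generate pipeline: score chunks by word overlap with the
# query (a stand-in for embedding similarity), keep the top-k, and assemble a
# grounded prompt. Illustrative only; not the app's actual implementation.
def words(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, chunks, k=2):
    return sorted(chunks, key=lambda c: len(words(query) & words(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    context = "\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "RAG combines retrieval and generation for grounded answers.",
    "Streamlit builds browser-based UIs in Python.",
    "FAISS performs fast vector similarity search.",
]
prompt = build_prompt("What is RAG?", docs, k=1)
```

In the real app the scoring step is replaced by a FAISS similarity search over OpenAI embeddings, and the prompt is sent to `gpt-4o-mini`.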
- Python: 3.8 or higher.
- OpenAI API Key: Required for LLM and embeddings. Sign up at OpenAI and obtain an API key.
- Poppler (for PDFs): Optional but recommended for PDF processing.
- Windows: Download from Poppler for Windows, extract, and add `bin` to PATH (e.g., `C:\poppler\bin`).
- macOS: `brew install poppler`
- Linux: `sudo apt update && sudo apt install poppler-utils`
- Verify: `pdftotext -v`
- System Requirements: At least 4GB RAM for small datasets; more for large files or complex queries.
- Clone or download this repository.
- Install Python dependencies:
  `pip install streamlit langchain langchain-openai langchain-community langchain-text-splitters langchainhub faiss-cpu openai pypdf docx2txt unstructured`
  `pip install "unstructured[pdf]"`
- (Optional) Install Poppler for PDF support (see Prerequisites).
- Save the script as `app.py`.
- Run the Streamlit app: `streamlit run app.py`
- Open the provided URL in your browser (e.g., `http://localhost:8501`).
- Enter your OpenAI API key in the sidebar.
- General Chat:
- Start chatting immediately by typing questions in the chat input (e.g., "What is RAG?").
- The AI responds using `gpt-4o-mini` with conversational history.
- RAG Chat:
- Upload files (e.g., TXT, DOCX, PDF) via the file uploader.
- Click "Index Uploaded Files" to process and index the documents.
- Ask questions about the document content (e.g., "Summarize the document").
- Check the "Retrieved Context" expander to debug retrieved chunks.
- Reset: Click "Reset Chat History" to clear the conversation and indexed data.
- General Chat:
- Input: "What is AI?"
- Output: "AI is the simulation of human intelligence in machines, enabling tasks like learning and problem-solving."
- RAG Chat:
- Upload a `test.txt` containing: "This document discusses Retrieval-Augmented Generation (RAG)..."
- Index the file.
- Input: "What is RAG?"
- Output: "RAG is Retrieval-Augmented Generation, a method combining retrieval and generation for better AI responses."
- "I don't know" Responses:
- Ensure documents are indexed (check document/chunk counts in the UI).
- Verify query matches document content (use specific phrases or keywords).
- Check "Retrieved Context" expander to see if relevant chunks were retrieved.
- Try larger chunk sizes or more retrieved documents (edit `chunk_size` or `k` in the script).
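To see why `chunk_size` matters, here is a simplified sliding-window splitter. The app presumably uses a LangChain text splitter; this stand-in just shows how size and overlap control chunk count and boundary context:

```python
# Simplified character-level splitter with overlap. A larger chunk_size keeps
# more surrounding context in each chunk; the overlap prevents a fact from
# being cut in half at a chunk boundary. Illustrative, not the app's splitter.
def split_text(text, chunk_size=100, chunk_overlap=20):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(250))
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
# Chunks start at 0, 80, 160, 240; adjacent chunks share 20 characters.
```

If answers come back as "I don't know", a larger `chunk_size` (fewer, bigger chunks) or a larger `k` (more chunks passed to the model) both increase the chance the relevant passage reaches the prompt.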
- Poppler Errors:
- If you see "Unable to get page count" or similar, install Poppler and ensure it’s in PATH.
- Alternatively, use non-PDF files (e.g., TXT, DOCX) or enable `UnstructuredFileLoader`'s `fast` strategy.
- Verify Poppler is installed: `pdftotext -v`
- Document Loading Errors:
- Check the error message in the UI.
- Ensure files are not empty or corrupted.
- Try a simple TXT file to test the pipeline.
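A minimal sketch of the defensive loading these checks imply: surface empty files with a clear error and fall back to a second encoding instead of crashing on stray bytes. This is illustrative only; the app's actual loaders come from LangChain:

```python
from pathlib import Path

# Read a text file defensively: reject empty files with an explicit message
# and fall back from UTF-8 to latin-1 so odd bytes don't abort indexing.
def load_text(path):
    data = Path(path).read_bytes()
    if not data:
        raise ValueError(f"{path} is empty")
    for encoding in ("utf-8", "latin-1"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")  # last resort
```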
- API Key Issues:
- Ensure the OpenAI API key is valid and has sufficient quota.
- Check for typos in the sidebar input.
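A cheap client-side sanity check can catch obvious typos before any request is made. The `sk-` prefix is a common convention for OpenAI keys, not a guarantee, and the only authoritative check is a real API call:

```python
# Heuristic pre-flight check for an OpenAI API key. Catches pasted whitespace,
# truncation, and obviously wrong strings; it cannot verify validity or quota.
def looks_like_openai_key(key: str) -> bool:
    key = key.strip()
    return key.startswith("sk-") and len(key) > 20 and " " not in key
```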
- `app.py`: Main script containing the Streamlit app, document loading, RAG pipeline, and chat logic.
- Dependencies: managed via `pip` (see Installation).
- Local LLMs: Integrate Hugging Face models for offline use.
- Multimodal Support: Add image or audio processing for advanced file types.
- Advanced UI: Enhance Streamlit with real-time file previews or query suggestions.
- Vector Store Options: Support Chroma or Pinecone for persistent storage.
This project is licensed under the MIT License.