MULTIMODAL_RAG is a robust Retrieval-Augmented Generation (RAG) system that supports both text and image modalities. It leverages advanced context management, a finite state machine (FSM) for query routing, and async/await for high-performance rendering of inline images and graphs. The project is designed for research, portfolio, and real-world applications where rich, context-aware responses are required.
```
MULTIMODAL_RAG/
├── rag_pipeline.py         # Main RAG pipeline
├── streamlit_app.py        # Streamlit UI for demo/visualization
├── inspect_db.py           # Inspect ChromaDB contents
├── requirements.txt        # Python dependencies
├── Dockerfile              # Containerization
├── test.py                 # Test scripts
├── track.txt               # Tracking file
├── data/
│   ├── text/               # Markdown files
│   ├── pdfs/               # PDF files
│   └── images/             # Image files
├── chroma_db/              # ChromaDB vector store
├── src/
│   ├── utils.py            # Utility functions
│   ├── logger.py           # Logging setup
│   ├── exception.py        # Custom exceptions
│   ├── streamlit_utils.py  # Streamlit helpers
│   └── __init__.py
└── .env                    # Environment variables
```
- Clone the repository:

  ```bash
  git clone https://github.com/Shiv-Expert2503/MULTIMODAL_RAG.git
  cd MULTIMODAL_RAG
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  # For markdown/pdf/image support:
  pip install unstructured markdown pypdf2 pillow sentence-transformers chromadb langchain langchain-community langchain-google-genai python-dotenv
  ```
- Set up environment variables:
  - Create a `.env` file with your Google API key (see the loading sketch after this list):

    ```
    GOOGLE_API_KEY=your_google_api_key_here
    ```
- Prepare your data:
  - Place markdown files in `data/text/`, PDFs in `data/pdfs/`, and images in `data/images/`.
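The pipeline reads the API key at startup. A minimal sketch of that loading step, assuming `python-dotenv` (already in the dependency list); the error message is illustrative:

```python
import os

from dotenv import load_dotenv

# Pull GOOGLE_API_KEY from .env into the process environment.
load_dotenv()

api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is missing; check your .env file.")
```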
- Text Data:
  - Markdown files are loaded and chunked for semantic retrieval.
  - PDF files are parsed and chunked using `langchain` loaders.
- Image Data:
  - Images are embedded using CLIP (SentenceTransformer) and stored in ChromaDB.
- High-Quality RAG Enrichment:
  - Each chunk is enriched with metadata (source, page, context).
  - Embeddings are generated and stored in batches for efficiency (see the sketch after this list).
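A condensed sketch of the loading, chunking, and enrichment steps. The specific loaders, chunk sizes, and the `context` preview field are assumptions; `rag_pipeline.py` may differ:

```python
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, UnstructuredMarkdownLoader

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chunks = []
for md_file in Path("data/text").glob("*.md"):
    chunks.extend(splitter.split_documents(UnstructuredMarkdownLoader(str(md_file)).load()))

for pdf_file in Path("data/pdfs").glob("*.pdf"):
    # PyPDFLoader yields one Document per page, with source/page metadata attached.
    chunks.extend(splitter.split_documents(PyPDFLoader(str(pdf_file)).load()))

# Enrich each chunk: the loaders already set source (and page for PDFs);
# add a short context preview for retrieval-time display.
for chunk in chunks:
    chunk.metadata["context"] = chunk.page_content[:100]
```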
- Text Embedding:
  - Uses Google Generative AI Embeddings for text chunks.
- Image Embedding:
  - Uses the CLIP model for image embeddings.
- Storage:
  - Embeddings and metadata are stored in ChromaDB collections (`portfolio_text`, `portfolio_images`).
- Retrieval:
  - Queries are matched against both text and image embeddings for multimodal responses (sketched below).
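A hedged sketch of populating and querying the two collections. The collection names come from the project; the model names, file path, and ID scheme are illustrative assumptions, and `chunks` is reused from the ingestion sketch above:

```python
import chromadb
from PIL import Image
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="chroma_db")
text_col = client.get_or_create_collection("portfolio_text")
image_col = client.get_or_create_collection("portfolio_images")

# Text: Google Generative AI embeddings, added in one batch.
text_embedder = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
texts = [c.page_content for c in chunks]
text_col.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    embeddings=text_embedder.embed_documents(texts),
    documents=texts,
    metadatas=[c.metadata for c in chunks],
)

# Images: CLIP embeddings via sentence-transformers.
clip = SentenceTransformer("clip-ViT-B-32")
img_path = "data/images/example.png"  # hypothetical file
image_col.add(
    ids=[img_path],
    embeddings=[clip.encode(Image.open(img_path)).tolist()],
    metadatas=[{"source": img_path}],
)

# Retrieval: embed the query per modality and search both collections.
query = "show the architecture diagram"
text_hits = text_col.query(query_embeddings=[text_embedder.embed_query(query)], n_results=3)
image_hits = image_col.query(query_embeddings=[clip.encode(query).tolist()], n_results=3)
```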
- Purpose:
  - Routes user queries to the correct handler (text/image/general).
  - Maintains conversation state and topic transitions.
- Implementation:
  - Each query is classified (topic, intent, similarity).
  - The FSM decides whether to answer, rewrite, or reject based on similarity and gap thresholds.
- Example:

  ```python
  if similarity > threshold and gap > min_gap:
      state = 'accepted'
  else:
      state = 'rejected'
  ```
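The same decision wrapped in a small state machine. The state names, threshold values, and the exact condition for the middle "rewrite" branch are illustrative assumptions, not the project's exact values:

```python
from dataclasses import dataclass

@dataclass
class RouterFSM:
    threshold: float = 0.75  # minimum similarity to accept a query
    min_gap: float = 0.10    # required margin over the runner-up topic
    state: str = "idle"

    def route(self, similarity: float, gap: float) -> str:
        if similarity > self.threshold and gap > self.min_gap:
            self.state = "accepted"  # confident match: answer directly
        elif similarity > self.threshold:
            self.state = "rewrite"   # close topics: rewrite the query first
        else:
            self.state = "rejected"  # off-topic: route to the general handler
        return self.state

router = RouterFSM()
print(router.route(similarity=0.82, gap=0.20))  # -> "accepted"
```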
- Context Memory:
  - Stores previous queries, responses, and metadata for continuity.
  - Enables context-aware answers and follow-ups.
- Cache:
  - Frequently accessed queries and embeddings are cached for fast retrieval.
  - Implemented as a local JSON or in-memory cache.
- Example:

  ```python
  cache = {}

  def get_from_cache(query):
      return cache.get(query)
  ```
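Since the cache can also be persisted as local JSON, here is a sketch of a file-backed variant; the filename is an assumption:

```python
import json
from pathlib import Path

CACHE_PATH = Path("query_cache.json")  # hypothetical cache file

def load_cache() -> dict:
    # Return the persisted cache, or an empty one on first run.
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def save_to_cache(query: str, response: str) -> None:
    cache = load_cache()
    cache[query] = response
    CACHE_PATH.write_text(json.dumps(cache, indent=2))
```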
- Async Processing:
  - Embedding generation and retrieval are performed asynchronously for speed.
  - The Streamlit app uses async to render images and graphs inline without blocking the UI.
- Example:

  ```python
  import asyncio

  async def embed_and_store(...):
      await embedding_model.encode_async(...)
  ```
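SentenceTransformer's `encode` is a blocking call, so one concrete way to get the effect sketched above on Python 3.8+ is to push it onto an executor. This is a sketch, not the project's exact code; the model name is an assumption:

```python
import asyncio

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

async def embed_batch(texts):
    loop = asyncio.get_running_loop()
    # Run the blocking encode() in a worker thread so the event loop stays responsive.
    return await loop.run_in_executor(None, model.encode, texts)

async def main():
    vectors = await embed_batch(["first chunk", "second chunk"])
    print(vectors.shape)  # (2, embedding_dim)

asyncio.run(main())
```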
- Inline Rendering:
  - Images and graphs are displayed in real time using Streamlit's `st.image` and `st.pyplot` (a sketch follows below).
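A small sketch of that rendering path; the image path and plotted data are placeholders:

```python
import matplotlib.pyplot as plt
import streamlit as st
from PIL import Image

st.title("MULTIMODAL_RAG demo")

# Render a retrieved image inline.
st.image(Image.open("data/images/example.png"), caption="Retrieved image")  # hypothetical path

# Render a graph inline.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.2, 0.8, 0.5])
ax.set_title("Similarity scores")
st.pyplot(fig)
```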
- Run the RAG pipeline:

  ```bash
  python rag_pipeline.py
  ```

- Start the Streamlit app for visualization:

  ```bash
  streamlit run streamlit_app.py
  ```
- Interact with the system:
  - Ask questions about text or image data.
  - View inline images and graphs in the UI.
- Missing dependencies:
  - Ensure all packages in `requirements.txt` are installed.
- ChromaDB errors:
  - Delete and recreate the `chroma_db/` directory if corrupted.
- API key issues:
  - Check `.env` for the correct Google API key.
- Async errors:
  - Ensure Python 3.8+ for async/await support (quick check below).
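A quick way to verify the interpreter meets that requirement:

```python
import sys

# async/await usage in this project targets Python 3.8+.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version}"
```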
- Fork the repo and submit pull requests.
- Open issues for bugs or feature requests.
- Follow best practices for code quality and documentation.
Author: Shivansh (Shiv-Expert2503)
License: MIT
Contact: GitHub Issues
This README provides a comprehensive guide to the MULTIMODAL_RAG project, covering everything from setup to advanced features like FSM, context memory, caching, and async rendering. For further details, refer to the code and comments in each module.