MULTIMODAL_RAG: Multimodal Retrieval-Augmented Generation System

Table of Contents

  • Project Overview
  • Directory Structure
  • Installation & Setup
  • Data Preparation
  • RAG Pipeline
  • Advanced Features
  • Usage
  • Troubleshooting
  • Contributing

Project Overview

MULTIMODAL_RAG is a Retrieval-Augmented Generation (RAG) system that supports both text and image modalities. It combines context management, a finite state machine (FSM) for query routing, and async/await so inline images and graphs render without blocking the UI. The project is designed for research, portfolio, and real-world applications that require rich, context-aware responses.

Directory Structure

MULTIMODAL_RAG/
├── rag_pipeline.py           # Main RAG pipeline
├── streamlit_app.py          # Streamlit UI for demo/visualization
├── inspect_db.py             # Inspect ChromaDB contents
├── requirements.txt          # Python dependencies
├── Dockerfile                # Containerization
├── test.py                   # Test scripts
├── track.txt                 # Tracking file
├── data/
│   ├── text/                 # Markdown files
│   ├── pdfs/                 # PDF files
│   └── images/               # Image files
├── chroma_db/                # ChromaDB vector store
├── src/
│   ├── utils.py              # Utility functions
│   ├── logger.py             # Logging setup
│   ├── exception.py          # Custom exceptions
│   ├── streamlit_utils.py    # Streamlit helpers
│   └── __init__.py
└── .env                      # Environment variables

Installation & Setup

  1. Clone the repository:
    git clone https://github.com/Shiv-Expert2503/MULTIMODAL_RAG.git
    cd MULTIMODAL_RAG
  2. Install dependencies:
    pip install -r requirements.txt
    # For markdown/pdf/image support:
    pip install unstructured markdown pypdf2 pillow sentence-transformers chromadb langchain langchain-community langchain-google-genai python-dotenv
  3. Set up environment variables:
    • Create a .env file with your Google API key (a quick sanity check follows this list):
      GOOGLE_API_KEY=your_google_api_key_here
  4. Prepare your data:
    • Place markdown files in data/text/, PDFs in data/pdfs/, and images in data/images/.
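
After creating .env, a minimal way to confirm the key is visible (the assert is only an illustrative check, not part of the project):

    import os
    from dotenv import load_dotenv

    load_dotenv()                        # reads .env from the working directory
    assert os.getenv("GOOGLE_API_KEY"), "GOOGLE_API_KEY missing from .env"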

Data Preparation

  • Text Data:
    • Markdown files are loaded and chunked for semantic retrieval.
    • PDF files are parsed and chunked using langchain loaders.
  • Image Data:
    • Images are embedded using CLIP (SentenceTransformer) and stored in ChromaDB.
  • High-Quality RAG Enrichment:
    • Each chunk is enriched with metadata (source, page, context); see the sketch after this list.
    • Embeddings are generated and stored in batches for efficiency.
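
A minimal sketch of the load-and-chunk step, assuming the langchain-community UnstructuredMarkdownLoader (the file name and chunk sizes are illustrative, not the pipeline's exact settings):

    from langchain_community.document_loaders import UnstructuredMarkdownLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = UnstructuredMarkdownLoader("data/text/about.md").load()   # hypothetical file
    chunks = splitter.split_documents(docs)
    for chunk in chunks:
        chunk.metadata["modality"] = "text"   # alongside the source path the loader records

PDFs follow the same pattern, with a PDF loader in place of the markdown loader.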

RAG Pipeline

  • Text Embedding:
    • Uses Google Generative AI Embeddings for text chunks.
  • Image Embedding:
    • Uses CLIP model for image embeddings.
  • Storage:
    • Embeddings and metadata are stored in ChromaDB collections (portfolio_text, portfolio_images); see the sketch after this list.
  • Retrieval:
    • Queries are matched against both text and image embeddings for multimodal responses.
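
A hedged sketch of the image side, assuming CLIP via sentence-transformers and ChromaDB's persistent client (the model checkpoint and file path are illustrative):

    import chromadb
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    clip = SentenceTransformer("clip-ViT-B-32")            # assumed CLIP checkpoint
    client = chromadb.PersistentClient(path="chroma_db")
    images = client.get_or_create_collection("portfolio_images")

    vec = clip.encode(Image.open("data/images/example.png"))   # hypothetical file
    images.add(ids=["example.png"], embeddings=[vec.tolist()],
               metadatas=[{"source": "data/images/example.png"}])

    # CLIP embeds text into the same space, so a text query can retrieve images
    hits = images.query(query_embeddings=[clip.encode("a robot arm").tolist()], n_results=3)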

Advanced Features

Finite State Machine (FSM)

  • Purpose:
    • Routes user queries to the correct handler (text/image/general).
    • Maintains conversation state and topic transitions.
  • Implementation:
    • Each query is classified (topic, intent, similarity).
    • FSM decides whether to answer, rewrite, or reject based on similarity and gap thresholds.
    • Example:
      # Accept only when the top match is both confident and clearly
      # ahead of the runner-up (similarity and gap come from the classifier):
      if similarity > threshold and gap > min_gap:
          state = 'accepted'
      else:
          state = 'rejected'
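
A fuller sketch of the routing idea (state names, thresholds, and handlers here are assumptions for illustration, not the exact implementation in rag_pipeline.py):

    # Hypothetical handlers standing in for the real ones
    def handle_text(q):    return f"text answer for {q!r}"
    def handle_image(q):   return f"image answer for {q!r}"
    def handle_general(q): return f"general answer for {q!r}"

    ROUTES = {"text": handle_text, "image": handle_image, "general": handle_general}

    class QueryRouter:
        def __init__(self, threshold=0.75, min_gap=0.1):
            self.state = "idle"
            self.threshold, self.min_gap = threshold, min_gap

        def route(self, query, topic, similarity, gap):
            if similarity > self.threshold and gap > self.min_gap:
                self.state = "accepted"
                return ROUTES[topic](query)    # dispatch to the matching handler
            self.state = "rejected"            # reject, or hand off for a rewrite
            return None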

Context Memory & Caching

  • Context Memory:
    • Stores previous queries, responses, and metadata for continuity (a minimal sketch follows this list).
    • Enables context-aware answers and follow-ups.
  • Cache:
    • Frequently accessed queries and embeddings are cached for fast retrieval.
    • Implemented as a local JSON or in-memory cache.
    • Example:
      cache = {}                       # keyed by the raw query string
      def get_from_cache(query):
          return cache.get(query)      # None on a cache miss
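
A minimal context-memory sketch (the class shape and field names are assumptions for illustration):

    from collections import deque

    class ContextMemory:
        def __init__(self, max_turns=10):
            self.turns = deque(maxlen=max_turns)   # oldest turn drops off automatically

        def add(self, query, response, metadata=None):
            self.turns.append({"query": query, "response": response,
                               "metadata": metadata or {}})

        def recent_context(self, n=3):
            # Concatenate the last n turns for conditioning the next answer
            return "\n".join(f"Q: {t['query']}\nA: {t['response']}"
                             for t in list(self.turns)[-n:])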

Async/Await for Fast Rendering

  • Async Processing:
    • Embedding generation and retrieval are performed asynchronously for speed.
    • The Streamlit app uses async to render images and graphs inline without blocking the UI.
    • Example (SentenceTransformer's encode() is synchronous, so it is pushed onto a thread pool rather than awaited directly):
      import asyncio

      async def embed_and_store(texts):
          # encode() blocks, so run it in a worker thread off the event loop
          return await asyncio.get_running_loop().run_in_executor(
              None, embedding_model.encode, texts)
  • Inline Rendering:
    • Images and graphs are displayed in real time using Streamlit's st.image and st.pyplot, as sketched below.
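
A minimal rendering sketch (the file path and plot data are illustrative):

    import matplotlib.pyplot as plt
    import streamlit as st
    from PIL import Image

    st.image(Image.open("data/images/example.png"), caption="Retrieved image")

    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [0.4, 0.7, 0.9])   # e.g. retrieval scores
    st.pyplot(fig)                         # renders the figure inline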

Usage

  1. Run the RAG pipeline:
    python rag_pipeline.py
  2. Start the Streamlit app for visualization:
    streamlit run streamlit_app.py
  3. Interact with the system:
    • Ask questions about text or image data.
    • View inline images and graphs in the UI.

Troubleshooting

  • Missing dependencies:
    • Ensure all packages in requirements.txt are installed.
  • ChromaDB errors:
    • Delete and recreate the chroma_db/ directory if corrupted.
  • API key issues:
    • Check .env for correct Google API key.
  • Async errors:
    • Ensure Python 3.8+ for async/await support.

Contributing

  • Fork the repo and submit pull requests.
  • Open issues for bugs or feature requests.
  • Follow best practices for code quality and documentation.

Author: Shivansh (Shiv-Expert2503)

License: MIT

Contact: GitHub Issues


This README provides a comprehensive guide to the MULTIMODAL_RAG project, covering everything from setup to advanced features like FSM, context memory, caching, and async rendering. For further details, refer to the code and comments in each module.
