Multi-modal pipeline for querying invoices and receipts. Integrates CLIP embeddings for image understanding, ChromaDB for retrieval, and Google Gemini for reasoning and Q&A.

MeghneelG0/Multi-Modal-RAG

hostingLLM: Multi-modal RAG Pipeline

A basic, end-to-end Retrieval-Augmented Generation (RAG) pipeline that combines image and text retrieval with generative AI. Uses Google Gemini for multi-modal generation, OpenAI CLIP for embeddings, and ChromaDB for vector search.

Features

  • Multi-modal (image + text) semantic search and extraction
  • Google Gemini LLM integration for document analysis
  • OpenAI CLIP for image and text embeddings
  • ChromaDB for fast vector search
  • Docker support for easy deployment

Setup

  1. Clone the repo:
    git clone <your-repo-url>
    cd hostingLLM
  2. Install dependencies:
    pip install -r requirements.txt
    Or use Docker:
    docker build -t hostingllm .
    docker run --env-file .env hostingllm
  3. Set up your .env file:
    GOOGLE_API_KEY=your_google_api_key_here
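As a minimal sketch of how a script might pick up this key, the snippet below parses simple KEY=value lines and reads GOOGLE_API_KEY, preferring a value already set in the environment. The helper names parse_env and load_api_key are hypothetical and do not come from this repository; the actual scripts may rely on a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path: str = ".env") -> str:
    """Return GOOGLE_API_KEY from the environment, falling back to a .env file."""
    key = os.environ.get("GOOGLE_API_KEY")
    path = Path(env_path)
    if not key and path.is_file():
        key = parse_env(path.read_text(encoding="utf-8")).get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication failure deep inside an API call.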

Usage

1. Index your images

python index_image.py

This embeds every image in the docs/ folder and stores the embeddings in ChromaDB.
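The indexing step can be pictured roughly as follows: find the image files, embed each one, and add ids, embeddings, and metadata to a collection. This is a sketch under assumptions, not the contents of index_image.py. The names find_images, index_images, and IMAGE_SUFFIXES are hypothetical, the embed callable stands in for a CLIP image encoder, and the collection is any object with an add(ids=..., embeddings=..., metadatas=...) method, which is the shape of ChromaDB's collection API.

```python
from pathlib import Path
from typing import Callable, List, Sequence

# Assumed set of image formats; adjust to match the files in docs/.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(folder: str) -> List[Path]:
    """Return image files directly under `folder`, sorted for stable ids."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def index_images(
    paths: Sequence[Path],
    embed: Callable[[Path], List[float]],
    collection,
) -> int:
    """Embed each image and add it to the collection; return the count added."""
    if not paths:
        return 0
    collection.add(
        ids=[p.name for p in paths],              # file name doubles as the id
        embeddings=[embed(p) for p in paths],     # one vector per image
        metadatas=[{"path": str(p)} for p in paths],
    )
    return len(paths)
```

With ChromaDB installed, the collection could come from chromadb.PersistentClient(path=...).get_or_create_collection(...), and embed would wrap whichever CLIP model the project loads. Passing both in as arguments keeps the file-handling logic testable without either dependency.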

2. Retrieve and generate

Edit retrieve_and_generate.py to set your query and prompt, then run:

python retrieve_and_generate.py
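Conceptually, the retrieval half of this step ranks the stored embeddings by similarity to the query embedding and hands the best matches to Gemini along with the prompt. The pure-Python sketch below illustrates that ranking with cosine similarity. In the repository ChromaDB performs the search, and its distance metric may differ; cosine and top_k are hypothetical names used only for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: List[float],
    index: Dict[str, List[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real CLIP vectors have hundreds of dimensions.
    index = {
        "invoice.png": [1.0, 0.0],
        "receipt.png": [0.0, 1.0],
        "mixed.png": [1.0, 1.0],
    }
    for doc_id, score in top_k([1.0, 0.1], index, k=2):
        print(f"{doc_id}: {score:.3f}")
```

Because CLIP maps text and images into the same vector space, a text query embedding can be compared directly against image embeddings in this way.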

3. Gemini single-image demo

python geminivllm.py

This runs a simple Gemini demo on a single image and prompt.

4. Main demo (optional, text-only RAG)

python main.py

This runs a text-only RAG pipeline using SentenceTransformers, FAISS, and vLLM.

Environment Variables

  • GOOGLE_API_KEY: Your Google Gemini API key (required)

Folder Structure

  • index_image.py — Indexes images into ChromaDB
  • retrieve_and_generate.py — Retrieves relevant images and runs Gemini
  • geminivllm.py — Gemini single-image demo
  • main.py — (Optional) Text-only RAG demo
  • docs/ — Sample images and test files

License

MIT
