Interpretable Cross-Modal Retrieval with CLIP and Explainability

This project explores the use of CLIP (Contrastive Language–Image Pretraining) for bidirectional retrieval between images and text, with a focus on explainability. Users can query with either an image or a caption, retrieve the most relevant results from a dataset, and understand why those results were returned through attribution methods.

Features

  • Bidirectional retrieval (see the retrieval sketch below):

    • Text → Images and Captions
    • Image → Images and Captions
  • Explainability:

    • Visual heatmaps for image relevance
    • Word-level attribution for caption relevance
  • Interactive Gradio app for uploading images or entering captions.
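
Under the hood, retrieval reduces to ranking precomputed CLIP embeddings by cosine similarity with the query embedding. The sketch below illustrates that ranking step, assuming both the query and the dataset embeddings are L2-normalized; the function and variable names are illustrative, not the project's actual code.

import torch

def retrieve(query_emb: torch.Tensor, dataset_embs: torch.Tensor, k: int = 5):
    # query_emb: (d,), dataset_embs: (N, d). With L2-normalized embeddings,
    # the dot product equals cosine similarity.
    scores = dataset_embs @ query_emb
    top = torch.topk(scores, k)
    return top.indices.tolist(), top.values.tolist()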

Dataset

  • Uses the Flickr8k Dataset:

    • 8,000 images, each with 5 descriptive captions
    • Diverse scenes and everyday activities
    • Ideal for testing multimodal AI systems

How to Run

Clone & Set Up Environment

git clone https://github.com/sevdaimany/Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability.git
cd Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability

python -m venv env
source env/bin/activate  # or .\env\Scripts\activate on Windows
pip install -r requirements.txt

Optional: Install CUDA-Specific PyTorch

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Prepare Dataset

  python prepare_datasets.py

Generate Embeddings

  python generate_embeddings.py
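
For reference, the sketch below shows what CLIP embedding generation typically looks like with the Hugging Face transformers API. It is an illustrative example, not necessarily the exact contents of generate_embeddings.py; the model checkpoint and helper names are assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity

@torch.no_grad()
def embed_texts(captions):
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)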

Launch Gradio App

python inference.py

Explainability Methods

Image (Visual Explanation)

  • Computes cosine similarity between patch-level embeddings and the query embedding.

  • Visualized as a heatmap overlaid on the original image.

  • Highlights regions that contributed most to retrieval.
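
A rough sketch of this idea: project the vision encoder's patch tokens into CLIP's joint embedding space and score each patch against the query embedding. The model choice and function names are illustrative assumptions; the project's implementation may differ in detail.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def patch_heatmap(image_path: str, query_emb: torch.Tensor) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    pixels = processor(images=image, return_tensors="pt")["pixel_values"]
    tokens = model.vision_model(pixel_values=pixels).last_hidden_state[0]  # (1 + num_patches, hidden)
    patches = model.vision_model.post_layernorm(tokens[1:])                # drop the CLS token
    patches = model.visual_projection(patches)                             # into the joint CLIP space
    patches = patches / patches.norm(dim=-1, keepdim=True)
    query = query_emb / query_emb.norm()
    sims = patches @ query                # cosine similarity per patch
    side = int(sims.numel() ** 0.5)       # e.g. ViT-B/32 at 224 px -> 7x7 grid
    return sims.reshape(side, side)       # upsample and overlay to get the final heatmap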

Text (Caption Explanation)

  • Uses word occlusion attribution:

      1) Each word is removed one at a time.
      2) Similarity to the reference embedding is recomputed.
      3) Drop in similarity quantifies that word’s importance.
      4) Result is a heatmap-style HTML output highlighting important words.
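
A minimal sketch of that occlusion loop, assuming an embed_texts() helper like the one sketched under "Generate Embeddings" that returns L2-normalized CLIP text embeddings (names are illustrative, not the project's actual code):

import torch

def word_importance(caption: str, reference_emb: torch.Tensor) -> list[tuple[str, float]]:
    words = caption.split()
    base = float(embed_texts([caption])[0] @ reference_emb)   # similarity of the full caption
    scores = []
    for i, word in enumerate(words):
        occluded = " ".join(words[:i] + words[i + 1:])         # remove one word at a time
        sim = float(embed_texts([occluded])[0] @ reference_emb)
        scores.append((word, base - sim))                      # drop in similarity = importance
    return scores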
    

Screenshots

Image Input → Caption & Image Retrieval

Text Input → Image & Caption Retrieval
