This project explores the use of CLIP (Contrastive Language–Image Pretraining) for bidirectional retrieval between images and text, with a focus on explainability. Users can query with either an image or a caption, retrieve the most relevant results from a dataset, and understand why those results were returned through attribution methods.
- Text → Images and Captions
- Image → Images and Captions
- Visual heatmaps for image relevance
- Word-level attribution for caption relevance
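Under the hood, retrieval follows the standard CLIP recipe: both modalities are embedded into a shared space and ranked by cosine similarity. The sketch below illustrates that flow with the Hugging Face `transformers` CLIP wrapper; the checkpoint name, helper functions, and example file names are illustrative assumptions rather than this repository's exact code.

```python
# Minimal sketch of CLIP-based retrieval (illustrative; the repo's actual
# pipeline may use a different CLIP wrapper or checkpoint).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_text(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)                    # L2-normalize

@torch.no_grad()
def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def top_k(query_emb, gallery_embs, k=5):
    # On normalized embeddings, cosine similarity reduces to a dot product.
    sims = query_emb @ gallery_embs.T
    return sims.topk(k, dim=-1)

# Example: text query against a tiny image gallery (hypothetical file names).
gallery = embed_images([Image.open(p).convert("RGB") for p in ["dog.jpg", "beach.jpg"]])
scores, indices = top_k(embed_text(["a dog running on the beach"]), gallery, k=2)
```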
Uses the Flickr8k Dataset:

- 8,000 images, each with 5 descriptive captions
- Diverse scenes and everyday activities
- Ideal for testing multimodal AI systems
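For reference, the original Flickr8k annotation file (`Flickr8k.token.txt`) stores one caption per line, keyed as `image_name#index`. A minimal parser for that format might look like the sketch below; the file name and grouping logic are assumptions, and `prepare_datasets.py` may organize the data differently.

```python
# Sketch: group the five Flickr8k captions per image (assumes the original
# Flickr8k.token.txt format: "<image_name>#<index>\t<caption>" per line).
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):     # path is an assumption
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_key, caption = line.split("\t", 1)
            image_name = image_key.split("#")[0]         # drop the "#0".."#4" suffix
            captions[image_name].append(caption)
    return captions                                       # ~8,000 images x 5 captions each
```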
### Clone & Set Up Environment

```bash
git clone https://github.com/sevdaimany/Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability.git
cd Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability
python -m venv env
source env/bin/activate   # or .\env\Scripts\activate on Windows
pip install -r requirements.txt
```

### Optional: Install CUDA-Specific PyTorch
```bash
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```

### Prepare Dataset

```bash
python prepare_datasets.py
```

### Generate Embeddings

```bash
python generate_embeddings.py
```
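Conceptually, this step encodes every gallery image once with CLIP and caches the normalized vectors so that queries only need a single forward pass at inference time. The sketch below illustrates the idea; the paths, batch size, and `.pt` output format are assumptions, not a description of `generate_embeddings.py`.

```python
# Sketch: precompute and cache normalized CLIP image embeddings for the gallery.
# Data layout, batch size, and output file name are illustrative assumptions.
import glob
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(glob.glob("data/Flickr8k/images/*.jpg"))   # assumed directory layout
chunks = []
with torch.no_grad():
    for i in range(0, len(paths), 64):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + 64]]
        inputs = processor(images=images, return_tensors="pt").to(device)
        feats = model.get_image_features(**inputs)
        chunks.append(torch.nn.functional.normalize(feats, dim=-1).cpu())

torch.save({"paths": paths, "embeddings": torch.cat(chunks)}, "image_embeddings.pt")
```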
### Launch Gradio App

```bash
python inference.py
```
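The Gradio app wraps the retrieval and explanation pipeline in a web UI. A stripped-down text-to-image version might look like the following; the cached-embedding format and the interface layout are assumptions carried over from the sketch above, and `inference.py` likely exposes more (image queries, heatmaps, highlighted captions).

```python
# Sketch: a minimal Gradio front end around text-to-image retrieval.
# Assumes the "image_embeddings.pt" cache format from the sketch above.
import gradio as gr
import torch
from transformers import CLIPModel, CLIPProcessor

cache = torch.load("image_embeddings.pt")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def retrieve_images(query: str, k):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    q = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    sims = (q @ cache["embeddings"].T).squeeze(0)          # cosine similarity to every image
    top = sims.topk(int(k)).indices.tolist()
    return [cache["paths"][i] for i in top]                # Gallery accepts file paths

demo = gr.Interface(
    fn=retrieve_images,
    inputs=[gr.Textbox(label="Caption query"), gr.Slider(1, 10, value=5, step=1, label="Top-k")],
    outputs=gr.Gallery(label="Retrieved images"),
    title="Interpretable Cross-Modal Retrieval with CLIP (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```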
### Image (Visual Explanation)

- Computes cosine similarity between patch-level embeddings and the query embedding (sketched below).
- Visualized as a heatmap overlaid on the original image.
- Highlights regions that contributed most to retrieval.
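One way to realize this is to project CLIP's per-patch vision tokens into the shared embedding space and score each patch against the query embedding. The sketch below does that with the Hugging Face CLIP implementation; projecting patch tokens through `visual_projection` (which CLIP normally applies only to the CLS token) is a common approximation and an assumption about this repository's exact method.

```python
# Sketch: patch-level relevance heatmap for an image given a text query.
import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def patch_heatmap(image: Image.Image, query: str) -> np.ndarray:
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)

    # Query embedding in the shared space.
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

    # Patch token embeddings (drop the CLS token), projected into the same space.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:, :]              # [1, N, hidden]
    patches = model.vision_model.post_layernorm(patches)
    patches = model.visual_projection(patches)                    # [1, N, dim]
    patches = torch.nn.functional.normalize(patches, dim=-1)

    # Cosine similarity of each patch to the query, reshaped to the patch grid.
    sims = (patches @ text_emb.unsqueeze(-1)).squeeze()           # [N]
    grid = int(sims.numel() ** 0.5)                               # 7x7 for ViT-B/32 at 224px
    heat = sims.reshape(grid, grid)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

    # Upsample to the image size so it can be blended over the original image.
    heat = torch.nn.functional.interpolate(heat[None, None], size=image.size[::-1],
                                           mode="bilinear", align_corners=False)
    return heat.squeeze().numpy()

# heat = patch_heatmap(Image.open("dog.jpg").convert("RGB"), "a dog running on the beach")
```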
### Text (Caption Explanation)

Uses word occlusion attribution (a minimal sketch follows the steps below):

1. Each word is removed one at a time.
2. Similarity to the reference embedding is recomputed.
3. The drop in similarity quantifies that word's importance.
4. The result is a heatmap-style HTML output highlighting the important words.
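A compact version of that procedure is sketched below; whitespace tokenization and the Hugging Face CLIP wrapper are simplifying assumptions, and the repository may split words and render the highlighted HTML differently.

```python
# Sketch of word-occlusion attribution: remove each word, re-embed the caption,
# and measure how much the similarity to a reference embedding drops.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def _embed(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

@torch.no_grad()
def word_occlusion_scores(caption: str, reference_emb: torch.Tensor):
    """reference_emb: normalized embedding of the query (image or text), shape [1, dim]."""
    words = caption.split()
    base_sim = (_embed([caption]) @ reference_emb.T).item()

    # Rebuild the caption with one word dropped at a time and re-score it.
    occluded = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    occluded_sims = (_embed(occluded) @ reference_emb.T).squeeze(-1)

    # Importance = drop in similarity when the word is missing.
    return [(w, base_sim - s.item()) for w, s in zip(words, occluded_sims)]
```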

