This project explores the use of CLIP (Contrastive Language–Image Pretraining) for bidirectional retrieval between images and text, with a focus on explainability. Users can query with either an image or a caption, retrieve the most relevant results from a dataset, and understand why those results were returned through attribution methods.
- Text → Images and Captions
- Image → Images and Captions
- Visual heatmaps for image relevance
- Word-level attribution for caption relevance
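Under the hood, retrieval follows the standard CLIP recipe: both modalities are embedded into a shared space and ranked by cosine similarity. The sketch below illustrates that flow with the Hugging Face `transformers` CLIP wrapper; the checkpoint name, helper functions, and example file names are illustrative assumptions rather than this repository's exact code.

```python
# Minimal sketch of CLIP-based retrieval (illustrative; the repo's actual
# pipeline may use a different CLIP wrapper or checkpoint).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_text(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)                    # L2-normalize

@torch.no_grad()
def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def top_k(query_emb, gallery_embs, k=5):
    # On normalized embeddings, cosine similarity reduces to a dot product.
    sims = query_emb @ gallery_embs.T
    return sims.topk(k, dim=-1)

# Example: text query against a tiny image gallery (hypothetical file names).
gallery = embed_images([Image.open(p).convert("RGB") for p in ["dog.jpg", "beach.jpg"]])
scores, indices = top_k(embed_text(["a dog running on the beach"]), gallery, k=2)
```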
Uses the Flickr8k Dataset:

- 8,000 images, each with 5 descriptive captions
- Diverse scenes and everyday activities
- Ideal for testing multimodal AI systems
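For reference, the original Flickr8k annotation file (`Flickr8k.token.txt`) stores one caption per line, keyed as `image_name#index`. A minimal parser for that format might look like the sketch below; the file name and grouping logic are assumptions, and `prepare_datasets.py` may organize the data differently.

```python
# Sketch: group the five Flickr8k captions per image (assumes the original
# Flickr8k.token.txt format: "<image_name>#<index>\t<caption>" per line).
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):     # path is an assumption
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_key, caption = line.split("\t", 1)
            image_name = image_key.split("#")[0]         # drop the "#0".."#4" suffix
            captions[image_name].append(caption)
    return captions                                       # ~8,000 images x 5 captions each
```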
### Clone & Set Up Environment

```bash
git clone https://github.com/sevdaimany/Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability.git
cd Interpretable-Cross-Modal-Retrieval-with-CLIP-and-Explainability
python -m venv env
source env/bin/activate   # or .\env\Scripts\activate on Windows
pip install -r requirements.txt
```

### Optional: Install CUDA-Specific PyTorch
```bash
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```

### Prepare Dataset

```bash
python prepare_datasets.py
```

### Generate Embeddings

```bash
python generate_embeddings.py
```
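Conceptually, this step encodes every gallery image once with CLIP and caches the normalized vectors so that queries only need a single forward pass at inference time. The sketch below illustrates the idea; the paths, batch size, and `.pt` output format are assumptions, not a description of `generate_embeddings.py`.

```python
# Sketch: precompute and cache normalized CLIP image embeddings for the gallery.
# Data layout, batch size, and output file name are illustrative assumptions.
import glob
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(glob.glob("data/Flickr8k/images/*.jpg"))   # assumed directory layout
chunks = []
with torch.no_grad():
    for i in range(0, len(paths), 64):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + 64]]
        inputs = processor(images=images, return_tensors="pt").to(device)
        feats = model.get_image_features(**inputs)
        chunks.append(torch.nn.functional.normalize(feats, dim=-1).cpu())

torch.save({"paths": paths, "embeddings": torch.cat(chunks)}, "image_embeddings.pt")
```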
### Launch Gradio App

```bash
python inference.py
```
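The Gradio app wraps the retrieval and explanation pipeline in a web UI. A stripped-down text-to-image version might look like the following; the cached-embedding format and the interface layout are assumptions carried over from the sketch above, and `inference.py` likely exposes more (image queries, heatmaps, highlighted captions).

```python
# Sketch: a minimal Gradio front end around text-to-image retrieval.
# Assumes the "image_embeddings.pt" cache format from the sketch above.
import gradio as gr
import torch
from transformers import CLIPModel, CLIPProcessor

cache = torch.load("image_embeddings.pt")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def retrieve_images(query: str, k):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    q = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    sims = (q @ cache["embeddings"].T).squeeze(0)          # cosine similarity to every image
    top = sims.topk(int(k)).indices.tolist()
    return [cache["paths"][i] for i in top]                # Gallery accepts file paths

demo = gr.Interface(
    fn=retrieve_images,
    inputs=[gr.Textbox(label="Caption query"), gr.Slider(1, 10, value=5, step=1, label="Top-k")],
    outputs=gr.Gallery(label="Retrieved images"),
    title="Interpretable Cross-Modal Retrieval with CLIP (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```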
### Image (Visual Explanation)

- Computes cosine similarity between patch-level embeddings and the query embedding (sketched below).
- Visualized as a heatmap overlaid on the original image.
- Highlights regions that contributed most to retrieval.
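One way to realize this is to project CLIP's per-patch vision tokens into the shared embedding space and score each patch against the query embedding. The sketch below does that with the Hugging Face CLIP implementation; projecting patch tokens through `visual_projection` (which CLIP normally applies only to the CLS token) is a common approximation and an assumption about this repository's exact method.

```python
# Sketch: patch-level relevance heatmap for an image given a text query.
import torch
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def patch_heatmap(image: Image.Image, query: str) -> np.ndarray:
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)

    # Query embedding in the shared space.
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

    # Patch token embeddings (drop the CLS token), projected into the same space.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:, :]              # [1, N, hidden]
    patches = model.vision_model.post_layernorm(patches)
    patches = model.visual_projection(patches)                    # [1, N, dim]
    patches = torch.nn.functional.normalize(patches, dim=-1)

    # Cosine similarity of each patch to the query, reshaped to the patch grid.
    sims = (patches @ text_emb.unsqueeze(-1)).squeeze()           # [N]
    grid = int(sims.numel() ** 0.5)                               # 7x7 for ViT-B/32 at 224px
    heat = sims.reshape(grid, grid)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

    # Upsample to the image size so it can be blended over the original image.
    heat = torch.nn.functional.interpolate(heat[None, None], size=image.size[::-1],
                                           mode="bilinear", align_corners=False)
    return heat.squeeze().numpy()

# heat = patch_heatmap(Image.open("dog.jpg").convert("RGB"), "a dog running on the beach")
```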
### Text (Caption Explanation)

Uses word occlusion attribution (a minimal sketch follows the steps below):

1. Each word is removed one at a time.
2. Similarity to the reference embedding is recomputed.
3. The drop in similarity quantifies that word's importance.
4. The result is a heatmap-style HTML output highlighting the important words.
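A compact version of that procedure is sketched below; whitespace tokenization and the Hugging Face CLIP wrapper are simplifying assumptions, and the repository may split words and render the highlighted HTML differently.

```python
# Sketch of word-occlusion attribution: remove each word, re-embed the caption,
# and measure how much the similarity to a reference embedding drops.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def _embed(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

@torch.no_grad()
def word_occlusion_scores(caption: str, reference_emb: torch.Tensor):
    """reference_emb: normalized embedding of the query (image or text), shape [1, dim]."""
    words = caption.split()
    base_sim = (_embed([caption]) @ reference_emb.T).item()

    # Rebuild the caption with one word dropped at a time and re-score it.
    occluded = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    occluded_sims = (_embed(occluded) @ reference_emb.T).squeeze(-1)

    # Importance = drop in similarity when the word is missing.
    return [(w, base_sim - s.item()) for w, s in zip(words, occluded_sims)]
```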

