Skip to content

sheikhmunim/deep_learning_for_entailment_prediction-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

9 Commits
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Visual Entailment using Multimodal Cross-Attention

Deep Learning Assignment 2 โ€“ COSC2779 (RMIT University)


๐Ÿ“˜ Project Overview

This project explores Visual Entailment (VE) โ€” determining whether an image entails, contradicts, or is neutral with respect to a textual hypothesis.
We designed a multimodal cross-attention model that fuses textual and visual features to perform reasoning across both modalities.

Our approach combines:

  • BERT for textual representation
  • EfficientNetB0 for visual feature extraction
  • Cross-Attention Layer to align the two modalities
  • Joint classification head for entailment prediction

๐Ÿง  Motivation

Traditional models handle language and vision separately.
However, tasks like โ€œA man holding an umbrella โ†’ The image shows itโ€™s rainingโ€ require reasoning across both.
This project aims to bridge that gap by designing an interpretable, multimodal, and end-to-end framework.


๐Ÿงฉ Methodology

1. Data

We use a SNLI-VE-style dataset containing (image, premise, hypothesis, label) triples.
The dataset is split into train / val / test sets.

2. Architecture

  • Text Encoder: bert-base-uncased
  • Image Encoder: EfficientNetB0 (pretrained on ImageNet)
  • Fusion: Cross-attention over the joint feature space
  • Classifier: Two fully connected layers โ†’ 3-way softmax

3. Training Strategy

Two-stage training for stability and generalization:

  1. Stage 1 (Frozen Encoders): Train fusion + classifier only
  2. Stage 2 (Fine-tuning): Unfreeze top EfficientNet layers + last N BERT blocks

Optimizer: AdamW with weight decay
Learning Rate: Scheduled warmup and decay
Regularization: Dropout + Early Stopping


๐Ÿ“Š Results

Metric Validation Test
Accuracy ~77% ~76%
ROC-AUC 0.85 0.84
F1 (macro) 0.81 0.80

Interpretability:

  • Grad-CAM highlights image regions driving visual reasoning.
  • SHAP explains key textual tokens influencing the decision.


โš™๏ธ Installation

# Clone the repo
git clone https://github.com/sheikhmunim/deep_learning_for_entailment_prediction-.git
cd deep_learning_assign-2

# Create environment
python -m venv venv
source venv/bin/activate   # (or venv\Scripts\activate on Windows)

# Install dependencies
pip install -r requirements.txt

About

This repo contains code for entailment prediction

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published