Deep Learning Assignment 2 – COSC2779 (RMIT University)
This project explores Visual Entailment (VE): determining whether an image entails, contradicts, or is neutral with respect to a textual hypothesis.
We designed a multimodal cross-attention model that fuses textual and visual features to perform reasoning across both modalities.
Our approach combines:
- BERT for textual representation
- EfficientNetB0 for visual feature extraction
- Cross-Attention Layer to align the two modalities (sketched after this list)
- Joint classification head for entailment prediction
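The fusion step is the core of the design. Below is a minimal sketch of a cross-attention fusion layer in PyTorch, assuming text tokens act as queries over flattened image patch features; the class name, hidden size, and mean-pooling choice are illustrative assumptions rather than the exact project implementation.

```python
# Sketch only: text features attend over image patch features (assumed shapes).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens (queries) attend over image patch features (keys/values)."""
    def __init__(self, text_dim=768, image_dim=1280, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project BERT token embeddings
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # project CNN patch features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, 768)  token embeddings from BERT
        # image_feats: (B, P, 1280) flattened spatial features from EfficientNetB0
        q = self.text_proj(text_feats)
        kv = self.image_proj(image_feats)
        attended, _ = self.cross_attn(query=q, key=kv, value=kv)
        fused = self.norm(q + attended)   # residual connection + layer norm
        return fused.mean(dim=1)          # (B, hidden_dim) pooled joint feature
```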
Traditional models handle language and vision separately.
However, tasks like "A man holding an umbrella → The image shows it's raining" require reasoning across both.
This project aims to bridge that gap by designing an interpretable, multimodal, and end-to-end framework.
We use an SNLI-VE-style dataset containing (image, premise, hypothesis, label) examples, where the image serves as the visual premise.
The dataset is split into train / val / test sets.
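A minimal sketch of how such examples could be wrapped in a PyTorch Dataset; the CSV column names, label strings, tokenizer settings, and image transform are assumptions for illustration, not the project's actual loader.

```python
# Sketch only: field names and preprocessing choices are assumed, not prescribed.
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoTokenizer

class VEDataset(Dataset):
    LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}

    def __init__(self, csv_path, image_dir, max_len=64):
        self.df = pd.read_csv(csv_path)   # assumed columns: image, hypothesis, label
        self.image_dir = image_dir
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.max_len = max_len
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # The image acts as the premise; the hypothesis is the text to verify against it.
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        enc = self.tokenizer(row["hypothesis"], padding="max_length",
                             truncation=True, max_length=self.max_len, return_tensors="pt")
        return {
            "pixel_values": self.transform(image),
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "label": torch.tensor(self.LABELS[row["label"]]),
        }
```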
- Text Encoder: bert-base-uncased
- Image Encoder: EfficientNetB0 (pretrained on ImageNet)
- Fusion: Cross-attention over the joint feature space
- Classifier: Two fully connected layers → 3-way softmax (assembled in the model sketch below)
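A minimal end-to-end sketch of this architecture, reusing the CrossAttentionFusion module sketched earlier; bert-base-uncased is loaded via Hugging Face Transformers and EfficientNetB0 via torchvision, while the hidden size and dropout rate are illustrative assumptions.

```python
# Sketch only: assembles the encoders, fusion layer, and classifier head.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights
from transformers import AutoModel

class VEModel(nn.Module):
    def __init__(self, hidden_dim=512, num_classes=3, dropout=0.3):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.image_encoder = efficientnet_b0(
            weights=EfficientNet_B0_Weights.IMAGENET1K_V1).features
        self.fusion = CrossAttentionFusion(text_dim=768, image_dim=1280, hidden_dim=hidden_dim)
        self.classifier = nn.Sequential(   # two fully connected layers -> 3-way output
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feats = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state  # (B, T, 768)
        img_maps = self.image_encoder(pixel_values)          # (B, 1280, 7, 7) for 224x224 input
        image_feats = img_maps.flatten(2).transpose(1, 2)    # (B, 49, 1280) patch sequence
        fused = self.fusion(text_feats, image_feats)          # (B, hidden_dim)
        return self.classifier(fused)                         # logits; softmax applied in the loss
```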
Two-stage training for stability and generalization (see the freezing sketch after this list):
- Stage 1 (Frozen Encoders): Train fusion + classifier only
- Stage 2 (Fine-tuning): Unfreeze top EfficientNet layers + last N BERT blocks
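A sketch of how the two stages could be wired, assuming the VEModel above; which and how many encoder layers to unfreeze in Stage 2 is an assumption.

```python
# Sketch only: Stage 1 freezes both encoders; Stage 2 unfreezes their top blocks.
def set_stage(model, stage, n_bert_blocks=2, n_effnet_blocks=2):
    # Stage 1: train only the fusion layer and classifier head.
    for p in model.text_encoder.parameters():
        p.requires_grad = False
    for p in model.image_encoder.parameters():
        p.requires_grad = False

    if stage == 2:
        # Unfreeze the last N BERT encoder blocks ...
        for block in model.text_encoder.encoder.layer[-n_bert_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
        # ... and the top EfficientNet feature blocks.
        for block in model.image_encoder[-n_effnet_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
```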
- Optimizer: AdamW with weight decay
- Learning rate: scheduled warmup and decay
- Regularization: dropout + early stopping
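A minimal training-configuration sketch, assuming Hugging Face's get_linear_schedule_with_warmup for the warmup-and-decay schedule; the learning rate, weight decay, warmup fraction, and patience are illustrative values.

```python
# Sketch only: AdamW + linear warmup/decay + simple early stopping on validation loss.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, train_steps, lr=2e-5, weight_decay=0.01, warmup_frac=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * train_steps),
        num_training_steps=train_steps,
    )
    return optimizer, scheduler

class EarlyStopping:
    def __init__(self, patience=3):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, val_loss):
        # Returns True when validation loss has not improved for `patience` epochs.
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```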
| Metric | Validation | Test |
|---|---|---|
| Accuracy | ~77% | ~76% |
| ROC-AUC | 0.85 | 0.84 |
| F1 (macro) | 0.81 | 0.80 |
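For reference, a sketch of how these metrics can be computed with scikit-learn from model logits; variable names and shapes are placeholders.

```python
# Sketch only: accuracy, macro F1, and one-vs-rest ROC-AUC from logits.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, logits):
    # y_true: (N,) integer labels; logits: (N, 3) numpy array of model outputs
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    probs /= probs.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, preds),
        "f1_macro": f1_score(y_true, preds, average="macro"),
        "roc_auc": roc_auc_score(y_true, probs, multi_class="ovr"),
    }
```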
Interpretability:
- Grad-CAM highlights the image regions driving visual reasoning (sketched below).
- SHAP explains key textual tokens influencing the decision.
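A minimal Grad-CAM sketch over the image encoder using forward/backward hooks; this is an illustrative implementation against the VEModel sketched above, not necessarily the exact code used in the project (SHAP is omitted here).

```python
# Sketch only: Grad-CAM heatmap over the last EfficientNet feature block.
import torch
import torch.nn.functional as F

def grad_cam(model, input_ids, attention_mask, pixel_values, target_class):
    feats, grads = {}, {}
    layer = model.image_encoder[-1]  # last feature block of EfficientNetB0

    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    pixel_values = pixel_values.requires_grad_(True)  # ensure gradients flow through the image branch
    logits = model(input_ids, attention_mask, pixel_values)
    logits[0, target_class].backward()                # assumes a batch of one image
    h1.remove()
    h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))        # (B, H, W) weighted activation map
    cam = F.interpolate(cam.unsqueeze(1), size=pixel_values.shape[-2:],
                        mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalised heatmap in [0, 1]
```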
```bash
# Clone the repo
git clone https://github.com/sheikhmunim/deep_learning_for_entailment_prediction-.git
cd deep_learning_for_entailment_prediction-

# Create environment
python -m venv venv
source venv/bin/activate  # (or venv\Scripts\activate on Windows)

# Install dependencies
pip install -r requirements.txt
```

