This project focuses on Multimodal Sarcasm Explanation (MuSE): generating natural language explanations for sarcastic social media posts by leveraging both textual and visual information. It is based on a simplified version of the TURBO architecture proposed in this PAPER.
The dataset consists of sarcastic posts from Twitter, Instagram, and Tumblr. Each post includes:
- An image
- A text caption
- A sarcasm explanation
- A sarcasm target
The dataset folder contains:

- `train_df.tsv`, `val_df.tsv`, `test_df.tsv`: main data with `pid`, `text`, `explanation`, and `target_of_sarcasm` columns
- `D_*.pkl`: image descriptions (from BLIP or a similar model)
- `O_*.pkl`: object detection labels (from YOLOv9)
- `images/`: folder with all post images
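These files can be loaded with standard tooling; a minimal sketch, assuming the pickles are dicts keyed by `pid` and that the pickle names carry a per-split suffix (both assumptions about the exact layout):

```python
# Minimal data-loading sketch; the pickle file names and their structure
# (dicts keyed by pid) are assumptions, not confirmed by the repo.
import pickle
import pandas as pd

DATA_DIR = "MORE-PLUS-DATASET"

train_df = pd.read_csv(f"{DATA_DIR}/train_df.tsv", sep="\t")

with open(f"{DATA_DIR}/D_train.pkl", "rb") as f:   # image descriptions
    descriptions = pickle.load(f)
with open(f"{DATA_DIR}/O_train.pkl", "rb") as f:   # object detection labels
    objects = pickle.load(f)

print(train_df[["pid", "text", "explanation", "target_of_sarcasm"]].head())
```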
The model builds on two pretrained backbones:

- `facebook/bart-base`: pretrained BART model for conditional text generation
- `google/vit-base-patch16-224-in21k`: pretrained Vision Transformer for image encoding
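Both can be loaded straight from the Hugging Face Hub; the image file name below is hypothetical:

```python
# Load the two pretrained backbones and encode one example image.
from PIL import Image
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          ViTImageProcessor, ViTModel)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("MORE-PLUS-DATASET/images/example.jpg")  # hypothetical file
pixels = processor(images=image, return_tensors="pt")
img_emb = vit(**pixels).last_hidden_state  # (1, 197, 768) patch embeddings
```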
Their embeddings are combined by a custom fusion module that:

- Applies multi-head self-attention to the text and image embeddings
- Computes gated cross-modal attention (text-guided vision and vision-guided text)
- Produces a fused representation that is passed to BART as `inputs_embeds`
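A minimal PyTorch sketch of such a module, assuming 768-dimensional embeddings (the hidden size of both backbones); the head count, gating formulation, and the way the two fused streams are combined are assumptions rather than the reference implementation:

```python
# Sketch of a gated cross-modal fusion module; dimensions, head count, and
# the gating scheme are assumptions, not the reference implementation.
import torch
import torch.nn as nn

class SharedFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_t = nn.Linear(2 * dim, dim)
        self.gate_v = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, img_emb):
        # Self-attention within each modality.
        t, _ = self.text_self(text_emb, text_emb, text_emb)
        v, _ = self.img_self(img_emb, img_emb, img_emb)
        # Cross-modal attention: each modality queries the other.
        t_cross, _ = self.img_to_text(t, v, v)   # text enriched by vision
        v_cross, _ = self.text_to_img(v, t, t)   # vision enriched by text
        # Sigmoid gates decide how much cross-modal signal to mix in.
        g_t = torch.sigmoid(self.gate_t(torch.cat([t, t_cross], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([v, v_cross], dim=-1)))
        fused_t = g_t * t_cross + (1 - g_t) * t
        fused_v = g_v * v_cross + (1 - g_v) * v
        # Concatenate along the sequence axis; fed to BART as inputs_embeds.
        return torch.cat([fused_t, fused_v], dim=1)
```

The concatenated output has shape `(batch, text_len + image_len, 768)`, which matches BART-base's hidden size, so it can be fed to the generator as `bart(inputs_embeds=fused, labels=...)`.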
In addition, the sarcasm target is concatenated to the textual input, so the generated explanation is conditioned on it.
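The exact template is not shown here; the snippet below is one hypothetical way to splice the target in (field order and separators are assumptions):

```python
# Hypothetical input template; the separators and field order used in the
# actual notebook may differ.
def build_input(text, target, description, objects):
    return (f"{text} </s> target: {target} </s> "
            f"description: {description} </s> objects: {objects}")
```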
Model outputs were evaluated on the validation and test sets using:
| Metric | Score |
|---|---|
| BLEU-1 | 0.5394 |
| BLEU-2 | 0.4449 |
| BLEU-3 | 0.3830 |
| BLEU-4 | 0.3296 |
| ROUGE-1 | 0.5127 |
| ROUGE-2 | 0.3536 |
| ROUGE-L | 0.4835 |
| ROUGE-Lsum | 0.4837 |
| METEOR | 0.5167 |
| BERTScore (F1) | 0.4835 |
These scores are competitive with the results reported for the state-of-the-art TURBO model.
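All of these metrics are available through Hugging Face's `evaluate` library; a sketch with dummy strings (the actual notebook may use different packages or settings):

```python
# Metric computation sketch using the `evaluate` library; preds/refs are
# dummy strings, and the real pipeline may configure these differently.
import evaluate

preds = ["the user mocks how slow the airport queue is"]
refs = ["the author mocks the slow queue at the airport"]

bleu = evaluate.load("bleu")
for n in range(1, 5):  # BLEU-1 .. BLEU-4
    score = bleu.compute(predictions=preds, references=[[r] for r in refs], max_order=n)
    print(f"BLEU-{n}:", score["bleu"])

rouge = evaluate.load("rouge")  # reports rouge1, rouge2, rougeL, rougeLsum
print(rouge.compute(predictions=preds, references=refs))

meteor = evaluate.load("meteor")
print("METEOR:", meteor.compute(predictions=preds, references=refs)["meteor"])

bertscore = evaluate.load("bertscore")
print("BERTScore F1:", bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])
```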
```
├── main.ipynb              # Main notebook
├── bart_gen_epochN.pt      # Saved BART model checkpoints (one per epoch)
├── shared_fusion_epochN.pt # Saved fusion module checkpoints (one per epoch)
├── MORE-PLUS-DATASET/      # Folder for the .tsv, .pkl, and image files
├── test_predictions.tsv    # Generated sarcasm explanations
└── README.md               # This file
```
Feel free to fork and improve!