
Awesome Image Captioning Evaluation

This repository contains a curated list of research papers and resources focusing on image captioning evaluation.

❗ Latest Update: 4 August 2025. ❗ This repo is a work in progress. New updates coming soon, stay tuned!! 🚧

👩‍💻 🔜 Code for Reproducing Metric Scores

We leverage publicly available code and provide a unified framework that reproduces all metrics within a single repository.

Environment Setup

Clone the repository and create your environment using the requirements.txt file.

🔍 Note: The requirements.txt file is included for convenience. The environment setup builds upon the Polos repository, which we extended with the dependencies needed by our own evaluation framework. For full dependency details and background, please also refer to the original Polos repository.
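For reference, a minimal setup might look like the following (the virtual environment name `.venv` is just an example):

```bash
# clone the repository and enter it
git clone https://github.com/aimagelab/awesome-captioning-evaluation.git
cd awesome-captioning-evaluation

# create and activate a virtual environment (name is arbitrary)
python -m venv .venv
source .venv/bin/activate

# install the dependencies pinned in requirements.txt
pip install -r requirements.txt
```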

Loading Models

Model checkpoints for the various backbones used in this project can be downloaded from their respective official repositories. After downloading, place all checkpoints inside the checkpoints/ directory at the root of this repository.

Checkpoint paths are managed via a configuration file (config/model_paths.json), allowing you to define custom locations for each model. Each entry must follow the format <metric_name>_<clip_model>; see the sketch below.
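As an illustrative sketch (the metric names, checkpoint filenames, and extensions below are hypothetical; adapt them to the checkpoints you actually download):

```bash
mkdir -p checkpoints

# hypothetical config/model_paths.json: keys follow <metric_name>_<clip_model>,
# values point at the downloaded checkpoints (filenames are examples only)
cat > config/model_paths.json << 'EOF'
{
  "pac-score_ViT-B-32": "checkpoints/pac-score_ViT-B-32.pth",
  "polos_ViT-B-32": "checkpoints/polos_ViT-B-32.ckpt"
}
EOF
```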

Compute Metrics

To run the evaluation, simply execute python -u compute_all_metric.py.

You can specify which metrics to compute using the --compute_metric_type argument. Available options include: ['standard', 'clip-score', 'pac-score', 'pac-score++', 'polos'].

🔍 Note: The corresponding reference-based version of each metric (e.g., RefPAC for PAC) will always be computed automatically.

The default backbone is the CLIP ViT-B-32 model. To use a different backbone (e.g., the OpenCLIP ViT-L/14 backbone), add --clip_model open_clip_ViT-L/14 to the command, as shown below.
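For example (the metric choice is illustrative; both flags are described above):

```bash
# default backbone (CLIP ViT-B-32); RefPAC-S is computed automatically alongside PAC-S
python -u compute_all_metric.py --compute_metric_type pac-score

# same metric with the OpenCLIP ViT-L/14 backbone
python -u compute_all_metric.py --compute_metric_type pac-score --clip_model open_clip_ViT-L/14
```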

We provide a set of generated captions from various image captioning models in the test_captions/ directory. These captions are evaluated on the COCO test split (5K samples).

Additionally, we include the corresponding reference captions in test_captions/reference_captions.json.

🔍 Note: To run evaluations involving image features (e.g., CLIP-based metrics), you will need the actual COCO images. Please download the COCO val2014 image set from the official COCO dataset site and ensure it's accessible during evaluation.
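For instance, using the official COCO download URL (the target directory here is an assumption; point the evaluation code at wherever you extract the images):

```bash
# download and extract the COCO val2014 images (~6 GB)
wget http://images.cocodataset.org/zips/val2014.zip
unzip -q val2014.zip -d coco_images/
```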

🔥🔥 Our Survey

Image Captioning Evaluation in the Age of Multimodal LLMs:
Challenges and Future Perspectives

Authors: Sara Sarto, Marcella Cornia, Rita Cucchiara


Please cite with the following BibTeX:

@inproceedings{sarto2025image,
  title={{Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives}},
  author={Sarto, Sara and Cornia, Marcella and Cucchiara, Rita},
  booktitle={arxiv},
  year={2025}
}

📚 Table of Contents

  • The Evolution of Captioning Metrics

    • Rule-based Metrics
      Year Conference / Journal Title Authors Links
      2002 ACL BLEU: A method for automatic evaluation of machine translation Kishore Papineni et al. 📜 Paper
      2004 ACLW ROUGE: A package for automatic evaluation of summaries Chin-Yew Lin 📜 Paper
      2005 ACLW METEOR: An automatic metric for MT evaluation with improved correlation with human judgments Satanjeev Banerjee et al. 📜 Paper
      2015 CVPR CIDEr: Consensus-based Image Description Evaluation Ramakrishna Vedantam et al. 📜 Paper
      2016 ECCV SPICE: Semantic Propositional Image Caption Evaluation Peter Anderson et al. 📜 Paper
    • Learnable Metrics

      • Unsupervised Metrics
        Year Conference / Journal Title Authors Links
        2019 EMNLP TIGEr: Text-to-Image Grounding for Image Caption Evaluation Ming Jiang et al. 📜 Paper
        2020 ICLR BERTScore: Evaluating Text Generation with BERT Tianyi Zhang et al. 📜 Paper
        2020 ACL Improving Image Captioning Evaluation by Considering Inter References Variance Yanzhi Yi et al. 📜 Paper
        2020 EMNLP ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT Hwanhee Lee et al. 📜 Paper
        2021 EMNLP CLIPScore: A Reference-free Evaluation Metric for Image Captioning Jack Hessel et al. 📜 Paper
        2021 CVPR FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation Sijin Wang et al. 📜 Paper
        2021 ACL UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning Hwanhee Lee et al. 📜 Paper
        2022 NeurIPS Mutual Information Divergence: A Unified Metric for Multimodal Generative Models Jin-Hwa Kim et al. 📜 Paper
        2023 CVPR PAC-S: Improving CLIP for Image Caption Evaluation via Positive Augmentations Sara Sarto et al. 📜 Paper
      • Supervised Metrics
        Year Conference / Journal Title Authors Links
        2024 CVPR Polos: Multimodal Metric Learning from Human Feedback for Image Captioning Yuiga Wada et al. 📜 Paper
        2024 ACCV DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning Kazuki Matsuda et al. 📜 Paper
      • Fine-grained Oriented Metrics
        Year Conference / Journal Title Authors Links
        2023 ACL InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation Anwen Hu et al. 📜 Paper
        2024 ACM MM HICEScore: A Hierarchical Metric for Image Captioning Evaluation Zequn Zeng et al. 📜 Paper
        2024 ECCV BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues Sara Sarto et al. 📜 Paper
        2024 ECCV HiFi-Score: Fine-Grained Image Description Evaluation with Hierarchical Parsing Graphs Ziwei Yao et al. 📜 Paper
    • LLM-based Metrics
      Year Conference / Journal Title Authors Links
      2023 EMNLP CLAIR: Evaluating Image Captions with Large Language Models David Chan et al. 📜 Paper
      2024 ACL FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model Yebin Lee et al. 📜 Paper
    • Hallucination-based Metrics
      Year Conference / Journal Title Authors Links
      2018 EMNLP Object Hallucination in Image Captioning Anna Rohrbach et al. 📜 Paper
      2024 NAACL ALOHa: A New Measure for Hallucination in Captioning Models Suzanne Petryk et al. 📜 Paper

How to Contribute 🚀

  1. Fork this repository and clone it locally.
  2. Create a new branch for your changes: git checkout -b feature-name.
  3. Make your changes and commit them: git commit -m 'Description of the changes'.
  4. Push to your fork: git push origin feature-name.
  5. Open a pull request on the original repository with a description of your changes.

This project is under constant development, and we welcome contributions that add the latest research papers in the field, as well as reports of any issues 💥.

About

[IJCAI 2025] Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
