A Comparative Study of Vision Language Models for Italian Cultural Heritage

Chiara Vitaloni *, Dasara Shullani, Daniele Baracchi

Manuscript under review in Heritage, Special Issue: AI and the Future of Cultural Heritage.

Abstract

Human communication has long relied on images for both active and passive interaction. For several decades, electronic devices equipped with a screen have been used to search for and obtain visual data. For a long time, however, the flow of visual information was unidirectional, as input queries had to be text fragments. Advances in human-computer interaction technologies then made it possible to query search engines such as Google with visual data (a technique known as “reverse image search”). More recently, technologies such as large language models have brought these two approaches together, allowing a single query to combine textual questions and images. The scientific community has begun to explore these tools for cultural heritage applications such as searching for information on artworks. In this context, this paper investigates the use of a wide range of Vision-Language Models (VLMs), including Bing’s search engine with GPT-4 and open models like Qwen2-VL and Pixtral, for cultural heritage visual question answering. To do so, twenty subjects were chosen to represent well-known Italian landmarks (e.g. the Colosseum, Milan Cathedral, Michelangelo’s David in Florence). For each subject, two pictures were selected: one from Wikipedia and one from either a scientific database or a private collection of pictures. These images were given as input to each VLM alongside textual queries about their content. We studied the quality of the responses in terms of their completeness, assessing the impact of different levels of detail in the queries. Additionally, we evaluated the impact of the query language (English or Italian) on each system’s ability to provide satisfactory answers.

Keywords: visual question answering; cultural heritage; artificial intelligence; ChatGPT; human-centered approaches

Folder Organization

dataset contains all the images used in the analysis
results contains the responses in ITA/ENG provided by each open VLM
dataset-eval contains the responses and the evaluation
final-results contains the evaluation of all algorithms in Italian and in English
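
As a quick sanity check after cloning, the folder layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `summarize_layout` is hypothetical, it assumes it is run from the repository root, and it makes no assumption about the file formats inside each folder.

```python
from pathlib import Path

# Folder names taken from the list above; everything else is illustrative.
EXPECTED_FOLDERS = ("dataset", "results", "dataset-eval", "final-results")


def summarize_layout(repo_root="."):
    """Map each expected folder to its recursive file count (None if the folder is missing)."""
    root = Path(repo_root)
    summary = {}
    for name in EXPECTED_FOLDERS:
        folder = root / name
        if folder.is_dir():
            # Count regular files at any depth below the folder.
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    for name, count in summarize_layout().items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{name}: {status}")
```

A folder reported as missing usually means the script was started from a different working directory or the clone is incomplete.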

dataset-eval
dataset
final-results
results
.gitattributes
README.md
analyze_results.ipynb
cogvlm2_call.py
dataset.csv
deepseek-vl_call.py
eval_results.ipynb
internvl2_call.py
llava1.6_call.py
molmo_call.py
phi3_call.py
pixtral_call.py
qwen2vl_call.py
smolvlm_call.py

IAPP-Group/VLM-Heritage
