Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset

Paper · License: CC BY-NC-SA 4.0 · Hugging Face Collection

Concept-pedia Overview

This repository contains the benchmark introduced in the paper: "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset" by Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli.

Concept-pedia is a large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts grounded in Wikipedia. Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories.


🔥 News

  • [2025-10-28] "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset" accepted at EMNLP 2025!
  • [2025-10-28] Released Concept-pedia dataset and Concept-10k benchmark
  • [2025-10-28] Released SigLIP models fine-tuned on Concept-pedia on Hugging Face


Installation

Install the required dependencies:

pip install torch torchvision transformers datasets pillow

Requirements:

  • Python 3.10+
  • PyTorch 2.0+
  • Transformers 4.30+
  • Datasets 2.0+

Concept-10k Benchmark

We introduce Concept-10k, a gold-standard benchmark for Visual Concept Recognition that covers 9,837 concepts, spanning a wider range of semantic categories than current benchmarks. Please refer to Section 4 of the paper for more details.

To download the raw images in Concept-10k, use the `sapienzanlp/Concept-10k-imgs` dataset shown below.

Loading the Dataset

You can easily load Concept-10k using the Hugging Face datasets library:

from datasets import load_dataset

# Load the benchmark dataset
dataset = load_dataset("sapienzanlp/Concept-10k", split="test")

# Load the images dataset
images_dataset = load_dataset("sapienzanlp/Concept-10k-imgs")

# Access a sample (split="test" was already selected above)
sample = dataset[0]
print(f"Concept: {sample['concept']}")
print(f"Image shape: {sample['image'].size}")

Concept-pedia Models

We present several VLMs that were fine-tuned using Concept-pedia:

Other VLMs may be available in the future 👀.

Model Performance

| Model | Parameters | Accuracy@1 (Concept-10k) | Download |
|---|---|---|---|
| SigLIP-base-ft | 203M | 36.1 | 🤗 Hub |
| SigLIP-large-ft | 652M | 41.9 | 🤗 Hub |
| SigLIP-so400m-ft | 400M | 45.0 | 🤗 Hub |

Zero-shot Visual Concept Recognition

You can use the Concept-pedia fine-tuned models for zero-shot Visual Concept Recognition:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the Concept-pedia fine-tuned model
model = AutoModel.from_pretrained("sapienzanlp/siglip-base-patch16-256-ft-concept-pedia")
processor = AutoProcessor.from_pretrained("sapienzanlp/siglip-base-patch16-256-ft-concept-pedia")

# Load an image
image_path = "path_to_image"
image = Image.open(image_path)

# Define candidate concepts
texts = ["a photo of a monkey", "a photo of a horse", ...]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that the image is '{texts[0]}'")

Example output:

95.3% that the image is 'a photo of a monkey'
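Note that SigLIP scores each image-text pair independently with a sigmoid, rather than normalizing across candidates with a softmax, so the probabilities need not sum to 1 and ranking candidates means sorting the per-pair sigmoid scores. A minimal sketch in plain Python, with hypothetical logit values for illustration:

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid, as applied to each image-text logit."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for two candidate prompts (not real model outputs)
candidates = ["a photo of a monkey", "a photo of a horse"]
logits = [3.0, -2.0]

# Each pair gets an independent probability; they need not sum to 1
probs = [sigmoid(l) for l in logits]
best = max(range(len(candidates)), key=lambda i: probs[i])
print(f"Top concept: {candidates[best]} ({probs[best]:.1%})")
# → Top concept: a photo of a monkey (95.3%)
```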

Evaluation

We provide a simple evaluation script in scripts/evaluate.py:

python scripts/evaluate.py \
    --model_name sapienzanlp/siglip-base-patch16-256-ft-concept-pedia \
    --dataset_name sapienzanlp/Concept-10k \
    --split test
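To sanity-check the script's output, recall that Accuracy@1 is simply the fraction of samples whose top-ranked candidate concept matches the gold concept. A minimal sketch with made-up labels (the helper name and toy data are illustrative, not part of the repository):

```python
def accuracy_at_1(predictions: list[str], gold: list[str]) -> float:
    """Fraction of samples whose top-1 predicted concept equals the gold concept."""
    assert len(predictions) == len(gold) and len(gold) > 0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: 2 of 3 predictions match the gold concept
preds = ["monkey", "horse", "monkey"]
golds = ["monkey", "horse", "zebra"]
print(f"Accuracy@1: {accuracy_at_1(preds, golds):.1%}")
# → Accuracy@1: 66.7%
```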

Contributing

We welcome contributions! Here's how you can help:

  • 🐛 Report bugs or issues
  • 💡 Suggest new features or improvements
  • 🔬 Share your results using Concept-pedia

Please open an issue on GitHub to discuss your contribution before submitting a pull request.


License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Citation

If you use Concept-pedia or Concept-10k in your work, please cite our paper:

@inproceedings{ghonim-etal-2025-concept,
    title = "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset",
    author = "Ghonim, Karim  and
      Bejgu, Andrei Stefan  and
      Fern{\'a}ndez-Castro, Alberte  and
      Navigli, Roberto",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1745/",
    pages = "34405--34426",
    ISBN = "979-8-89176-332-6",
    abstract = "Vision-language Models (VLMs), such as CLIP and SigLIP, have become the de facto standard for multimodal tasks, serving as essential building blocks for recent Multimodal Large Language Models, including LLaVA and PaliGemma. However, current evaluations for VLMs remain heavily anchored to ImageNet. In this paper, we question whether ImageNet{'}s coverage is still sufficiently challenging for modern VLMs, and investigate the impact of adding novel and varied concept categories, i.e., semantically grouped fine-grained synsets. To this end, we introduce Concept-pedia, a novel, large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts. Leveraging a language-agnostic, automatic annotation pipeline grounded in Wikipedia, Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated Visual Concept Recognition evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories. Our experiments show that current models, although excelling on ImageNet, struggle with Concept-10k. Not only do these findings highlight a persistent bias toward ImageNet-centric concepts, but they also underscore the urgent need for more representative benchmarks. By offering a broader and semantically richer testbed, Concept-10k aims to support the development of multimodal systems that better generalize to the complexities of real-world visual concepts."
}

Acknowledgments

  • This work was conducted by the Sapienza NLP Group and Babelscape.

  • We gratefully acknowledge the CREATIVE project (CRoss-modal understanding and gEnerATIon of Visual and tExtual content), funded by the MUR Progetti di Ricerca di Rilevante Interesse Nazionale programme (PRIN 2020).

  • We also gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.

Contact

For questions or collaborations, please contact: ghonim@diag.uniroma1.it

