Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset

Paper · License: CC BY-NC-SA 4.0 · Hugging Face Collection

Concept-pedia Overview

This repository contains the benchmark introduced in the paper: "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset" by Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli.

Concept-pedia is a large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts grounded in Wikipedia. Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories.


🔥 News

  • [2025-10-28] "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset" accepted at EMNLP 2025!
  • [2025-10-28] Released Concept-pedia dataset and Concept-10k benchmark
  • [2025-10-28] Released SigLIP models fine-tuned on Concept-pedia on Hugging Face


Installation

Install the required dependencies:

pip install torch torchvision transformers datasets pillow

Requirements:

  • Python 3.10+
  • PyTorch 2.0+
  • Transformers 4.30+
  • Datasets 2.0+

Concept-10k Benchmark

We introduce Concept-10k, a gold-standard benchmark for Visual Concept Recognition that covers 9,837 concepts, spanning a wider range of semantic categories than current benchmarks. Please refer to Section 4 of the paper for more details.

To download the raw images in Concept-10k, use the `sapienzanlp/Concept-10k-imgs` dataset shown below.

Loading the Dataset

You can easily load Concept-10k using the Hugging Face datasets library:

from datasets import load_dataset

# Load the benchmark dataset
dataset = load_dataset("sapienzanlp/Concept-10k", split="test")

# Load the images dataset
images_dataset = load_dataset("sapienzanlp/Concept-10k-imgs")

# Access a sample (split="test" was already selected above)
sample = dataset[0]
print(f"Concept: {sample['concept']}")
print(f"Image shape: {sample['image'].size}")

Concept-pedia Models

We present several VLMs that were fine-tuned using Concept-pedia:

Other VLMs may be available in the future 👀.

Model Performance

| Model | Parameters | Accuracy@1 (Concept-10k) | Download |
|---|---|---|---|
| SigLIP-base-ft | 203M | 36.1 | 🤗 Hub |
| SigLIP-large-ft | 652M | 41.9 | 🤗 Hub |
| SigLIP-so400m-ft | 400M | 45.0 | 🤗 Hub |

Zero-shot Visual Concept Recognition

You can use the Concept-pedia fine-tuned models for zero-shot Visual Concept Recognition:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the Concept-pedia fine-tuned model
model = AutoModel.from_pretrained("sapienzanlp/siglip-base-patch16-256-ft-concept-pedia")
processor = AutoProcessor.from_pretrained("sapienzanlp/siglip-base-patch16-256-ft-concept-pedia")

# Load an image
image_path = "path_to_image"
image = Image.open(image_path)

# Define candidate concepts
texts = ["a photo of a monkey", "a photo of a horse", ...]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that the image is '{texts[0]}'")

Example output:

95.3% that the image is 'a photo of a monkey'
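Note that SigLIP scores each image-text pair independently with a sigmoid, rather than normalizing across candidates with a softmax, so the probabilities need not sum to 1 and ranking candidates means sorting the per-pair sigmoid scores. A minimal sketch in plain Python, with hypothetical logit values for illustration:

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid, as applied to each image-text logit."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for two candidate prompts (not real model outputs)
candidates = ["a photo of a monkey", "a photo of a horse"]
logits = [3.0, -2.0]

# Each pair gets an independent probability; they need not sum to 1
probs = [sigmoid(l) for l in logits]
best = max(range(len(candidates)), key=lambda i: probs[i])
print(f"Top concept: {candidates[best]} ({probs[best]:.1%})")
# → Top concept: a photo of a monkey (95.3%)
```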

Evaluation

We provide a simple evaluation script in scripts/evaluate.py:

python scripts/evaluate.py \
    --model_name sapienzanlp/siglip-base-patch16-256-ft-concept-pedia \
    --dataset_name sapienzanlp/Concept-10k \
    --split test
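To sanity-check the script's output, recall that Accuracy@1 is simply the fraction of samples whose top-ranked candidate concept matches the gold concept. A minimal sketch with made-up labels (the helper name and toy data are illustrative, not part of the repository):

```python
def accuracy_at_1(predictions: list[str], gold: list[str]) -> float:
    """Fraction of samples whose top-1 predicted concept equals the gold concept."""
    assert len(predictions) == len(gold) and len(gold) > 0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: 2 of 3 predictions match the gold concept
preds = ["monkey", "horse", "monkey"]
golds = ["monkey", "horse", "zebra"]
print(f"Accuracy@1: {accuracy_at_1(preds, golds):.1%}")
# → Accuracy@1: 66.7%
```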

Contributing

We welcome contributions! Here's how you can help:

  • 🐛 Report bugs or issues
  • 💡 Suggest new features or improvements
  • 🔬 Share your results using Concept-pedia

Please open an issue on GitHub to discuss your contribution before submitting a pull request.


License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Citation

If you use Concept-pedia or Concept-10k in your work, please cite our paper:

@inproceedings{ghonim-etal-2025-concept,
    title = "Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset",
    author = "Ghonim, Karim  and
      Bejgu, Andrei Stefan  and
      Fern{\'a}ndez-Castro, Alberte  and
      Navigli, Roberto",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1745/",
    pages = "34405--34426",
    ISBN = "979-8-89176-332-6",
    abstract = "Vision-language Models (VLMs), such as CLIP and SigLIP, have become the de facto standard for multimodal tasks, serving as essential building blocks for recent Multimodal Large Language Models, including LLaVA and PaliGemma. However, current evaluations for VLMs remain heavily anchored to ImageNet. In this paper, we question whether ImageNet{'}s coverage is still sufficiently challenging for modern VLMs, and investigate the impact of adding novel and varied concept categories, i.e., semantically grouped fine-grained synsets. To this end, we introduce Concept-pedia, a novel, large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts. Leveraging a language-agnostic, automatic annotation pipeline grounded in Wikipedia, Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated Visual Concept Recognition evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories. Our experiments show that current models, although excelling on ImageNet, struggle with Concept-10k. Not only do these findings highlight a persistent bias toward ImageNet-centric concepts, but they also underscore the urgent need for more representative benchmarks. By offering a broader and semantically richer testbed, Concept-10k aims to support the development of multimodal systems that better generalize to the complexities of real-world visual concepts."
}

Acknowledgments

  • This work was conducted by the Sapienza NLP Group and Babelscape.

  • We gratefully acknowledge the CREATIVE project (CRoss-modal understanding and gEnerATIon of Visual and tExtual content), funded by the MUR Progetti di Ricerca di Rilevante Interesse Nazionale programme (PRIN 2020).

  • We also gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.

Contact

For questions or collaborations, please contact: ghonim@diag.uniroma1.it

