google-research-datasets/wit-retrieval

Entity Image and Mixed-Modal Image Retrieval Datasets

The Entity Image (EI) dataset comprises a curated collection of canonical images representing entities, sourced from Wikimedia Commons. For each of these entities, we identified and extracted a single, representative canonical image from its corresponding Wikipedia page. This image serves as a visual identifier for the entity.

The EI dataset comprises 1.80M entities, each associated with one canonical image.

The Mixed-Modal Image Retrieval (MMIR) dataset is built on the Wikipedia Image Text (WIT) dataset, a curated collection of 37 million image-text pairs that features 11 million distinct images and spans more than 100 languages.

The MMIR dataset contains over 9M examples spanning more than 100 languages, partitioned into train, validation, and test splits. To ensure consistency, we kept the original WIT data splits: each MMIR split was derived from the corresponding WIT split through the entity annotation process.
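As a rough illustration of split-preserving derivation, the sketch below annotates toy records and keeps each one in the split it came from. The field names (`split`, `image_url`, `text`), the substring-based `annotate_entities` helper, and the rule of dropping examples with no matched entity are all assumptions made for this example. They do not reflect the actual WIT or MMIR schema or annotation pipeline.

```python
from collections import defaultdict

# Hypothetical WIT-style records; field names are illustrative assumptions,
# not the real WIT or MMIR schema.
wit_examples = [
    {"split": "train", "image_url": "img_a", "text": "Paris is the capital of France."},
    {"split": "val", "image_url": "img_b", "text": "Mount Fuji is in Japan."},
    {"split": "test", "image_url": "img_c", "text": "No known entity here."},
]


def annotate_entities(text, known_entities):
    """Toy stand-in for entity annotation: substring match against a known list."""
    return [entity for entity in known_entities if entity in text]


def derive_splits(examples, known_entities):
    """Annotate each example and keep it in the split it originally belonged to."""
    derived = defaultdict(list)
    for example in examples:
        entities = annotate_entities(example["text"], known_entities)
        # Assumption: examples with no matched entity are dropped.
        if entities:
            derived[example["split"]].append({**example, "entities": entities})
    return dict(derived)


mmir_like = derive_splits(wit_examples, ["Paris", "France", "Mount Fuji", "Japan"])
```

Because every derived record is appended under its source record's split key, no example can move between train, validation, and test during annotation.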

You can learn more about the datasets from our arXiv paper.

Download

We believe that such a diverse dataset will help researchers build better multimodal retrieval models and identify better learning and representation techniques, leading to improved machine learning models for real-world tasks over visio-linguistic data.

The datasets are now available for download. Please check the data page.

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Contact

For any questions, please contact [email protected].
