Google Research Datasets

174 repositories

Amplify_SSA
Public
An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.
Jupyter Notebook
•3•3•0•0•Updated Jan 27, 2026
SAFARI
Public
The dataset consists of stereotypes collected in four Sub-Saharan African countries for the purpose of AI model evaluation.
0•0•0•0•Updated Jan 24, 2026
SCALE-Cultural-Data
Public
The dataset consists of globally situated cultural artifacts, covering 29 countries and many key aspects of culture.
0•0•0•0•Updated Jan 22, 2026
artydiqa
Public
ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA …
0•0•0•0•Updated Dec 18, 2025
ssa-ai-terminologies
Public
This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. It's…
HTML
•
Creative Commons Attribution Share Alike 4.0 International
•1•3•2•0•Updated Dec 17, 2025
MGSM-Rev2
Public
To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all que…
0•0•0•0•Updated Nov 10, 2025
wit-retrieval
Public
Other
•0•5•1•0•Updated Oct 13, 2025
cultural_familiarity_annotations
Public
The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.
Apache License 2.0
•0•2•0•0•Updated Aug 6, 2025
tydiqa-wana
Public
Jupyter Notebook
•
Apache License 2.0
•0•0•0•0•Updated Jul 30, 2025
conceptual-12m
Public
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
vision-and-language pre-training multimodal-dataset
Other
•18•418•6•0•Updated Jul 14, 2025
sanpo_dataset
Public
Python
•
Apache License 2.0
•1•48•5•3•Updated Jun 27, 2025
common-crawl-domain-names
Public
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
MIT License
•1•20•1•0•Updated Jun 16, 2025
rag_conflicts
Public
CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding confli…
Apache License 2.0
•1•12•1•0•Updated Jun 11, 2025
egotempo
Public
Jupyter Notebook
•
Creative Commons Attribution 4.0 International
•0•26•3•0•Updated Apr 26, 2025
web-images
Public
Images gathered from the Internet in 2023, with some accompanying metadata.
HTML
•
Other
•1•3•0•0•Updated Mar 19, 2025
screen_qa
Public
The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs c…
Python
•
Creative Commons Attribution 4.0 International
•9•139•4•0•Updated Feb 7, 2025
adversarial-nibbler
Public
This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and va…
Creative Commons Attribution 4.0 International
•4•25•0•0•Updated Feb 3, 2025
cube
Public
CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.
Creative Commons Attribution 4.0 International
•1•8•3•0•Updated Jan 20, 2025
global_streamflow_model_paper
Public
Jupyter Notebook
•
Apache License 2.0
•17•66•4•0•Updated Jan 17, 2025
hiertext
Public
The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word, line and paragraph level annotati…
Jupyter Notebook
•
Creative Commons Attribution Share Alike 4.0 International
•29•306•0•1•Updated Dec 2, 2024
scin
Public
The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-report…
Jupyter Notebook
•
Other
•19•155•2•0•Updated Nov 23, 2024
MISeD
Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transc…
3•14•0•0•Updated Nov 20, 2024
uicrit
Public archive
UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for…
0•26•1•0•Updated Nov 19, 2024
WordGraph
Public
The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon …
Creative Commons Zero v1.0 Universal
•1•1•0•0•Updated Nov 7, 2024
Education-Dialogue-Dataset
Public archive
Dataset of conversations, generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with speci…
6•33•1•0•Updated Oct 29, 2024
GeniL
Public
The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, …
Creative Commons Attribution 4.0 International
•0•3•0•0•Updated Oct 18, 2024
tap-typing-with-touch-sensing-images
Public archive
The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive…
Creative Commons Attribution 4.0 International
•1•2•0•0•Updated Oct 15, 2024
mittens
Public
Datasets for measuring misgendering in translation
Other
•0•5•0•0•Updated Oct 4, 2024
wit
Public archive
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ language…
multilingual nlp machine-learning wikipedia multimodal cc-by-sa-3
Other
•46•1.1k•1•0•Updated Sep 27, 2024
C4_200M-synthetic-dataset-for-grammatical-error-correction
Public
This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged cor…
Python
•
Creative Commons Attribution 4.0 International
•23•162•0•0•Updated Sep 24, 2024