    Repositories list

    • An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.
      Jupyter Notebook
      3300 Updated Jan 27, 2026
    • SAFARI

      Public
      The dataset consists of stereotypes collected in four Sub-Saharan African countries for the purpose of AI model evaluations.
      0000 Updated Jan 24, 2026
    • The dataset consists of globally situated cultural artifacts, covering 29 countries and many key aspects of culture.
      0000 Updated Jan 22, 2026
    • artydiqa

      Public
      ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA …
      0000 Updated Dec 18, 2025
    • This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. It's…
      HTML
      Creative Commons Attribution Share Alike 4.0 International
      1310 Updated Dec 17, 2025
    • MGSM-Rev2

      Public
      To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all que…
      0000 Updated Nov 10, 2025
    • Other
      0510 Updated Oct 13, 2025
    • The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.
      Apache License 2.0
      0200 Updated Aug 6, 2025
    • Jupyter Notebook
      Apache License 2.0
      0000 Updated Jul 30, 2025
    • Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
      Other
      1841760 Updated Jul 14, 2025
    • Python
      Apache License 2.0
      14853 Updated Jun 27, 2025
    • Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
      MIT License
      12010 Updated Jun 16, 2025
    • CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding confli…
      Apache License 2.0
      11210 Updated Jun 11, 2025
    • egotempo

      Public
      Jupyter Notebook
      Creative Commons Attribution 4.0 International
      02630 Updated Apr 26, 2025
    • Images gathered from the Internet in 2023 and some metadata
      HTML
      Other
      1300 Updated Mar 19, 2025
    • screen_qa

      Public
      The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs c…
      Python
      Creative Commons Attribution 4.0 International
      913940 Updated Feb 7, 2025
    • This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and va…
      Creative Commons Attribution 4.0 International
      42500 Updated Feb 3, 2025
    • cube

      Public
      CUBE is a benchmark to evaluate the Cultural Competence of T2I models
      Creative Commons Attribution 4.0 International
      1830 Updated Jan 20, 2025
    • Jupyter Notebook
      Apache License 2.0
      176630 Updated Jan 17, 2025
    • hiertext

      Public
      The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word, line and paragraph level annotati…
      Jupyter Notebook
      Creative Commons Attribution Share Alike 4.0 International
      2930601 Updated Dec 2, 2024
    • scin

      Public
      The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-report…
      Jupyter Notebook
      Other
      1915520 Updated Nov 23, 2024
    • MISeD

      Public
      MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transc…
      31400 Updated Nov 20, 2024
    • uicrit

      Public archive
      UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for…
      02610 Updated Nov 19, 2024
    • WordGraph

      Public
      The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon …
      Creative Commons Zero v1.0 Universal
      1100 Updated Nov 7, 2024
    • Dataset of conversations, generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with speci…
      63310 Updated Oct 29, 2024
    • GeniL

      Public
      The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, …
      Creative Commons Attribution 4.0 International
      0300 Updated Oct 18, 2024
    • tap-typing-with-touch-sensing-images

      Public archive
      The Tap Typing with Touch Sensing Images (TSI) dataset contains data on user taps on a mobile touchscreen keyboard, including elliptical features and capacitive…
      Creative Commons Attribution 4.0 International
      1200 Updated Oct 15, 2024
    • mittens

      Public
      Datasets for measuring misgendering in translation
      Other
      0500 Updated Oct 4, 2024
    • wit

      Public archive
      WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ language…
      Other
      461.1k10 Updated Sep 27, 2024
    • C4_200M-synthetic-dataset-for-grammatical-error-correction

      Public
      This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged cor…
      Python
      Creative Commons Attribution 4.0 International
      2316200 Updated Sep 24, 2024