
# Tasks

A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.

For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.
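Assuming the harness is installed (e.g. via `pip install lm-eval`), a typical session might look like the following sketch. The model name and task chosen here are illustrative examples, not recommendations:

```shell
# List all available tasks and task groupings
lm-eval --tasks list

# Evaluate a Hugging Face model on one of the tasks below
# (model and task are example choices; any entry from the table works)
lm-eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8
```

Multiple tasks can be passed as a comma-separated list to `--tasks`, and task groupings (e.g. benchmark suites) expand to all of their subtasks.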

| Task Family | Description | Language(s) |
| --- | --- | --- |
| eq-bench_es | Spanish version of EQ-Bench (EN): evaluating emotional reasoning through dialogue-based prompts (hosted on Hugging Face). | Spanish (Human Translated) |
| eq-bench_ca | Catalan version of EQ-Bench (EN): evaluating emotional reasoning through dialogue-based prompts (hosted on Hugging Face). | Catalan (Human Translated) |
| aclue | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| acp_bench | Tasks evaluating reasoning about Action, Change, and Planning. | English |
| acp_bench_hard | Harder variant of the tasks evaluating reasoning about Action, Change, and Planning. | English |
| aexams | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| agieval | Tasks drawn from human-centric standardized exams, such as college entrance and civil service exams. | English, Chinese |
| aime | High school math competition questions (AIME). | English |
| anli | Adversarial natural language inference tasks designed to test model robustness. | English |
| arabic_leaderboard_complete | Full version of the Open Arabic LLM Leaderboard tasks, evaluating Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| arabic_leaderboard_light | Light version of the Open Arabic LLM Leaderboard tasks (10% of the test-set samples of the original benchmarks), evaluating Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| arabicmmlu | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
| ArabCulture | Benchmark for evaluating models' commonsense cultural knowledge across 13 different Arab countries. | Arabic |
| AraDICE | A collection of tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic |
| arc | The AI2 Reasoning Challenge: complex reasoning over a diverse set of grade-school science questions. | English |
| arithmetic | Tasks involving numerical computations and arithmetic reasoning. | English |
| asdiv | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| babi | Question-answering tasks based on simulated stories. | English |
| babilong | Tasks designed to test whether models can find and reason over facts in long contexts. | English |
| bangla_mmlu | Benchmark for evaluating language models on Bangla (Bengali). Includes diverse NLP tasks to measure model understanding and generation capabilities in Bangla. | Bengali/Bangla |
| basque_bench | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| basqueglue | Tasks designed to evaluate language understanding in Basque. | Basque |
| bbh | BIG-Bench Hard: a suite of challenging BIG-bench tasks requiring multi-step reasoning. | English, German |
| bbq | A question-answering benchmark designed to measure social biases in language models across various demographic categories and contexts. | English |
| belebele | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
| benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. | |
| bertaqa | Basque cultural trivia QA tests in English and Basque. | English, Basque, Basque (MT) |
| bhs | Grammatical knowledge evaluation for low-resource languages. | Basque, Hindi, Swahili |
| bigbench | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| blimp | Tasks testing grammatical phenomena to evaluate language models' linguistic capabilities. | English |
| blimp_nl | A benchmark evaluating language models' grammatical capabilities in Dutch by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Dutch |
| c4 | Tasks based on a colossal, cleaned version of Common Crawl's web crawl corpus to assess models' language modeling capabilities. | English |
| cabbq | Adaptation of the BBQ benchmark to the Catalan language and stereotypes prevalent in Spain. | Catalan |
| careqa | Multiple-choice and open-ended medical question answering based on the Spanish Specialised Healthcare Training (MIR) exams. | English, Spanish |
| catalan_bench | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| ceval | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| cmmlu | Multi-subject multiple-choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
| cnn_dailymail_abisee | Task designed to measure the ability to generate multi-sentence abstractive summaries of CNN/DailyMail news articles. | English |
| commonsense_qa | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
| copal_id | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
| coqa | Conversational question answering tasks to test dialog understanding. | English |
| crows_pairs | Tasks designed to test model biases in various sociodemographic groups. | English, French |
| click | A benchmark dataset of Cultural and Linguistic Intelligence in Korean (CLIcK), comprising 1,995 QA pairs sourced from official Korean exams and textbooks to test Korean cultural and linguistic knowledge. | Korean |
| csatqa | Tasks based on the Korean CSAT and other standardized testing questions for academic assessment. | Korean |
| darija_bench | Traditional NLP tasks (translation, summarization, etc.) for Moroccan Darija. | Moroccan Darija (some MT) |
| darijahellaswag | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) |
| darijammlu | Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) |
| discrim_eval | Prompts for binary decisions covering 70 scenarios to evaluate demographic bias. | English |
| drop | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
| egyhellaswag | Egyptian Arabic (Masri) version of HellaSwag. | Egyptian Arabic (MT) |
| egymmlu | Multiple-choice QA in Egyptian Arabic. | Egyptian Arabic (MT) |
| eq_bench | Emotional intelligence benchmark evaluating emotional reasoning through dialogue-based prompts. | English |
| esbbq | Adaptation of the BBQ benchmark to the Spanish language and stereotypes prevalent in Spain. | Spanish |
| eus_exams | Tasks based on various professional and academic exams in the Basque language. | Basque |
| eus_proficiency | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
| eus_reading | Reading comprehension tasks specifically designed for the Basque language. | Basque |
| eus_trivia | Trivia and knowledge-testing tasks in the Basque language. | Basque |
| evalita_LLM | A native Italian benchmark with diverse task formats and multiple prompts. | Italian |
| fda | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| fld | Formal Logic Deduction: tasks evaluating deductive reasoning over formal logic. | English |
| french_bench | Set of tasks designed to assess language model performance in French. | French |
| galician_bench | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| global_mmlu | Collection of culturally sensitive and culturally agnostic MMLU tasks in 15 languages with human translations or post-edits. | Multiple (15 languages) |
| global_piqa | Multilingual (non-parallel) commonsense reasoning benchmark covering 116 language varieties with culturally specific examples from 65 countries. | Multiple (116 languages), Human authored |
| glue | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
| gpqa | Graduate-level, "Google-proof" multiple-choice question answering in biology, physics, and chemistry. | English |
| graphwalks | A multi-hop reasoning long-context benchmark. | English |
| gsm8k | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
| groundcocoa | A benchmark evaluating the conditional and compositional reasoning of language models using a grounding task. | English |
| haerae | Tasks focused on assessing detailed factual and historical knowledge. | Korean |
| headqa | A high-level education-based question answering dataset to test specialized knowledge. | Spanish, English |
| hellaswag | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
| hendrycks_ethics | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
| hendrycks_math | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| histoires_morales | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
| hrm8k | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
| humaneval | Code generation tasks that measure functional correctness for synthesizing programs from docstrings. | Python |
| humaneval_infilling | Code generation tasks that measure fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
| icelandic_winogrande | Manually translated and localized version of the WinoGrande commonsense reasoning benchmark for Icelandic. | Icelandic |
| ifeval | Instruction-following evaluation: tests whether models comply with verifiable natural-language instructions. | English |
| inverse_scaling | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| japanese_leaderboard | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
| jsonschema_bench | Evaluates the ability of LLMs to generate JSON objects that conform to a given JSON schema, covering APIs, configuration files, and other structured data formats. | JSON |
| kbl | Korean Benchmark for Legal Language Understanding. | Korean |
| kmmlu | Knowledge-based multi-subject multiple-choice questions for academic evaluation. | Korean |
| kobest | A collection of tasks designed to evaluate understanding of the Korean language. | Korean |
| kormedmcqa | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
| lambada | Tasks designed to predict the endings of text passages, testing language prediction skills. | English |
| lambada_cloze | Cloze-style LAMBADA dataset. | English |
| lambada_multilingual | Multilingual LAMBADA dataset. This is a legacy version; users should instead use lambada_multilingual_stablelm. | German, English, Spanish, French, Italian |
| lambada_multilingual_stablelm | Multilingual LAMBADA dataset. Users should prefer this version over lambada_multilingual. | German, English, Spanish, French, Italian, Dutch, Portuguese |
| leaderboard | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time. | English |
| lingoly | Challenging logical reasoning benchmark in low-resource languages with controls for memorization. | English, Multilingual |
| llama3 | Evals reproducing those provided by the Llama team in the Hugging Face repo (instruct). | English, Multilingual |
| libra | Evaluates long-context understanding in Russian across four complexity levels. | Russian (MT) |
| lm_syneval | Evaluates the syntactic capabilities of language models. | English |
| logiqa | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| logiqa2 | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| longbench | LongBench evaluates language models' ability to understand lengthy texts across multiple tasks and languages. | English, Chinese |
| longbenchv2 | LongBench v2, multiple-choice variant. | English, Chinese |
| mastermind | Reasoning benchmark based on the board game Mastermind. | English |
| mathqa | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| mbpp | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
| meddialog | Medical open-ended QA and question entailment stemming from the MedDialog dataset. | English |
| medtext | Medical open-ended QA from the MedText Clinical Notes dataset. | English |
| mimic_repsum | Medical report summarization from the MIMIC-III dataset. | English |
| mc_taco | Question-answer pairs that require temporal commonsense comprehension. | English |
| med_concepts_qa | Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts. | English |
| metabench | Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait. | English |
| mediqa_qa2019 | Open-ended healthcare question answering benchmark from the MEDIQA 2019 challenge. | English |
| medmcqa | Medical multiple-choice questions assessing detailed medical knowledge. | English |
| medqa | Multiple-choice question answering based on the United States Medical Licensing Examination. | English |
| meqsum | Healthcare question entailment benchmark from the MeQSum dataset. | English |
| mgsm | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| minerva_math | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| mlqa | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| mmlu | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| mmlu_redux | Refined Massive Multitask Language Understanding benchmark for broad domain evaluation with improved data quality. | English, Spanish |
| mmlu_pro | A refined version of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| mmlu-pro-plus | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| mmlu_prox | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
| mmlusr | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI safety concerns. | English |
| moral_stories | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
| mts_dialog | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
| multiblimp | MultiBLiMP, a synthetic multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability. | Multiple (101 languages), Synthetic |
| mutual | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| noreval | A human-created Norwegian language understanding and generation benchmark. | Norwegian (Bokmål and Nynorsk) |
| nq_open | Open-domain question answering tasks based on the Natural Questions dataset. | English |
| okapi/arc_multilingual | Machine-translated multilingual version of the ARC reasoning benchmark. | Multiple (31 languages), Machine Translated |
| okapi/hellaswag_multilingual | Machine-translated multilingual version of HellaSwag. | Multiple (30 languages), Machine Translated |
| okapi/mmlu_multilingual | Machine-translated multilingual version of MMLU. | Multiple (34 languages), Machine Translated |
| okapi/truthfulqa_multilingual | Machine-translated multilingual version of TruthfulQA. | Multiple (31 languages), Machine Translated |
| olaph | Open-ended medical factuality question answering from the OLAPH dataset. | English |
| openbookqa | Open-book question answering tasks that require external knowledge and reasoning. | English |
| paloma | A comprehensive benchmark designed to evaluate open language models across a wide range of domains, from niche artist communities to mental health forums on Reddit. | English |
| paws-x | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
| pile | Open-source language modelling dataset that consists of 22 smaller, high-quality datasets. | English |
| pile_10k | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| piqa | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| polemo2 | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| portuguese_bench | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| prost | Physical Reasoning about Objects Through Space and Time: tests physical commonsense about object properties. | English |
| pubmedqa | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| qa4mre | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
| qasper | Question answering dataset based on academic papers, testing in-depth scientific knowledge. | English |
| race | Reading comprehension assessment tasks based on English exams in China. | English |
| realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. | English |
| ruler | RULER is a benchmark for testing how well language models handle long pieces of text. Requires a custom arg (see readme). | English |
| sciq | Science question answering tasks to assess understanding of scientific concepts. | English |
| score | Systematic consistency and robustness evaluation for LLMs on three datasets (MMLU-Pro, AGIEval, and MATH). | English |
| scrolls | Tasks that involve long-form reading comprehension across various domains. | English |
| simple_cooccurrence_bias | A metric that evaluates language models for biases based on stereotypical word associations and co-occurrences in text. | English |
| siqa | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| spanish_bench | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| squad_completion | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| squadv2 | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| storycloze | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
| super_glue | A suite of challenging tasks designed to test a range of language understanding skills. | English |
| swag | Situations With Adversarial Generations: predicting the next event, grounded in video caption contexts. | English |
| swde | Information extraction tasks from semi-structured web pages. | English |
| tinyBenchmarks | Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. | English |
| tmmluplus | An extended set of tasks under the TMMLU framework for broader academic assessments. | Traditional Chinese |
| toxigen | Tasks designed to evaluate language models on their propensity to generate toxic content. | English |
| translation | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| triviaqa | A large-scale dataset for trivia question answering to test general knowledge. | English |
| truthfulqa | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| truthfulqa-multi | A multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
| turkishmmlu | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| turblimp_core | A benchmark evaluating language models' grammatical capabilities in Turkish by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
| unitxt | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| unscramble | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| webqs | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
| wikitext | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English |
| winogender | A diagnostic dataset that tests for gender bias in coreference resolution by measuring how models associate pronouns with different occupations. | English |
| winogrande | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English |
| wmdp | A benchmark of potentially sensitive multiple-choice knowledge questions, where the objective is to minimize performance. | English |
| wmt2016 | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
| wsc273 | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
| xcopa | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
| xnli | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
| xnli_eu | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| xquad | Cross-lingual Question Answering Dataset in multiple languages. | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese |
| xstorycloze | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| xwinograd | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| zhoblimp | A benchmark evaluating language models' grammatical capabilities in Chinese by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Chinese |

## Multimodal Tasks

| Task Family | Description | Modality |
| --- | --- | --- |
| chartqa | A benchmark for question answering about charts that requires both visual and logical reasoning. | Image, Text |
| mmmu | Evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. | Image, Text |
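Multimodal tasks are run the same way as text-only tasks, but need a vision-capable model backend. As a rough sketch, assuming a recent release that ships the `hf-multimodal` backend (the model name and task split here are illustrative assumptions):

```shell
# Evaluate a vision-language model on a multimodal task.
# hf-multimodal backend and the example model/task are assumptions;
# check `lm-eval --tasks list` and the docs for what your install supports.
lm-eval --model hf-multimodal \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf \
    --tasks mmmu_val \
    --batch_size 4
```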