
# Tasks

A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.

For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.
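Assuming the harness is installed (e.g. via `pip install lm-eval`), a typical session might look like the following sketch. The model name and task chosen here are illustrative examples, not recommendations:

```shell
# List all available tasks and task groupings
lm-eval --tasks list

# Evaluate a Hugging Face model on one of the tasks below
# (model and task are example choices; any entry from the table works)
lm-eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8
```

Multiple tasks can be passed as a comma-separated list to `--tasks`, and task groupings (e.g. benchmark suites) expand to all of their subtasks.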

| Task Family | Description | Language(s) |
| --- | --- | --- |
| eq-bench_es | Spanish version of EQ-Bench (EN): evaluating emotional reasoning through dialogue-based prompts (hosted on Hugging Face). | Spanish (Human Translated) |
| eq-bench_ca | Catalan version of EQ-Bench (EN): evaluating emotional reasoning through dialogue-based prompts (hosted on Hugging Face). | Catalan (Human Translated) |
| aclue | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| acp_bench | Tasks evaluating reasoning about Action, Change, and Planning. | English |
| acp_bench_hard | Harder variant of the tasks evaluating reasoning about Action, Change, and Planning. | English |
| aexams | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| agieval | Tasks drawn from human-centric standardized exams, such as college entrance and civil service exams. | English, Chinese |
| aime | High school math competition questions (AIME). | English |
| anli | Adversarial natural language inference tasks designed to test model robustness. | English |
| arabic_leaderboard_complete | Full version of the Open Arabic LLM Leaderboard tasks, evaluating Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| arabic_leaderboard_light | Light version of the Open Arabic LLM Leaderboard tasks (10% of the test-set samples of the original benchmarks), evaluating Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| arabicmmlu | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
| ArabCulture | Benchmark for evaluating models' commonsense cultural knowledge across 13 different Arab countries. | Arabic |
| AraDICE | A collection of tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic |
| arc | The AI2 Reasoning Challenge: complex reasoning over a diverse set of grade-school science questions. | English |
| arithmetic | Tasks involving numerical computations and arithmetic reasoning. | English |
| asdiv | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| babi | Question-answering tasks based on simulated stories. | English |
| babilong | Tasks designed to test whether models can find and reason over facts in long contexts. | English |
| bangla_mmlu | Benchmark for evaluating language models on Bangla (Bengali). Includes diverse NLP tasks to measure model understanding and generation capabilities in Bangla. | Bengali/Bangla |
| basque_bench | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| basqueglue | Tasks designed to evaluate language understanding in Basque. | Basque |
| bbh | BIG-Bench Hard: a suite of challenging BIG-bench tasks requiring multi-step reasoning. | English, German |
| bbq | A question-answering benchmark designed to measure social biases in language models across various demographic categories and contexts. | English |
| belebele | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
| benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. | |
| bertaqa | Basque cultural trivia QA tests in English and Basque. | English, Basque, Basque (MT) |
| bhs | Grammatical knowledge evaluation for low-resource languages. | Basque, Hindi, Swahili |
| bigbench | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| blimp | Tasks testing grammatical phenomena to evaluate language models' linguistic capabilities. | English |
| blimp_nl | A benchmark evaluating language models' grammatical capabilities in Dutch by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Dutch |
| c4 | Tasks based on a colossal, cleaned version of Common Crawl's web crawl corpus to assess models' language modeling capabilities. | English |
| cabbq | Adaptation of the BBQ benchmark to the Catalan language and stereotypes prevalent in Spain. | Catalan |
| careqa | Multiple-choice and open-ended medical question answering based on the Spanish Specialised Healthcare Training (MIR) exams. | English, Spanish |
| catalan_bench | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| ceval | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| cmmlu | Multi-subject multiple-choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
| cnn_dailymail_abisee | Task designed to measure the ability to generate multi-sentence abstractive summaries of CNN/DailyMail news articles. | English |
| commonsense_qa | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
| copal_id | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
| coqa | Conversational question answering tasks to test dialog understanding. | English |
| crows_pairs | Tasks designed to test model biases in various sociodemographic groups. | English, French |
| click | A benchmark dataset of Cultural and Linguistic Intelligence in Korean (CLIcK), comprising 1,995 QA pairs sourced from official Korean exams and textbooks to test Korean cultural and linguistic knowledge. | Korean |
| csatqa | Tasks based on the Korean CSAT and other standardized testing questions for academic assessment. | Korean |
| darija_bench | Traditional NLP tasks (translation, summarization, etc.) for Moroccan Darija. | Moroccan Darija (some MT) |
| darijahellaswag | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) |
| darijammlu | Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) |
| discrim_eval | Prompts for binary decisions covering 70 scenarios to evaluate demographic bias. | English |
| drop | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
| egyhellaswag | Egyptian Arabic (Masri) version of HellaSwag. | Egyptian Arabic (MT) |
| egymmlu | Multiple-choice QA in Egyptian Arabic. | Egyptian Arabic (MT) |
| eq_bench | Emotional intelligence benchmark evaluating emotional reasoning through dialogue-based prompts. | English |
| esbbq | Adaptation of the BBQ benchmark to the Spanish language and stereotypes prevalent in Spain. | Spanish |
| eus_exams | Tasks based on various professional and academic exams in the Basque language. | Basque |
| eus_proficiency | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
| eus_reading | Reading comprehension tasks specifically designed for the Basque language. | Basque |
| eus_trivia | Trivia and knowledge-testing tasks in the Basque language. | Basque |
| evalita_LLM | A native Italian benchmark with diverse task formats and multiple prompts. | Italian |
| fda | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| fld | Formal Logic Deduction: tasks evaluating deductive reasoning over formal logic. | English |
| french_bench | Set of tasks designed to assess language model performance in French. | French |
| galician_bench | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| global_mmlu | Collection of culturally sensitive and culturally agnostic MMLU tasks in 15 languages with human translations or post-edits. | Multiple (15 languages) |
| global_piqa | Multilingual (non-parallel) commonsense reasoning benchmark covering 116 language varieties with culturally specific examples from 65 countries. | Multiple (116 languages), Human authored |
| glue | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
| gpqa | Graduate-level, "Google-proof" multiple-choice question answering in biology, physics, and chemistry. | English |
| graphwalks | A multi-hop reasoning long-context benchmark. | English |
| gsm8k | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
| groundcocoa | A benchmark evaluating the conditional and compositional reasoning of language models using a grounding task. | English |
| haerae | Tasks focused on assessing detailed factual and historical knowledge. | Korean |
| headqa | A high-level education-based question answering dataset to test specialized knowledge. | Spanish, English |
| hellaswag | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
| hendrycks_ethics | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
| hendrycks_math | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| histoires_morales | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
| hrm8k | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
| humaneval | Code generation tasks that measure functional correctness for synthesizing programs from docstrings. | Python |
| humaneval_infilling | Code generation tasks that measure fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
| icelandic_winogrande | Manually translated and localized version of the WinoGrande commonsense reasoning benchmark for Icelandic. | Icelandic |
| ifeval | Instruction-following evaluation: tests whether models comply with verifiable natural-language instructions. | English |
| inverse_scaling | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| japanese_leaderboard | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
| jsonschema_bench | Evaluates the ability of LLMs to generate JSON objects that conform to a given JSON schema, covering APIs, configuration files, and other structured data formats. | JSON |
| kbl | Korean Benchmark for Legal Language Understanding. | Korean |
| kmmlu | Knowledge-based multi-subject multiple-choice questions for academic evaluation. | Korean |
| kobest | A collection of tasks designed to evaluate understanding of the Korean language. | Korean |
| kormedmcqa | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
| lambada | Tasks designed to predict the endings of text passages, testing language prediction skills. | English |
| lambada_cloze | Cloze-style LAMBADA dataset. | English |
| lambada_multilingual | Multilingual LAMBADA dataset. This is a legacy version; users should instead use lambada_multilingual_stablelm. | German, English, Spanish, French, Italian |
| lambada_multilingual_stablelm | Multilingual LAMBADA dataset. Users should prefer this version over lambada_multilingual. | German, English, Spanish, French, Italian, Dutch, Portuguese |
| leaderboard | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time. | English |
| lingoly | Challenging logical reasoning benchmark in low-resource languages with controls for memorization. | English, Multilingual |
| llama3 | Evals reproducing those provided by the Llama team in the Hugging Face repo (instruct). | English, Multilingual |
| libra | Evaluates long-context understanding in Russian across four complexity levels. | Russian (MT) |
| lm_syneval | Evaluates the syntactic capabilities of language models. | English |
| logiqa | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| logiqa2 | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| longbench | LongBench evaluates language models' ability to understand lengthy texts across multiple tasks and languages. | English, Chinese |
| longbenchv2 | LongBench v2, multiple-choice variant. | English, Chinese |
| mastermind | Reasoning benchmark based on the board game Mastermind. | English |
| mathqa | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| mbpp | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
| meddialog | Medical open-ended QA and question entailment stemming from the MedDialog dataset. | English |
| medtext | Medical open-ended QA from the MedText Clinical Notes dataset. | English |
| mimic_repsum | Medical report summarization from the MIMIC-III dataset. | English |
| mc_taco | Question-answer pairs that require temporal commonsense comprehension. | English |
| med_concepts_qa | Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts. | English |
| metabench | Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait. | English |
| mediqa_qa2019 | Open-ended healthcare question answering benchmark from the MEDIQA 2019 challenge. | English |
| medmcqa | Medical multiple-choice questions assessing detailed medical knowledge. | English |
| medqa | Multiple-choice question answering based on the United States Medical Licensing Examination. | English |
| meqsum | Healthcare question entailment benchmark from the MeQSum dataset. | English |
| mgsm | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| minerva_math | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| mlqa | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| mmlu | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| mmlu_redux | Refined Massive Multitask Language Understanding benchmark for broad domain evaluation with improved data quality. | English, Spanish |
| mmlu_pro | A refined version of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| mmlu-pro-plus | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| mmlu_prox | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
| mmlusr | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI safety concerns. | English |
| moral_stories | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
| mts_dialog | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
| multiblimp | MultiBLiMP, a synthetic multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability. | Multiple (101 languages), Synthetic |
| mutual | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| noreval | A human-created Norwegian language understanding and generation benchmark. | Norwegian (Bokmål and Nynorsk) |
| nq_open | Open-domain question answering tasks based on the Natural Questions dataset. | English |
| okapi/arc_multilingual | Machine-translated multilingual version of the ARC reasoning benchmark. | Multiple (31 languages), Machine Translated |
| okapi/hellaswag_multilingual | Machine-translated multilingual version of HellaSwag. | Multiple (30 languages), Machine Translated |
| okapi/mmlu_multilingual | Machine-translated multilingual version of MMLU. | Multiple (34 languages), Machine Translated |
| okapi/truthfulqa_multilingual | Machine-translated multilingual version of TruthfulQA. | Multiple (31 languages), Machine Translated |
| olaph | Open-ended medical factuality question answering from the OLAPH dataset. | English |
| openbookqa | Open-book question answering tasks that require external knowledge and reasoning. | English |
| paloma | A comprehensive benchmark designed to evaluate open language models across a wide range of domains, from niche artist communities to mental health forums on Reddit. | English |
| paws-x | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
| pile | Open-source language modelling dataset that consists of 22 smaller, high-quality datasets. | English |
| pile_10k | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| piqa | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| polemo2 | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| portuguese_bench | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| prost | Physical Reasoning about Objects Through Space and Time: tests physical commonsense about object properties. | English |
| pubmedqa | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| qa4mre | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
| qasper | Question answering dataset based on academic papers, testing in-depth scientific knowledge. | English |
| race | Reading comprehension assessment tasks based on English exams in China. | English |
| realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. | English |
| ruler | RULER is a benchmark for testing how well language models handle long pieces of text. Requires a custom arg (see readme). | English |
| sciq | Science question answering tasks to assess understanding of scientific concepts. | English |
| score | Systematic consistency and robustness evaluation for LLMs on three datasets (MMLU-Pro, AGIEval, and MATH). | English |
| scrolls | Tasks that involve long-form reading comprehension across various domains. | English |
| simple_cooccurrence_bias | A metric that evaluates language models for biases based on stereotypical word associations and co-occurrences in text. | English |
| siqa | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| spanish_bench | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| squad_completion | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| squadv2 | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| storycloze | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
| super_glue | A suite of challenging tasks designed to test a range of language understanding skills. | English |
| swag | Situations With Adversarial Generations: predicting the next event, grounded in video caption contexts. | English |
| swde | Information extraction tasks from semi-structured web pages. | English |
| tinyBenchmarks | Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. | English |
| tmmluplus | An extended set of tasks under the TMMLU framework for broader academic assessments. | Traditional Chinese |
| toxigen | Tasks designed to evaluate language models on their propensity to generate toxic content. | English |
| translation | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| triviaqa | A large-scale dataset for trivia question answering to test general knowledge. | English |
| truthfulqa | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| truthfulqa-multi | A multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
| turkishmmlu | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| turblimp_core | A benchmark evaluating language models' grammatical capabilities in Turkish by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
| unitxt | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| unscramble | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| webqs | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
| wikitext | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English |
| winogender | A diagnostic dataset that tests for gender bias in coreference resolution by measuring how models associate pronouns with different occupations. | English |
| winogrande | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English |
| wmdp | A benchmark of potentially sensitive multiple-choice knowledge questions, where the objective is to minimize performance. | English |
| wmt2016 | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
| wsc273 | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
| xcopa | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
| xnli | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
| xnli_eu | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| xquad | Cross-lingual Question Answering Dataset in multiple languages. | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese |
| xstorycloze | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| xwinograd | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| zhoblimp | A benchmark evaluating language models' grammatical capabilities in Chinese by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Chinese |

## Multimodal Tasks

| Task Family | Description | Modality |
| --- | --- | --- |
| chartqa | A benchmark for question answering about charts that requires both visual and logical reasoning. | Image, Text |
| mmmu | Evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. | Image, Text |
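Multimodal tasks are run the same way as text-only tasks, but need a vision-capable model backend. As a rough sketch, assuming a recent release that ships the `hf-multimodal` backend (the model name and task split here are illustrative assumptions):

```shell
# Evaluate a vision-language model on a multimodal task.
# hf-multimodal backend and the example model/task are assumptions;
# check `lm-eval --tasks list` and the docs for what your install supports.
lm-eval --model hf-multimodal \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf \
    --tasks mmmu_val \
    --batch_size 4
```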