lm_eval/tasks/README.md (3 additions, 2 deletions)
@@ -12,6 +12,7 @@ provided to the individual README.md files for each subfolder.
 |[acp_bench_hard](acpbench/README.md)| Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
 |[aexams](aexams/README.md)| Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
 |[agieval](agieval/README.md)| Tasks involving historical data or questions related to history and historical texts. | English, Chinese |
+|[aime](aime/README.md)| High school math competition questions | English |
 |[anli](anli/README.md)| Adversarial natural language inference tasks designed to test model robustness. | English |
 |[arabic_leaderboard_complete](arabic_leaderboard_complete/README.md)| A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
 |[arabic_leaderboard_light](arabic_leaderboard_light/README.md)| A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
@@ -30,7 +31,7 @@ provided to the individual README.md files for each subfolder.
 |[belebele](belebele/README.md)| Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
 | benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. ||
 |[bertaqa](bertaqa/README.md)| Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
 |[bigbench](bigbench/README.md)| Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
 |[blimp](blimp/README.md)| Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
 |[blimp_nl](blimp_nl/README.md)| A benchmark evaluating language models' grammatical capabilities in Dutch based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Dutch |
@@ -78,7 +79,7 @@ provided to the individual README.md files for each subfolder.
 |[histoires_morales](histoires_morales/README.md)| A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
 |[hrm8k](hrm8k/README.md)| A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
 |[humaneval](humaneval/README.md)| Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python |
-|[icelandic_winogrande](icelandic_winogrande/README.md)| Manually translated and localized version of the [WinoGrande](winogrande/README.md) commonsense reasoning benchmark for Icelandic. | Icelandic|
+|[icelandic_winogrande](icelandic_winogrande/README.md)| Manually translated and localized version of the [WinoGrande](winogrande/README.md) commonsense reasoning benchmark for Icelandic. | Icelandic |
 |[ifeval](ifeval/README.md)| Interactive fiction evaluation tasks for narrative understanding and reasoning. | English |
 |[inverse_scaling](inverse_scaling/README.md)| Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
 |[japanese_leaderboard](japanese_leaderboard/README.md)| Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
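For context, each task name in the table above is the identifier passed to the evaluation harness. Below is a minimal sketch of running the newly added `aime` task through the Python API, assuming a recent lm-evaluation-harness release where `lm_eval.simple_evaluate` accepts a registered model name plus a `model_args` string; the checkpoint shown is an arbitrary placeholder, not part of this change.

```python
import lm_eval

# Evaluate the newly listed "aime" task; the task name comes straight
# from the table above. The model is a small placeholder -- any
# Hugging Face causal LM identifier should work here.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["aime"],
    batch_size=8,
)

# Per-task metrics are collected under results["results"].
print(results["results"])
```

The same task name works with the CLI entry point (`lm_eval --tasks aime ...`), which is why rows in this table double as the harness's task registry.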