Commit fec9dde

luiscosio and CT-6282 authored
feat: Add mmlu-redux and its Spanish translation as generative task definitions (#2705)
* Added benchmark
* Added more testing
* Added task definition for mmlu_redux and mmlu_redux_spanish
* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
* Add remaining MMLU Redux YAMLs and updated tasks README
* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
* Add MMLU Redux changes from pr-2705
* Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names
* Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes
* Revert python test changes and comments one task group to avoid Hugging Face rate limit and task failure

---------

Co-authored-by: CT-6282 <[email protected]>
1 parent 368275f commit fec9dde

122 files changed, +1124 -6 lines changed


lm_eval/tasks/README.md

Lines changed: 8 additions & 6 deletions
Large diffs are not rendered by default.
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Task-name

### Paper

Title: `Are We Done with MMLU?`

Abstract: `https://arxiv.org/pdf/2406.04127`

`The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more, in Spanish`

Homepage: `https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0`

### Citation

```
BibTeX
@misc{edinburgh2024mmlu,
      title={Are We Done with MMLU?},
      author={Aryo Pradipta Gema and Joshua Ong Jun Leang and Giwon Hong and Alessio Devoto and
      Alberto Carlo Maria Mancino and Rohit Saxena and Xuanli He and Yu Zhao and Xiaotang Du and
      MohammadRezaGhasemi Madani and Claire Barale and Robert McHardy and Joshua Harris and
      Jean Kaddour and Emile van Krieken and Pasquale Minervini},
      year={2025},
      eprint={2406.04127},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Groups, Tags, and Tasks

#### Groups

- `stem`
- `other`
- `social sciences`
- `humanities`

#### Tasks

- `mmlu_stem_generative_spanish`
- `mmlu_other_generative_spanish`
- `mmlu_social_sciences_generative_spanish`
- `mmlu_humanities_generative_spanish`

### Checklist

For adding novel benchmarks/datasets to the library:

- [x] Is the task an existing benchmark in the literature?
- [x] Have you referenced the original paper that introduced the task?
- [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

- [ ] Is the "Main" variant of this task clearly denoted?
- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?

ver 1: PR #2705
First implementation
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
dataset_path: "amias-mx/mmlu-redux-2.0-spanish"
test_split: test
dataset_kwargs:
  trust_remote_code: true
output_type: generate_until
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nPor favor, responde con la letra correcta (A, B, C o D) sin absolutamente nada adicional, solo la letra correcta:"
doc_to_target: "{{['A','B','C','D'][answer]}}"
target_delimiter: ":"
generation_kwargs:
  until:
    - "</s>"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
filter_list:
  - name: default
    filter:
      - function: regex
        regex_pattern: "([ABCD])"
      - function: take_first
metadata:
  version: 3.0
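
The template above builds a four-option prompt, maps the integer `answer` index to a letter, extracts the first `A`-`D` from the model output, and scores it with a case- and punctuation-insensitive exact match. The following is a minimal standard-library sketch of that flow, not the harness's own code: `build_prompt`, `target_letter`, `extract_answer`, `normalize`, and `exact_match` are hypothetical names, the `"[invalid]"` fallback is an assumption, plain string handling stands in for the Jinja templates, and the sample document is invented for illustration.

```python
import re
import string

LETTERS = ["A", "B", "C", "D"]
INSTRUCTION = (
    "Por favor, responde con la letra correcta (A, B, C o D) "
    "sin absolutamente nada adicional, solo la letra correcta:"
)


def build_prompt(doc):
    """Plain-Python stand-in for the Jinja `doc_to_text` template."""
    lines = [doc["question"].strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, doc["choices"])]
    lines.append(INSTRUCTION)
    return "\n".join(lines)


def target_letter(doc):
    """Stand-in for `doc_to_target`: map the integer answer index to a letter."""
    return LETTERS[doc["answer"]]


def extract_answer(generation, fallback="[invalid]"):
    """Approximate `regex` followed by `take_first`: keep the first A-D match."""
    match = re.search(r"([ABCD])", generation)
    return match.group(1) if match else fallback


def normalize(text):
    """Approximate `ignore_case` and `ignore_punctuation` (ASCII punctuation only)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def exact_match(prediction, reference):
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


# Hypothetical document, not taken from the dataset.
doc = {
    "question": " ¿Cuánto es 2 + 2? ",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}
prompt = build_prompt(doc)
prediction = extract_answer("B.")
score = exact_match(prediction, target_letter(doc))
```

One consequence of the unanchored pattern `([ABCD])`, at least in this sketch, is that any earlier capital A-D wins: `extract_answer("Creo que es D")` returns `"C"` because of the leading `C` in "Creo". The pattern is also case-sensitive, so a lowercase-only reply falls through to the fallback before `ignore_case` is ever applied.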
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
group: mmlu_redux_spanish_generative
group_alias: mmlu_redux_spanish (generative)
task:
  - group: stem_spanish
    task:
      - mmlu_stem_generative_spanish
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: true
  - group: other_spanish
    task:
      - mmlu_other_generative_spanish
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: true
  - group: social sciences_spanish
    task:
      - mmlu_social_sciences_generative_spanish
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: true
  # - group: humanities_spanish
  #   task:
  #     - mmlu_humanities_generative_spanish
  #   aggregate_metric_list:
  #     - metric: exact_match
  #       weight_by_size: true
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: true
metadata:
  version: 3
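
With `weight_by_size: true`, a group score is presumably a mean of the member scores weighted by the number of documents in each member, so larger subgroups count for more. The sketch below illustrates that arithmetic only; `size_weighted_mean` is a hypothetical helper, and the scores and sizes are made-up numbers, not results from this benchmark.

```python
def size_weighted_mean(scores, sizes):
    """Mean of `scores` where each entry is weighted by its sample count."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one group must contain samples")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical exact_match scores and document counts per subgroup.
subgroup_scores = {
    "stem_spanish": 0.50,
    "other_spanish": 0.75,
    "social sciences_spanish": 1.00,
}
subgroup_sizes = {
    "stem_spanish": 200,
    "other_spanish": 100,
    "social sciences_spanish": 100,
}

names = list(subgroup_scores)
weighted = size_weighted_mean(
    [subgroup_scores[name] for name in names],
    [subgroup_sizes[name] for name in names],
)
unweighted = sum(subgroup_scores.values()) / len(subgroup_scores)
```

With these invented numbers the weighted mean is 0.6875 and the unweighted mean is 0.75. Note that while the `humanities_spanish` entry stays commented out, its documents contribute to neither figure.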
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "abstract_algebra"
"description":
  "The following are multiple choice questions (with answers) about abstract\
  \ algebra.\n\n"
"tag": "mmlu_stem_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_abstract_algebra_generative_spanish"
"task_alias": "abstract_algebra_spanish"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "anatomy"
"description":
  "The following are multiple choice questions (with answers) about anatomy.\n\
  \n"
"tag": "mmlu_stem_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_anatomy_generative_spanish"
"task_alias": "anatomy_spanish"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "astronomy"
"description":
  "The following are multiple choice questions (with answers) about astronomy.\n\
  \n"
"tag": "mmlu_stem_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_astronomy_generative_spanish"
"task_alias": "astronomy_spanish"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "business_ethics"
"description":
  "The following are multiple choice questions (with answers) about business\
  \ ethics.\n\n"
"tag": "mmlu_other_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_business_ethics_generative_spanish"
"task_alias": "business_ethics_spanish"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "clinical_knowledge"
"description":
  "The following are multiple choice questions (with answers) about clinical\
  \ knowledge.\n\n"
"tag": "mmlu_other_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_clinical_knowledge_generative_spanish"
"task_alias": "clinical_knowledge_spanish"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
"dataset_name": "college_biology"
"description":
  "The following are multiple choice questions (with answers) about college\
  \ biology.\n\n"
"tag": "mmlu_stem_generative_spanish"
"include": "_default_template_spanish_yaml"
"task": "mmlu_college_biology_generative_spanish"
"task_alias": "college_biology_spanish"

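The per-subject files above all follow one naming pattern, so they could plausibly be produced by a small script. The sketch below reproduces that pattern for the six subjects shown, using only the standard library. `subject_config` and `render_yaml` are hypothetical helpers, the subject-to-category mapping is copied from the `tag` values in the hunks above, and this is not claimed to be how the commit's files were actually generated.

```python
import json

# Subject -> category, taken from the "tag" fields of the hunks shown above.
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "stem",
    "anatomy": "stem",
    "astronomy": "stem",
    "business_ethics": "other",
    "clinical_knowledge": "other",
    "college_biology": "stem",
}


def subject_config(subject, category):
    """Build one per-subject config following the naming pattern in the diff."""
    readable = subject.replace("_", " ")
    return {
        "dataset_name": subject,
        "description": (
            "The following are multiple choice questions (with answers) "
            f"about {readable}.\n\n"
        ),
        "tag": f"mmlu_{category}_generative_spanish",
        "include": "_default_template_spanish_yaml",
        "task": f"mmlu_{subject}_generative_spanish",
        "task_alias": f"{subject}_spanish",
    }


def render_yaml(config):
    """Emit double-quoted `"key": "value"` lines, which are also valid YAML."""
    return "\n".join(
        f"{json.dumps(key)}: {json.dumps(value)}" for key, value in config.items()
    )


configs = {
    subject: subject_config(subject, category)
    for subject, category in SUBJECT_TO_CATEGORY.items()
}
rendered = render_yaml(configs["abstract_algebra"])
```

Unlike the committed files, this sketch keeps each `description` on a single line instead of folding it with a trailing backslash; both forms should parse to the same string.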