
Commit cdb4253

Authored by younesbelkada, beacnascimento, CaioRhoden, DanielGardin, and GiovaniValdrighi
Feat: Add E2LMC team morai submission from NeurIPS 2026 competition (#3443)
* Add Morai submission
* add README
* move to `e2lmc/`

Co-authored-by: Beatriz <[email protected]>
Co-authored-by: Caio Emanuel Rhoden <[email protected]>
Co-authored-by: Daniel Gardin <[email protected]>
Co-authored-by: Giovani Valdrighi <[email protected]>
1 parent bc7a58b commit cdb4253

File tree

5 files changed: +68 −0 lines changed

lm_eval/api/metrics.py

Lines changed: 10 additions & 0 deletions

```diff
@@ -265,6 +265,16 @@ def perplexity_fn(items):  # This is a passthrough function
     return items


+@register_metric(
+    metric="likelihood",
+    higher_is_better=True,
+    output_type="multiple_choice",
+    aggregation="mean",
+)
+def likelihood_fn(items):  # This is a passthrough function
+    return items
+
+
 @register_metric(
     metric="word_perplexity",
     higher_is_better=False,
```

lm_eval/api/task.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -1691,6 +1691,7 @@ def process_results(self, doc, results):
                 if "brier_score" in use_metric
                 else {}
             ),
+            **({"likelihood": (gold, lls)} if "likelihood" in use_metric else {}),
         }

         if "acc_mutual_info" in use_metric:
```
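The one-line change above relies on Python's conditional dict-unpacking idiom: `**({...} if cond else {})` merges an entry into the surrounding dict literal only when the condition holds, and merges nothing otherwise. A minimal standalone illustration, with made-up stand-in values for `gold` and `lls`:

```python
# Toy illustration of the **({...} if cond else {}) idiom used in the diff.
# `use_metric`, `gold`, and `lls` are made-up stand-ins for the task.py values.
use_metric = {"acc", "likelihood"}
gold, lls = 1, [-2.0, -0.5, -3.0]

result = {
    "acc": 1.0,
    # This entry appears only because "likelihood" is in use_metric;
    # with use_metric = {"acc"} the dict would contain "acc" alone.
    **({"likelihood": (gold, lls)} if "likelihood" in use_metric else {}),
}
print(result)
```

The idiom avoids a separate `if` statement mutating the dict after construction, which is why a single-line diff suffices here.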
Lines changed: 23 additions & 0 deletions (new file)

```markdown
# MMLU Early Training

This is an update of the MMLU benchmark (Hendrycks et al.) to evaluate models at early training stages. MMLU consists of multiple-choice questions from various branches of knowledge, such as the humanities, social sciences, and hard sciences. The dataset was created by selecting the 50% "easiest" questions, that is, questions for which there is a learning signal at the early training stages. Furthermore, although MMLU is a multiple-choice dataset, in this task the choices are not present in the prompt, and the target is the complete choice text, not only the choice letter. To further ensure signal at early training stages, the metric is defined as the difference between the log-likelihood of the correct choice and the average log-likelihood of the incorrect choices.

### Groups, Tags, and Tasks

#### Groups

- Not part of a group yet.

#### Tasks

- `mmlu_early_training`

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
  * [ ] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```
Lines changed: 14 additions & 0 deletions (new file)

```python
import numpy as np


def loglikelihood_diff(items):
    """Mean, over documents, of the difference between the correct choice's
    log-likelihood and the average log-likelihood of the incorrect choices."""
    diffs = []
    for item in items:
        target, lls = item
        target_ll = lls[target]
        others_ll = [ll for i, ll in enumerate(lls) if i != target]
        mean_others_ll = np.mean(others_ll)
        diff = target_ll - mean_others_ll
        diffs.append(diff)

    return np.mean(diffs)
```
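As a quick sanity check, the aggregation above can be applied to two made-up documents, each a `(target_index, per_choice_log_likelihoods)` pair like the ones `process_results` emits under the `"likelihood"` key (the numbers here are invented for illustration):

```python
import numpy as np


def loglikelihood_diff(items):
    # Same aggregation as custom_metrics.py above.
    diffs = []
    for target, lls in items:
        target_ll = lls[target]
        others_ll = [ll for i, ll in enumerate(lls) if i != target]
        diffs.append(target_ll - np.mean(others_ll))
    return np.mean(diffs)


# Two toy documents (invented log-likelihoods):
# doc 1: correct choice at index 0, diff = -1.0 - mean(-3.0, -5.0) = 3.0
# doc 2: correct choice at index 2, diff = -1.0 - mean(-4.0, -2.0) = 2.0
items = [
    (0, [-1.0, -3.0, -5.0]),
    (2, [-4.0, -2.0, -1.0]),
]
print(loglikelihood_diff(items))  # -> 2.5
```

Larger values mean the model separates the correct continuation from the distractors more strongly, which is the early-training signal the README describes.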
Lines changed: 20 additions & 0 deletions (new file)

```yaml
task: mmlu_early_training
dataset_path: giovanivaldrighi/mmlu
test_split: test
fewshot_split: dev
num_fewshot: 5
fewshot_config:
  sampler: first_n
output_type: multiple_choice
description: "The following are multiple choice questions (with answers) about {{subject}}.\n\n"
doc_to_text: "Question: {{question.strip()}}\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metric_list:
  - metric: likelihood
    aggregation: !function custom_metrics.loglikelihood_diff
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```
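The `!function custom_metrics.loglikelihood_diff` entry is a YAML tag that the harness resolves to a Python callable. As an illustration only (this is not lm-eval's actual loader), such a tag can be resolved with a custom PyYAML constructor; `statistics.mean` stands in here for the real aggregation function:

```python
# Hypothetical sketch of resolving a "!function module.attr" YAML tag
# with PyYAML; lm-eval's real loader may differ.
import importlib

import yaml


def function_constructor(loader, node):
    # Resolve a "module.attribute" scalar to the imported callable.
    module_name, attr = loader.construct_scalar(node).rsplit(".", 1)
    return getattr(importlib.import_module(module_name), attr)


yaml.SafeLoader.add_constructor("!function", function_constructor)

config = yaml.safe_load("aggregation: !function statistics.mean")
print(config["aggregation"]([1, 2, 3]))  # -> 2
```

Registering the constructor on `SafeLoader` keeps the rest of the document restricted to plain YAML types while allowing this one tag through.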

0 commit comments