Commit 57648ba

Add uncheatable eval
Author: Ziqing Huang
1 parent f83f960 commit 57648ba

20 files changed: +629 -0 lines changed

lm_eval/tasks/README.md

Lines changed: 1 addition & 0 deletions
@@ -179,6 +179,7 @@ provided to the individual README.md files for each subfolder.
 | [truthfulqa-multi](truthfulqa-multi/README.md) | Is a multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
 | [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
 | [turblimp_core](turblimp/README.md) | A benchmark evaluating language models' grammatical capabilities in Turkish based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
+| [uncheatable_eval](uncheatable_eval/README.md) | Rolling perplexity benchmark built from Uncheatable Eval dumps covering Wikipedia, GitHub, BBC, arXiv, and AO3 domains scraped after mid-2024. | English, Spanish, French, German, Japanese, Arabic, Chinese, Python, C++ |
 | [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
 | [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
 | [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Uncheatable Eval
+
+These tasks evaluate autoregressive language models on [Uncheatable Eval](https://github.com/Jellyfish042/uncheatable_eval). Each task measures rolling log-likelihood over freshly scraped documents from Wikipedia, GitHub, BBC, arXiv, and AO3.
+
+### Citation
+
+```text
+@software{uncheatable_eval,
+  author = {Jellyfish042},
+  title = {Uncheatable Eval},
+  month = may,
+  year = 2024,
+  publisher = {Zenodo},
+  version = {0.1},
+  doi = {10.5281/zenodo.11284692},
+  url = {https://zenodo.org/record/11284692}
+}
+```
+
+### Groups, Tags, and Tasks
+
+#### Groups
+
+* `uncheatable_eval`: aggregates Wikipedia (English), GitHub (Python/C++), BBC News, arXiv (physics + CS), and AO3 (English).
+* `uncheatable_eval_full`: spans every available Uncheatable Eval dump, including all supported Wikipedia languages plus GitHub, BBC, arXiv, and AO3 (English + Chinese).
+
+#### Tags
+
+* `uncheatable_eval`
+
+#### Tasks
+
+* `uncheatable_eval_wikipedia_english`
+* `uncheatable_eval_wikipedia_spanish`
+* `uncheatable_eval_wikipedia_french`
+* `uncheatable_eval_wikipedia_german`
+* `uncheatable_eval_wikipedia_japanese`
+* `uncheatable_eval_wikipedia_arabic`
+* `uncheatable_eval_wikipedia_chinese`
+* `uncheatable_eval_github_python`
+* `uncheatable_eval_github_cpp`
+* `uncheatable_eval_bbc_news`
+* `uncheatable_eval_arxiv_physics`
+* `uncheatable_eval_arxiv_computer_science`
+* `uncheatable_eval_ao3_english`
+* `uncheatable_eval_ao3_chinese`
+
+### Changelog
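
The README describes each task as a rolling log-likelihood measurement. As a rough illustration of that idea, the sketch below splits a token sequence into consecutive windows so that every token is scored exactly once, each window conditioned on the token that precedes it. This is a simplified sketch, not the harness's own windowing code: `rolling_windows`, `prefix_token`, and `max_len` are hypothetical names, and the token IDs are made up.

```python
def rolling_windows(tokens, prefix_token, max_len):
    """Split `tokens` into (inputs, targets) windows of at most `max_len` targets.

    Hypothetical helper for illustration only. Each target token appears in
    exactly one window, so summing per-window log-likelihoods gives the
    log-likelihood of the whole document.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        # The first window is conditioned on a prefix (e.g. BOS) token;
        # later windows are conditioned on the last token of the previous one.
        first_input = prefix_token if start == 0 else tokens[start - 1]
        inputs = [first_input] + tokens[start:end - 1]
        targets = tokens[start:end]
        windows.append((inputs, targets))
        start = end
    return windows


# Made-up token IDs standing in for one tokenized document.
document_tokens = [11, 12, 13, 14, 15]
for inputs, targets in rolling_windows(document_tokens, prefix_token=0, max_len=2):
    print(inputs, "->", targets)
```

A real implementation may give later windows more than one token of context. The sketch keeps a single conditioning token to make the "each token scored once" property easy to see.
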
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+output_type: loglikelihood_rolling
+test_split: test
+doc_to_text: ""
+doc_to_target: "{{text}}"
+description: >-
+  Rolling log-likelihood evaluation over deduplicated Uncheatable Eval
+  documents sourced from freshly scraped corpora.
+tag:
+  - uncheatable_eval
+should_decontaminate: false
+metric_list:
+  - metric: word_perplexity
+    aggregation: weighted_perplexity
+    higher_is_better: false
+  - metric: byte_perplexity
+    aggregation: weighted_perplexity
+    higher_is_better: false
+  - metric: bits_per_byte
+    aggregation: bits_per_byte
+    higher_is_better: false
+metadata:
+  version: 1.0
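
The three metrics in the base config all derive from the same quantity: summed log-likelihood divided by a unit count (words or bytes). The sketch below shows how such aggregations can be computed. It assumes each document contributes a `(log_likelihood_in_nats, unit_count)` pair. The function names echo the aggregation names in the config, but the bodies are an illustrative reimplementation and the numbers are invented.

```python
import math


def weighted_mean(items):
    """items: (log_likelihood, weight) pairs; returns sum(ll) / sum(weight)."""
    log_likelihoods, weights = zip(*items)
    return sum(log_likelihoods) / sum(weights)


def weighted_perplexity(items):
    """exp of the negative log-likelihood per unit (word or byte)."""
    return math.exp(-weighted_mean(items))


def bits_per_byte(items):
    """Negative log-likelihood per byte, converted from nats to bits."""
    return -weighted_mean(items) / math.log(2)


# Invented per-document results: (summed log-likelihood in nats, unit count).
byte_items = [(-120.0, 100), (-60.0, 50)]  # weights are byte counts
word_items = [(-120.0, 20), (-60.0, 10)]   # weights are word counts

print("word_perplexity:", weighted_perplexity(word_items))
print("byte_perplexity:", weighted_perplexity(byte_items))
print("bits_per_byte:  ", bits_per_byte(byte_items))
```

Because the weights are pooled before dividing, long documents count for more than short ones. Byte perplexity and bits per byte carry the same information, since byte perplexity equals 2 raised to bits per byte.
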
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+group: uncheatable_eval
+group_alias: Uncheatable Eval (core domains)
+task:
+  - uncheatable_eval_wikipedia_english
+  - uncheatable_eval_github_python
+  - uncheatable_eval_github_cpp
+  - uncheatable_eval_bbc_news
+  - uncheatable_eval_arxiv_physics
+  - uncheatable_eval_arxiv_computer_science
+  - uncheatable_eval_ao3_english
+aggregate_metric_list:
+  - metric: word_perplexity
+    aggregation: mean
+    weight_by_size: true
+  - metric: byte_perplexity
+    aggregation: mean
+    weight_by_size: true
+  - metric: bits_per_byte
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
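
The group config asks for a `mean` aggregation with `weight_by_size: true`. A minimal sketch of what a size-weighted mean over subtask scores looks like follows, assuming "size" means the number of documents in each subtask. `size_weighted_mean`, the task names, and all numbers are hypothetical.

```python
def size_weighted_mean(scores, sizes):
    """Average per-subtask scores, weighting each by its subtask size."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Invented bits-per-byte scores and document counts for two subtasks.
subtask_scores = {"task_a": 0.9, "task_b": 1.5}
subtask_sizes = {"task_a": 3000, "task_b": 1000}

names = list(subtask_scores)
weighted = size_weighted_mean(
    [subtask_scores[n] for n in names],
    [subtask_sizes[n] for n in names],
)
unweighted = sum(subtask_scores.values()) / len(subtask_scores)
print("size-weighted mean:", weighted)
print("unweighted mean:   ", unweighted)
```

With these numbers the larger subtask pulls the weighted result toward its own score (about 1.05 versus 1.2 unweighted). Note that averaging per-task perplexities this way is not the same as computing one perplexity over the pooled documents.
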
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+include: _uncheatable_eval_base.yaml
+task: uncheatable_eval_ao3_chinese
+task_alias: UE_ao3_zh
+description: >-
+  Rolling perplexity on Uncheatable Eval Archive of Our Own fanfiction (Chinese)
+  scraped after mid-2024.
+custom_dataset: !function uncheatable_eval_utils.load_uncheatable_eval
+dataset_kwargs:
+  dataset: ao3_chinese
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+include: _uncheatable_eval_base.yaml
+task: uncheatable_eval_ao3_english
+task_alias: UE_ao3_en
+description: >-
+  Rolling perplexity on Uncheatable Eval Archive of Our Own fanfiction (English)
+  scraped after mid-2024.
+custom_dataset: !function uncheatable_eval_utils.load_uncheatable_eval
+dataset_kwargs:
+  dataset: ao3_english
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+include: _uncheatable_eval_base.yaml
+task: uncheatable_eval_arxiv_computer_science
+task_alias: UE_arxiv_cs
+description: >-
+  Rolling perplexity on Uncheatable Eval arXiv computer science papers and
+  abstracts downloaded after mid-2024.
+custom_dataset: !function uncheatable_eval_utils.load_uncheatable_eval
+dataset_kwargs:
+  dataset: arxiv_computer_science
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+include: _uncheatable_eval_base.yaml
+task: uncheatable_eval_arxiv_physics
+task_alias: UE_arxiv_ph
+description: >-
+  Rolling perplexity on Uncheatable Eval arXiv physics papers and abstracts
+  downloaded after mid-2024.
+custom_dataset: !function uncheatable_eval_utils.load_uncheatable_eval
+dataset_kwargs:
+  dataset: arxiv_physics
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+include: _uncheatable_eval_base.yaml
+task: uncheatable_eval_bbc_news
+task_alias: UE_bbc_news
+description: >-
+  Rolling perplexity on Uncheatable Eval BBC News articles harvested after
+  mid-2024.
+custom_dataset: !function uncheatable_eval_utils.load_uncheatable_eval
+dataset_kwargs:
+  dataset: bbc_news
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+group: uncheatable_eval_full
+group_alias: Uncheatable Eval (full)
+task:
+  - uncheatable_eval_wikipedia_english
+  - uncheatable_eval_wikipedia_spanish
+  - uncheatable_eval_wikipedia_french
+  - uncheatable_eval_wikipedia_german
+  - uncheatable_eval_wikipedia_japanese
+  - uncheatable_eval_wikipedia_arabic
+  - uncheatable_eval_wikipedia_chinese
+  - uncheatable_eval_github_python
+  - uncheatable_eval_github_cpp
+  - uncheatable_eval_bbc_news
+  - uncheatable_eval_arxiv_physics
+  - uncheatable_eval_arxiv_computer_science
+  - uncheatable_eval_ao3_english
+  - uncheatable_eval_ao3_chinese
+aggregate_metric_list:
+  - metric: word_perplexity
+    aggregation: mean
+    weight_by_size: true
+  - metric: byte_perplexity
+    aggregation: mean
+    weight_by_size: true
+  - metric: bits_per_byte
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
