
Commit a4752cc

its-alpesh and baberabb authored
Add humaneval_infilling task (#3299)
* Add humaneval_infilling task
* pacify pre-commit

---------

Co-authored-by: Baber Abbasi <[email protected]>
1 parent 6b8ec14 commit a4752cc

File tree

8 files changed (+134 −1 lines)


lm_eval/tasks/README.md

Lines changed: 2 additions & 1 deletion
@@ -79,14 +79,15 @@ provided to the individual README.md files for each subfolder.
 | [histoires_morales](histoires_morales/README.md) | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
 | [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
 | [humaneval](humaneval/README.md) | Code generation task that measures functional correctness for synthesizing programs from docstrings. | Python |
+| [humaneval_infilling](humaneval_infilling/README.md) | Code generation task that measures fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
 | [icelandic_winogrande](icelandic_winogrande/README.md) | Manually translated and localized version of the [WinoGrande](winogrande/README.md) commonsense reasoning benchmark for Icelandic. | Icelandic |
 | [ifeval](ifeval/README.md) | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English |
 | [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
 | [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
 | [jsonschema_bench](jsonschema_bench/README.md) | Evaluate the ability of LLMs to generate JSON objects that conform to a given JSON schema, including API, configuration files, and other structured data formats. | JSON |
 | [kbl](kbl/README.md) | Korean Benchmark for Legal Language Understanding. | Korean |
 | [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
-| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
+| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
 | [kormedmcqa](kormedmcqa/README.md) | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
 | [lambada](lambada/README.md) | Tasks designed to predict the endings of text passages, testing language prediction skills. | English |
 | [lambada_cloze](lambada_cloze/README.md) | Cloze-style LAMBADA dataset. | English |
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# Humaneval-Infilling

### Paper

Title: Efficient Training of Language Models to Fill in the Middle
Abstract: https://arxiv.org/pdf/2207.14255

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.

Homepage: https://github.com/openai/human-eval-infilling


### Citation

```
@article{bavarian2022efficient,
  title={Efficient Training of Language Models to Fill in the Middle},
  author={Bavarian, Mohammad and Jun, Heewoo and Tezak, Nikolas and Schulman, John and McLeavey, Christine and Tworek, Jerry and Chen, Mark},
  journal={arXiv preprint arXiv:2207.14255},
  year={2022}
}
```

### Groups and Tasks

#### Groups

- `humaneval_infilling`

This dataset has 4 subsets: HumanEval-MultiLineInfilling, HumanEval-SingleLineInfilling, HumanEval-RandomSpanInfilling, and HumanEval-RandomSpanInfillingLight. The single-line, multi-line, and random-span infilling subsets and the light random-span variant contain 1033, 5815, 1640, and 164 problems, respectively.

#### Tasks

- `humaneval_single_line_infilling`
- `humaneval_multi_line_infilling`
- `humaneval_random_span_infilling`
- `humaneval_random_span_infilling_light`

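A minimal sketch of how this group might be run through the harness's Python API, assuming a recent lm-evaluation-harness that exposes `simple_evaluate` and the `confirm_run_unsafe_code` option (the task configs set `unsafe_code: true`), and with execution of model-generated code enabled for the underlying `code_eval` metric; the model below is only a placeholder:

```python
import os

import lm_eval

# pass@k is scored with the HF `code_eval` metric, which executes model-generated
# code and refuses to run unless this is explicitly opted into.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/santacoder",  # placeholder; any FIM-capable model
    tasks=["humaneval_infilling"],               # or any of the subtasks listed above
    confirm_run_unsafe_code=True,                # assumed kwarg; mirrors the CLI flag for unsafe_code tasks
)
print(results["results"])
```
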
### Checklist

For adding novel benchmarks/datasets to the library:

- [ ] Is the task an existing benchmark in the literature?
  - [ ] Have you referenced the original paper that introduced the task?
  - [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

- [ ] Is the "Main" variant of this task clearly denoted?
- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
group: humaneval_infilling
task:
  - humaneval_multi_line_infilling
  - humaneval_single_line_infilling
  - humaneval_random_span_infilling
  - humaneval_random_span_infilling_light
aggregate_metric_list:
  - metric: pass@1
    aggregation: mean
    weight_by_size: false
metadata:
  version: 1.0
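Because `weight_by_size` is `false`, the group score for `humaneval_infilling` should be the plain (macro) mean of the four subset `pass@1` values rather than a mean weighted by subset size. A tiny illustration with purely hypothetical scores:

```python
# Purely hypothetical per-subset pass@1 values, for illustration only.
subset_pass_at_1 = {
    "humaneval_single_line_infilling": 0.60,
    "humaneval_multi_line_infilling": 0.25,
    "humaneval_random_span_infilling": 0.40,
    "humaneval_random_span_infilling_light": 0.45,
}
# Problem counts per subset, as stated in the README above.
subset_sizes = {
    "humaneval_single_line_infilling": 1033,
    "humaneval_multi_line_infilling": 5815,
    "humaneval_random_span_infilling": 1640,
    "humaneval_random_span_infilling_light": 164,
}

# weight_by_size: false -> unweighted mean over subsets.
macro = sum(subset_pass_at_1.values()) / len(subset_pass_at_1)

# For contrast, a size-weighted mean would look like this.
weighted = sum(
    subset_pass_at_1[t] * subset_sizes[t] for t in subset_pass_at_1
) / sum(subset_sizes.values())

print(f"macro={macro:.4f} weighted={weighted:.4f}")
```
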
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
task: humaneval_multi_line_infilling
dataset_path: loubnabnl/humaneval_infilling
dataset_name: HumanEval-MultiLineInfilling
unsafe_code: true
output_type: generate_until
test_split: test
doc_to_text: "{{suffix}}\n\n{{prompt}}"
doc_to_target: "{{test}}\ncheck({{entry_point}})"
metric_list:
  - metric: !function utils.pass_at_k
    aggregation: mean
    higher_is_better: true
    k: [1]
generation_kwargs:
  max_gen_toks: 1024
  do_sample: false
repeats: 1
num_fewshot: 0
filter_list:
  - name: "create_test"
    filter:
      - function: "custom"
        filter_fn: !function utils.build_predictions
metadata:
  version: 1.0
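To make the prompt and filter plumbing concrete, here is a small sketch of how a single document would be rendered and how a completion is stitched back together before scoring; the toy document is hypothetical and abbreviated, but its field names (`prompt`, `suffix`, `test`, `entry_point`) are the ones referenced by the config above:

```python
# Hypothetical, abbreviated infilling document.
doc = {
    "prompt": "def add(a, b):\n    ",       # code before the missing span
    "suffix": "\n    return result\n",      # code after the missing span
    "test": "def check(f):\n    assert f(2, 3) == 5\n",
    "entry_point": "add",
}

# doc_to_text is "{{suffix}}\n\n{{prompt}}": the model is shown the suffix first,
# then the prefix, and is expected to generate the missing middle.
model_input = f"{doc['suffix']}\n\n{doc['prompt']}"

# Suppose the model generates the middle span:
completion = "result = a + b"

# The "create_test" filter (utils.build_predictions) stitches prefix + completion + suffix
# back into a complete candidate program ...
candidate_program = doc["prompt"] + completion + doc["suffix"]

# ... which code_eval then runs against doc_to_target, "{{test}}\ncheck({{entry_point}})".
reference = f"{doc['test']}\ncheck({doc['entry_point']})"

print(candidate_program)
print(reference)
```
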
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: multi_line_infilling.yaml
task: humaneval_random_span_infilling
dataset_name: HumanEval-RandomSpanInfilling
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: multi_line_infilling.yaml
task: humaneval_random_span_infilling_light
dataset_name: HumanEval-RandomSpanInfillingLight
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
include: multi_line_infilling.yaml
task: humaneval_single_line_infilling
dataset_name: HumanEval-SingleLineInfilling
generation_kwargs:
  until:
    - "\n"
  max_gen_toks: 1024
  do_sample: false
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
import evaluate as hf_evaluate


try:
    # Load the HF `code_eval` metric once at import time and run a tiny sanity
    # check (requires HF_ALLOW_CODE_EVAL=1, since the metric executes generated code).
    compute_ = hf_evaluate.load("code_eval")
    test_cases = ["assert add(2, 3)==5"]
    candidates = [["def add(a,b): return a*b"]]
    results = compute_.compute(references=test_cases, predictions=candidates, k=[1])
except Exception as e:
    raise e


def pass_at_k(references: list[str], predictions: list[list[str]], k: list[int] = None):
    # Score pass@k with the shared `code_eval` metric; `compute` returns
    # (scores_dict, per-sample results), and only the scores are returned here.
    global compute_
    assert k is not None
    if isinstance(k, int):
        k = [k]
    res = compute_.compute(
        references=references,
        predictions=predictions,
        k=k,
    )
    return res[0]


def build_predictions(resps: list[list[str]], docs: list[dict]) -> list[list[str]]:
    # Stitch each generated middle span back between the document's prefix
    # (`prompt`) and `suffix` so every candidate is a complete program.
    return [
        [doc["prompt"] + r + doc["suffix"] for r in resp]
        for resp, doc in zip(resps, docs)
    ]
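For reference, a minimal sketch of calling `pass_at_k` directly, assuming the module above is importable as `utils` and that code execution is opted into via `HF_ALLOW_CODE_EVAL` before it is imported (the `code_eval` metric refuses to execute untrusted code otherwise):

```python
import os

# Must be set before importing utils, since utils runs a code_eval sanity check at import time.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

from utils import pass_at_k  # assumed import path for the module shown above

# One test program per problem (mirrors doc_to_target: "{{test}}\ncheck({{entry_point}})").
references = ["def check(f):\n    assert f(2, 3) == 5\n\ncheck(add)"]
# A list of candidate programs per problem; here a single stitched-together solution.
predictions = [["def add(a, b):\n    return a + b"]]

scores = pass_at_k(references=references, predictions=predictions, k=[1])
print(scores)  # e.g. {'pass@1': 1.0}
```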

0 commit comments