Commit a7ca043

FixBug: Align HumanEval with official results for Llama-3.1-70B-Instruct (#3092)

* Fix: Align the HumanEval instruct task with the official results.
  Details: (1) Modified `doc_to_text` and `gen_prefix` in `humaneval_instruct.yaml` to match the prompt used in `meta-llama/Llama-3.1-70B-Instruct-evals`. (2) Changed `r.rfind` to `r.find` so the parser locates the first "```" fence rather than the last one.
  Results: Partially reproduced the official numbers: Llama-3.1-8B-Instruct scores 66.5 (official: 72.6) and Llama-3.1-70B-Instruct scores 80.5 (official: 80.5). Ref: PR #2650.
* Add changelog entry and bump the task version.
1 parent fea4d11 commit a7ca043

File tree

3 files changed (+6 −4 lines)


lm_eval/tasks/humaneval/README.md
Lines changed: 2 additions & 0 deletions

@@ -50,3 +50,5 @@ If other tasks on this dataset are already supported:
 
 ### Changelog
 v2 20-MAR-2025: `humaneval_instruct`, `humaneval_instruct_64`: fixed typo in gen_prefix
+
+v3 30-JUN-2025: Updated prompt generation and output parsing to align with the official `Llama-3.1-70B-Instruct-evals`. This corrects the prompt format and fixes a bug in locating the code block. See PR [#3092](https://github.com/EleutherAI/lm-evaluation-harness/pull/3092).
lm_eval/tasks/humaneval/humaneval_instruct.yaml
Lines changed: 3 additions & 3 deletions

@@ -1,11 +1,11 @@
 include: humaneval.yaml
 task: humaneval_instruct
-doc_to_text: "Write a solution to the following problem and make sure that it passes the tests:\n```{{prompt}}"
-gen_prefix: "Here is the completed function:\n```python\n{{prompt}}\n"
+doc_to_text: 'Write a solution to the following problem and make sure that it passes the tests:\n```python\n{{ prompt }}\n```\n '
+gen_prefix: 'Here is the completed function:\n```python\n{{ prompt }}\n '
 filter_list:
   - name: "create_test"
     filter:
       - function: "custom"
         filter_fn: !function utils.build_predictions_instruct
 metadata:
-  version: 2.0
+  version: 3.0
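To see what the updated `doc_to_text` template produces, here is a minimal sketch that renders it for a toy problem. A plain `str.replace` stands in for the harness's Jinja2 rendering, and the sample `prompt` is invented for illustration:

```python
# Updated doc_to_text template, with real newlines in place of the
# escaped "\n" sequences shown in the YAML diff.
doc_to_text = (
    "Write a solution to the following problem and make sure that it "
    "passes the tests:\n```python\n{{ prompt }}\n```\n "
)

# Toy HumanEval-style prompt (invented for illustration).
prompt = 'def add(a, b):\n    """Return the sum of a and b."""'

# Stand-in for Jinja2 rendering of the {{ prompt }} placeholder.
rendered = doc_to_text.replace("{{ prompt }}", prompt)
print(rendered)
```

Unlike the old template, the rendered prompt now opens the fence with `python` and closes it after the problem, matching the official evals format.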

lm_eval/tasks/humaneval/utils.py
Lines changed: 1 addition & 1 deletion

@@ -32,7 +32,7 @@ def build_predictions_instruct(
 ) -> list[list[str]]:
     return [
         [
-            doc["prompt"] + (r if r.rfind("```") == -1 else r[: r.rfind("```")])
+            doc["prompt"] + (r if r.find("```") == -1 else r[: r.find("```")])
             for r in resp
         ]
         for resp, doc in zip(resps, docs)
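The one-character change in `utils.py` matters when a model emits more than one fenced block, e.g. its completion followed by a usage example. An illustrative sketch (not the harness code; the sample response string is invented):

```python
# An instruct model often closes the completion fence and then appends an
# extra fenced usage example. Sample response (invented for illustration):
response = (
    "    return a + b\n"
    "```\n"
    "You can test it like this:\n"
    "```python\n"
    "print(add(1, 2))\n"
    "```\n"
)

# Old parsing: rfind locates the LAST fence, so the prose and the usage
# block leak into the extracted code.
old = response if response.rfind("```") == -1 else response[: response.rfind("```")]

# New parsing: find locates the FIRST fence, keeping only the completion body.
new = response if response.find("```") == -1 else response[: response.find("```")]
```

With `rfind`, the extracted "code" still contains the prose and the second fenced block, which breaks execution-based scoring; with `find`, only the function body survives.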
