Commit a7ca043

FixBug: Align HumanEval with official results for Llama-3.1-70B-Instruct (#3092)

* Fix: Align the HumanEval instruct task with the official results.
  Details: (1) Modified `doc_to_text` and `gen_prefix` in `humaneval_instruct.yaml` to match the prompt used in `meta-llama/Llama-3.1-70B-Instruct-evals`. (2) Changed `r.rfind` to `r.find` so the parser locates the first "```" fence rather than the last one.
  Results: Partially reproduced the official numbers: Llama-3.1-8B-Instruct scores 66.5 (official: 72.6) and Llama-3.1-70B-Instruct scores 80.5 (official: 80.5). Ref: PR #2650.
* Add changelog entry and bump the task version.
1 parent fea4d11 commit a7ca043

File tree

3 files changed (+6 −4 lines)


lm_eval/tasks/humaneval/README.md
Lines changed: 2 additions & 0 deletions

@@ -50,3 +50,5 @@ If other tasks on this dataset are already supported:
 
 ### Changelog
 v2 20-MAR-2025: `humaneval_instruct`, `humaneval_instruct_64`: fixed typo in gen_prefix
+
+v3 30-JUN-2025: Updated prompt generation and output parsing to align with the official `Llama-3.1-70B-Instruct-evals`. This corrects the prompt format and fixes a bug in locating the code block. See PR [#3092](https://github.com/EleutherAI/lm-evaluation-harness/pull/3092).
lm_eval/tasks/humaneval/humaneval_instruct.yaml
Lines changed: 3 additions & 3 deletions

@@ -1,11 +1,11 @@
 include: humaneval.yaml
 task: humaneval_instruct
-doc_to_text: "Write a solution to the following problem and make sure that it passes the tests:\n```{{prompt}}"
-gen_prefix: "Here is the completed function:\n```python\n{{prompt}}\n"
+doc_to_text: 'Write a solution to the following problem and make sure that it passes the tests:\n```python\n{{ prompt }}\n```\n '
+gen_prefix: 'Here is the completed function:\n```python\n{{ prompt }}\n '
 filter_list:
   - name: "create_test"
     filter:
       - function: "custom"
         filter_fn: !function utils.build_predictions_instruct
 metadata:
-  version: 2.0
+  version: 3.0
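To see what the updated `doc_to_text` template produces, here is a minimal sketch that renders it for a toy problem. A plain `str.replace` stands in for the harness's Jinja2 rendering, and the sample `prompt` is invented for illustration:

```python
# Updated doc_to_text template, with real newlines in place of the
# escaped "\n" sequences shown in the YAML diff.
doc_to_text = (
    "Write a solution to the following problem and make sure that it "
    "passes the tests:\n```python\n{{ prompt }}\n```\n "
)

# Toy HumanEval-style prompt (invented for illustration).
prompt = 'def add(a, b):\n    """Return the sum of a and b."""'

# Stand-in for Jinja2 rendering of the {{ prompt }} placeholder.
rendered = doc_to_text.replace("{{ prompt }}", prompt)
print(rendered)
```

Unlike the old template, the rendered prompt now opens the fence with `python` and closes it after the problem, matching the official evals format.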

lm_eval/tasks/humaneval/utils.py
Lines changed: 1 addition & 1 deletion

@@ -32,7 +32,7 @@ def build_predictions_instruct(
 ) -> list[list[str]]:
     return [
         [
-            doc["prompt"] + (r if r.rfind("```") == -1 else r[: r.rfind("```")])
+            doc["prompt"] + (r if r.find("```") == -1 else r[: r.find("```")])
             for r in resp
         ]
         for resp, doc in zip(resps, docs)
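The one-character change in `utils.py` matters when a model emits more than one fenced block, e.g. its completion followed by a usage example. An illustrative sketch (not the harness code; the sample response string is invented):

```python
# An instruct model often closes the completion fence and then appends an
# extra fenced usage example. Sample response (invented for illustration):
response = (
    "    return a + b\n"
    "```\n"
    "You can test it like this:\n"
    "```python\n"
    "print(add(1, 2))\n"
    "```\n"
)

# Old parsing: rfind locates the LAST fence, so the prose and the usage
# block leak into the extracted code.
old = response if response.rfind("```") == -1 else response[: response.rfind("```")]

# New parsing: find locates the FIRST fence, keeping only the completion body.
new = response if response.find("```") == -1 else response[: response.find("```")]
```

With `rfind`, the extracted "code" still contains the prose and the second fenced block, which breaks execution-based scoring; with `find`, only the function body survives.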
