Commit d691843

fixed commit and add nltk download
1 parent 19dd9dc commit d691843

2 files changed (+10, -4 lines)


tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 3 additions & 0 deletions
@@ -28,6 +28,7 @@ Please install our lm-evaluation-harness and llama-recipe repo by following:
 ```
 git clone git@github.com:EleutherAI/lm-evaluation-harness.git
 cd lm-evaluation-harness
+git checkout a4987bba6e9e9b3f22bd3a6c1ecf0abd04fd5622
 pip install -e .[math,ifeval,sentencepiece,vllm]
 cd ../
 git clone git@github.com:meta-llama/llama-recipes.git
@@ -203,6 +204,8 @@ Here is the comparison between our reported numbers and the reproduced numbers i
 
 From the table above, we can see that most of our reproduced results are very close to our reported numbers on the [Meta Llama website](https://llama.meta.com/).
 
+**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval`, as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters).
+
 **NOTE**: On the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average scores, for the `MMLU-Pro` task, but here we reproduce the `micro_avg` metric, which is the average score over all individual samples; those `micro_avg` numbers can be found in [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).
 
 **NOTE**: The reproduced numbers may be slightly different: we observed differences of around ±0.01 between reproduce runs because the latest vLLM inference is not fully deterministic even with temperature=0. This behavior may be related to [this issue](https://github.com/vllm-project/vllm/issues/5404).
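As a rough illustration of the two metric NOTEs above (not part of this commit): the IFeval number is simply the mean of the two strict-accuracy metrics, and `macro_avg` vs `micro_avg` differ only in whether you average per-subtask means or pool every individual sample. The function and field names in this sketch are illustrative assumptions, not the actual lm-evaluation-harness output schema.

```python
# Minimal sketch of the two aggregations described above.
# Field and variable names are illustrative, not the harness's real schema.

def ifeval_final_score(results: dict) -> float:
    """Mean of the two strict-accuracy metrics, as used for the IFeval number."""
    return (
        results["inst_level_strict_acc,none"]
        + results["prompt_level_strict_acc,none"]
    ) / 2

def macro_avg(per_subtask_scores: dict[str, list[float]]) -> float:
    """Average of per-subtask averages (the number reported on the Meta Llama website)."""
    subtask_means = [sum(s) / len(s) for s in per_subtask_scores.values()]
    return sum(subtask_means) / len(subtask_means)

def micro_avg(per_subtask_scores: dict[str, list[float]]) -> float:
    """Average over all individual samples (the number reproduced here)."""
    all_scores = [x for s in per_subtask_scores.values() for x in s]
    return sum(all_scores) / len(all_scores)

# Subtasks of different sizes make the two averages diverge:
scores = {"law": [1.0, 0.0], "math": [1.0, 1.0, 1.0, 1.0]}
print(macro_avg(scores))  # 0.75   -> (0.5 + 1.0) / 2
print(micro_avg(scores))  # ~0.833 -> 5 correct out of 6 samples
```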

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/prepare_meta_eval.py

Lines changed: 7 additions & 4 deletions
@@ -2,11 +2,12 @@
 # This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.
 
 import argparse
-import errno, shutil
+import errno
+import shutil
 import glob
 import os
 from pathlib import Path
-
+import nltk
 import yaml
 from datasets import Dataset, load_dataset
 
@@ -51,7 +52,7 @@ def get_ifeval_data(model_name, output_dir):
         ]
     )
     joined.rename_column("output_prediction_text", "previous_output_prediction_text")
-    joined.to_parquet(output_dir + f"/joined_ifeval.parquet")
+    joined.to_parquet(output_dir + "/joined_ifeval.parquet")
 
 
 # get the math_hard data from the evals dataset and join it with the original math_hard dataset
@@ -94,7 +95,7 @@ def get_math_data(model_name, output_dir):
         "output_prediction_text", "previous_output_prediction_text"
     )
 
-    joined.to_parquet(output_dir + f"/joined_math.parquet")
+    joined.to_parquet(output_dir + "/joined_math.parquet")
 
 
 # get the question from the ifeval dataset
@@ -137,6 +138,8 @@ def change_yaml(args, base_name):
 
 # copy the files and change the yaml file to use the correct model name
 def copy_and_prepare(args):
+    # nltk punkt_tab package is needed
+    nltk.download('punkt_tab')
     if not os.path.exists(args.work_dir):
         # Copy the all files, including yaml files and python files, from template folder to the work folder
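The new `nltk.download('punkt_tab')` call above runs unconditionally every time `copy_and_prepare` is invoked. If that becomes noisy or slow, a common alternative is to probe the NLTK data path first and only download on a miss; this is a sketch of that pattern (an assumption, not part of this commit):

```python
import nltk

def ensure_punkt_tab() -> None:
    """Download the punkt_tab tokenizer data only if it is not already installed."""
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab")
```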
