Commit d74507a

use pip install for lm_eval
1 parent d691843 commit d74507a

File tree (1 file changed: +2 −6 lines)

  • tools/benchmarks/llm_eval_harness/meta_eval_reproduce


tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 2 additions & 6 deletions
@@ -26,11 +26,7 @@ Given those differences, our reproduced number can not be compared to the number
 Please install our lm-evaluation-harness and llama-recipe repo by following:
 
 ```
-git clone git@github.com:EleutherAI/lm-evaluation-harness.git
-cd lm-evaluation-harness
-git checkout a4987bba6e9e9b3f22bd3a6c1ecf0abd04fd5622
-pip install -e .[math,ifeval,sentencepiece,vllm]
-cd ../
+pip install lm-eval[math,ifeval,sentencepiece,vllm]==0.4.3
 git clone git@github.com:meta-llama/llama-recipes.git
 cd llama-recipes
 pip install -U pip setuptools
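Not part of the commit: a minimal sanity-check sketch for confirming that the pinned release above actually resolved, assuming a standard Python environment and using only the standard library. The expected version string simply mirrors the `==0.4.3` pin in the diff.

```python
# Illustrative sanity check (not part of this commit): verify which lm-eval
# release got installed and that the package imports cleanly.
from importlib import metadata

import lm_eval  # fails fast if the install is broken

# On recent Python versions the distribution lookup normalizes the name, so
# "lm_eval" and "lm-eval" both resolve to the same installed package.
installed = metadata.version("lm_eval")
assert installed == "0.4.3", f"expected lm-eval 0.4.3, found {installed}"
print(f"lm-eval {installed} is installed")
```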
@@ -204,7 +200,7 @@ Here is the comparison between our reported numbers and the reproduced numbers i
 
 From the table above, we can see that most of our reproduced results are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).
 
-**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval` as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters)
+**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval` as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters).
 
 **NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are reproducing the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).
 
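The two notes in the hunk above describe metric aggregation rather than any code in this repo. As a rough illustration, with made-up numbers and a made-up subtask split (only the IFeval metric keys come from the note; nothing here is actual harness output), the arithmetic is:

```python
# Illustrative sketch only: the IFeval metric keys mirror the note above, but
# every numeric value and the MMLU-Pro subtask dictionary are invented examples.

# IFeval: the reported number is the plain average of the two strict-accuracy metrics.
ifeval_results = {
    "inst_level_strict_acc,none": 0.842,    # hypothetical harness output
    "prompt_level_strict_acc,none": 0.781,  # hypothetical harness output
}
ifeval_score = sum(ifeval_results.values()) / len(ifeval_results)

# MMLU-Pro: macro_avg averages the per-subtask averages (every subtask weighs
# equally), while micro_avg averages over all individual samples (larger
# subtasks weigh more), so the two numbers generally differ.
subtasks = {
    # subtask name: (correct samples, total samples) -- made-up values
    "math": (300, 500),
    "physics": (90, 100),
}
macro_avg = sum(c / t for c, t in subtasks.values()) / len(subtasks)
micro_avg = sum(c for c, _ in subtasks.values()) / sum(t for _, t in subtasks.values())

print(f"IFeval (reported): {ifeval_score:.3f}")
print(f"MMLU-Pro macro_avg: {macro_avg:.3f}  vs  micro_avg: {micro_avg:.3f}")
```

The gap between the toy `macro_avg` and `micro_avg` values is exactly why the reproduced MMLU-Pro number is compared against the `micro_avg` figures in eval_details.md rather than the website's `macro_avg`.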
