Commit ae10920

Update tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
Co-authored-by: Hamid Shojanazeri <[email protected]>
1 parent ef1f4c8

File tree

1 file changed: +1 -1 lines changed
  • tools/benchmarks/llm_eval_harness/meta_eval_reproduce

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ As Meta Llama models gain popularity, evaluating these models has become increas
 
 ### Differences between our evaluation and Hugging Face leaderboard evaluation
 
-There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
+There are 4 major differences in terms of the eval configurations and prompting methods between this implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
 
 - **Prompts**: We use Chain-of-Thought(COT) prompts while Hugging Face leaderboard does not. The prompts that define the output format are also different.
 - **Metric calculation**: For MMLU-Pro, BBH, GPQA tasks, we ask the model to generate response and score the parsed answer from generated response, while Hugging Face leaderboard evaluation is comparing log likelihood of all label words, such as [ (A),(B),(C),(D) ].
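For readers of this diff, the "Metric calculation" bullet above contrasts two scoring styles: ranking the log likelihoods of the candidate labels (the Hugging Face leaderboard approach) versus letting the model generate a Chain-of-Thought response and scoring the answer parsed from it (the approach this README describes). The sketch below is not taken from the repository; it is a minimal, self-contained Python illustration of that contrast, with hypothetical inputs (`choice_logprobs`, `generated_text`) standing in for what an actual lm-evaluation-harness run would produce.

```python
# Minimal sketch (not from the repo): two ways to score one multiple-choice item.
# The inputs are hypothetical stand-ins for real eval-harness outputs.
import re

# --- Style 1: log-likelihood ranking (Hugging Face leaderboard style) ---
# The model never generates text; we compare the log probability it assigns
# to each label continuation and pick the highest-scoring one.
choice_logprobs = {"(A)": -4.2, "(B)": -1.3, "(C)": -3.8, "(D)": -5.1}
loglikelihood_pred = max(choice_logprobs, key=choice_logprobs.get)

# --- Style 2: generate, then parse (the style this README describes) ---
# The model produces a Chain-of-Thought response; we extract the final choice
# with a regex and score the parsed answer against the gold label.
generated_text = (
    "Let's think step by step. Doubling the plate area doubles the "
    "capacitance, so the stored charge doubles. The answer is (B)."
)
match = re.search(r"answer is\s*\(([A-D])\)", generated_text, re.IGNORECASE)
generative_pred = f"({match.group(1).upper()})" if match else None

gold = "(B)"
print("log-likelihood pick:", loglikelihood_pred, loglikelihood_pred == gold)
print("generative pick:    ", generative_pred, generative_pred == gold)
```

Because the two styles can disagree on individual items, scores produced this way are not directly comparable to leaderboard numbers, which is the point the README passage is making.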

0 commit comments
