Commit 1450068

Update tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
Co-authored-by: Hamid Shojanazeri <[email protected]>
1 parent 08c739f commit 1450068

File tree

1 file changed: +1 -1 lines changed
  • tools/benchmarks/llm_eval_harness/meta_eval_reproduce

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ As Meta Llama models gain popularity, evaluating these models has become increas
 There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and the Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
 
 - **Prompts**: We use Chain-of-Thought (COT) prompts while the Hugging Face leaderboard does not. The prompts that define the output format are also different.
-- **Task type**: For MMLU-Pro, BBH, GPQA tasks, we ask the model to generate a response and score the parsed answer from the generated response, while the Hugging Face leaderboard evaluation compares the log likelihood of all label words, such as [ (A),(B),(C),(D) ].
+- **Metric calculation**: For MMLU-Pro, BBH, GPQA tasks, we ask the model to generate a response and score the parsed answer from the generated response, while the Hugging Face leaderboard evaluation compares the log likelihood of all label words, such as [ (A),(B),(C),(D) ].
 - **Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can differ between ours and the Hugging Face leaderboard evaluation, as our prompts that define the model output format are designed differently.
 - **Inference**: We use an internal LLM inference solution that loads PyTorch checkpoints and does not use padding, while the Hugging Face leaderboard uses Hugging Face format models and sometimes uses padding depending on the task type and batch size.
 - **Tasks**: We run benchmarks on BBH and MMLU-Pro only for pretrained models, and Math-Hard, IFeval, GPQA only for instruct models.
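
To make the "Metric calculation" and "Parsers" points above concrete, here is a minimal, illustrative sketch of the two scoring styles: parsing a final answer out of a generated response versus picking the label word with the highest log likelihood. The function names, the regex, and the "The best answer is" output format are assumptions made for illustration; they are not the actual code of this repo or of lm-evaluation-harness.

```python
import re

def score_generative(generated_text: str, gold_label: str) -> bool:
    """Generative scoring: parse the final answer out of free-form text, then compare.

    Assumes (hypothetically) that the COT prompt asks the model to end with
    'The best answer is X'.
    """
    match = re.search(r"best answer is \(?([A-D])\)?", generated_text)
    parsed = match.group(1) if match else None
    return parsed == gold_label

def score_loglikelihood(label_logprobs: dict, gold_label: str) -> bool:
    """Log-likelihood scoring: pick the label word the model assigns the highest
    log likelihood to, then compare.

    `label_logprobs` maps label words such as '(A)'..'(D)' to the model's
    log likelihood of that continuation given the prompt.
    """
    predicted = max(label_logprobs, key=label_logprobs.get)
    return predicted.strip("()") == gold_label

# Example usage with made-up numbers:
print(score_generative("... so The best answer is (C).", "C"))                          # True
print(score_loglikelihood({"(A)": -2.1, "(B)": -3.0, "(C)": -0.4, "(D)": -2.8}, "C"))    # True
```

Because the generative path depends on how the answer is extracted, the choice of parser (and of the prompt that fixes the output format) directly affects the reported score, which is why the two implementations can diverge even on the same task.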
