You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
+6-4Lines changed: 6 additions & 4 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ As Meta Llama models gain popularity, evaluating these models has become increas
6
6
## Important Notes
7
7
8
8
1.**This tutorial is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, and the implementation may differ slightly from our internal evaluation, leading to minor differences in the reproduced numbers.
9
-
2.**Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|`. It will not work with models that are not based on Llama 3.
9
+
2.**Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.
10
10
11
11
12
12
### Hugging Face setups
@@ -39,12 +39,12 @@ Here, we aim to reproduce the Meta reported benchmark numbers on the aforementio
39
39
40
40
There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
41
41
42
-
-**Prompts**: We use Chain-of-Thought(COT) prompts while Hugging Face leaderboard does not. The prompts that define the output format are also sometime different.
42
+
-**Prompts**: We use Chain-of-Thought(COT) prompts while Hugging Face leaderboard does not. The prompts that define the output format are also different.
43
43
-**Task type**: For MMLU-Pro, BBH, GPQA tasks, we ask the model to generate response and score the parsed answer from generated response, while Hugging Face leaderboard evaluation is comparing log likelihood of all label words, such as [ (A),(B),(C),(D) ].
44
-
-**Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can be different between ours and Hugging Face leaderboard evaluation, as our prompts that define the model output format are sometime designed differently.
44
+
-**Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can be different between ours and Hugging Face leaderboard evaluation, as our prompts that define the model output format are designed differently.
45
45
-**Inference**: We use internal LLM inference solution that loads pytorch checkpoints and do not use padding, while Hugging Face leaderboard uses Hugging Face format model and sometimes will use padding depending on the tasks type and batch size.
46
46
47
-
Given those differences, our reproduced number can not be apple to apple compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
47
+
Given those differences, our reproduced number can not be compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current reproduced result and the reported results for each sample, is different.
65
+
64
66
**Note**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
65
67
66
68
**2.Configure preprocessing, prompts and ground truth**
example["input_question"] =eval(example["input_question"].replace("null","None").replace("true","True").replace("false","False"))["dialog"][0]["body"].replace("Is it True that the first song","Is it true that the first song").replace("Is the following True","Is the following true")
@@ -81,15 +75,8 @@ def get_question(example):
81
75
except:
82
76
print(example["input_question"])
83
77
return
84
-
defcheck_sample(example):
85
-
if"kwargs"inexampleandnotexample["kwargs"]:
86
-
print(example)
87
-
raiseValueError("This example did not got joined for IFeval")
88
-
if"solution"inexampleandnotexample["solution"]:
89
-
print(example)
90
-
raiseValueError("This example did not got joined for MATH_hard")
91
-
92
78
79
+
# change the yaml file to use the correct model name
0 commit comments