tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
Lines changed: 19 additions & 18 deletions
@@ -1,17 +1,17 @@
1
1
2
2
# Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
3
3
4
-
As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This tutorial demonstrates how to reproduce metrics similar to our reported numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
4
+
As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face 🤗 collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This tutorial demonstrates how to reproduce metrics similar to our reported numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
5
5
6
6
## Important Notes
7
7
8
8
1. **This tutorial is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, and the implementation may differ slightly from our internal evaluation, leading to minor differences in the reproduced numbers.
9
9
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.
10
10
11
11
12
-
### Huggingface setups
12
+
### Hugging Face 🤗 setups
13
13
14
-
In order to install correct lm-evaluation-harness version, please check the [reproducibility section](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility) in Huggingface 🤗 Open LLM Leaderboard v2 About page. To add the VLLM dependency, we can do following:
14
+
To install the correct lm-evaluation-harness version, please check the [reproducibility section](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility) on the Hugging Face 🤗 Open LLM Leaderboard v2 About page. To add the vLLM dependency, we can do the following:
To access our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
24
-
- Log in to the Huggingface website and click the 3.1 evals dataset pages and agree to the terms.
25
-
- Follow the [Huggingface authentication instructions](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) to gain read access for your machine.
23
+
To access our [3.1 evals Hugging Face 🤗 collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
24
+
- Log in to the Hugging Face 🤗 website, open the 3.1 evals dataset pages, and agree to the terms.
25
+
- Follow the [Hugging Face 🤗 authentication instructions](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) to gain read access for your machine.
26
26
27
-
It is recommended to read the dataset card to understand the meaning of each column and use the viewer feature in the Huggingface dataset to view our dataset. It is important to have some basic understanding of our dataset format and content before proceeding.
27
+
It is recommended to read the dataset card to understand the meaning of each column and use the viewer feature in the Hugging Face 🤗 dataset to view our dataset. It is important to have some basic understanding of our dataset format and content before proceeding.
28
28
29
29
### Task Selection
30
30
31
-
Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on tasks that overlap with the popular Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) as shown in the following:
31
+
Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on tasks that overlap with the popular Hugging Face 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) as shown in the following:
32
32
33
33
- **Tasks for pretrained models**: BBH and MMLU-Pro
34
34
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
35
35
36
-
Here, we aim to reproduce the Meta reported benchmark numbers on the aforementioned tasks using Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and reproduce our reported metrics.
36
+
Here, we aim to reproduce the Meta reported benchmark numbers on the aforementioned tasks using Hugging Face 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and reproduce our reported metrics.
37
37
38
-
### Differences between our evaluation and Huggingface leaderboard evaluation
38
+
### Differences between our evaluation and Hugging Face 🤗 leaderboard evaluation
39
39
40
-
There are 3 major differences in terms of the eval configurations and prompts between this tutorial implementation and Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
40
+
There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and the Hugging Face 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
41
41
42
-
-**Prompts**: We use Chain-of-Thought(COT) prompts while Huggingface leaderboard does not,
43
-
-**Task type**: For MMLU-Pro, BBH, GPQA tasks, we ask the model to generate response and score the parsed answer from generated response, while Huggingface leaderboard evaluation is comparing log likelihood of all label words, such as [ (A),(B),(C),(D) ].
44
-
-**Inference**: We use internal LLM inference solution that loads pytorch checkpoints and do not use padding, while Huggingface leaderboard uses Huggingface format model and sometimes will use padding depending on the tasks type and batch size.
42
+
- **Prompts**: We use Chain-of-Thought (CoT) prompts while the Hugging Face 🤗 leaderboard does not. The prompts that define the output format are also sometimes different.
43
+
- **Task type**: For the MMLU-Pro, BBH, and GPQA tasks, we ask the model to generate a response and score the answer parsed from that response, while the Hugging Face 🤗 leaderboard evaluation compares the log likelihood of all label words, such as [ (A),(B),(C),(D) ].
44
+
- **Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can differ between our evaluation and the Hugging Face 🤗 leaderboard evaluation, as our prompts that define the model output format are sometimes designed differently.
45
+
- **Inference**: We use an internal LLM inference solution that loads PyTorch checkpoints and does not use padding, while the Hugging Face 🤗 leaderboard uses Hugging Face 🤗 format models and sometimes uses padding, depending on the task type and batch size.
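
The task-type difference above can be illustrated with a minimal Python sketch. It is not the code used by either evaluation: the function names and the `answer is (X)` pattern are hypothetical, chosen only to contrast picking the most likely label with parsing a generated response.

```python
import re


def score_by_loglikelihood(label_logprobs: dict, gold: str) -> bool:
    """Log-likelihood style: choose the label word with the highest score.

    `label_logprobs` maps each label (e.g. "A") to the model's log likelihood
    for that label; the values are assumed to be computed elsewhere.
    """
    predicted = max(label_logprobs, key=label_logprobs.get)
    return predicted == gold


def score_by_generation(response: str, gold: str) -> bool:
    """Generative style: parse the final answer out of free-form text.

    The pattern below is a hypothetical output format, not the real prompt
    contract; the last match is used so that earlier reasoning is ignored.
    """
    matches = re.findall(r"answer is \(?([A-D])\)?", response)
    return bool(matches) and matches[-1] == gold


if __name__ == "__main__":
    print(score_by_loglikelihood({"A": -2.1, "B": -0.3, "C": -4.0, "D": -3.2}, "B"))
    print(score_by_generation("Let's think step by step. So the answer is (B).", "B"))
```

Because the generative path depends on the parser matching the output format requested in the prompt, a parser written for one prompt style can silently score correct answers as wrong under another.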
45
46
46
-
Given those differences, our reproduced number can not be apple to apple compared to the numbers in the Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
47
+
Given those differences, our reproduced numbers cannot be compared apples-to-apples with the numbers in the Hugging Face 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
47
48
48
49
### Create task config
49
50
@@ -125,7 +126,7 @@ Once we have the yaml created, we can run the tasks using `lm-eval` CLI and use
125
126
126
127
**Padding**
127
128
128
-
By default, for the generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command will use Huggingface inference solution that uses a static batch method with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using EOS_token for Llama models. While our internal evaluation will load python original checkpoints and handle individual generation request asynchronously without any padding. To simulate this, we will use VLLM inference solution to do dynamic batching without any padding.
129
+
By default, for generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command uses the Hugging Face 🤗 inference solution, which applies static batching with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using the EOS_token for Llama models. In contrast, our internal evaluation loads the original PyTorch checkpoints and handles individual generation requests asynchronously without any padding. To simulate this, we will use the vLLM inference solution to do dynamic batching without any padding.
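
As a rough illustration of what static batching with left padding does to the inputs, here is a small Python sketch. The helper name and the token ids are hypothetical; it is not the lm-evaluation-harness implementation, only a picture of why padded and unpadded runs feed the model different sequences.

```python
def left_pad_batch(token_id_seqs, pad_id):
    """Left-pad every sequence to the longest length in the batch.

    Returns the padded sequences and an attention mask (1 = real token,
    0 = padding). `pad_id` stands in for the EOS token id mentioned above.
    """
    max_len = max(len(seq) for seq in token_id_seqs)
    padded = [[pad_id] * (max_len - len(seq)) + list(seq) for seq in token_id_seqs]
    mask = [[0] * (max_len - len(seq)) + [1] * len(seq) for seq in token_id_seqs]
    return padded, mask


if __name__ == "__main__":
    # Hypothetical token ids; 0 plays the role of the padding (EOS) token.
    batch, attention_mask = left_pad_batch([[5, 6, 7], [8]], pad_id=0)
    print(batch)
    print(attention_mask)
```

With request-level or dynamic batching, each prompt is processed at its own length, so no padding tokens of this kind are prepended.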
129
130
130
131
**NOTE**: Since our prompts in the evals dataset already include all the special tokens required by instruct models, such as `<|start_header_id|>user<|end_header_id|>`, we will no longer use the `--apply_chat_template` argument for instruct models. However, we need to use the `add_bos_token=True` flag to add the BOS_token back during vLLM inference, as the BOS_token is removed by default in [this PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465).
131
132
@@ -153,7 +154,7 @@ Moreover, we have modified this [math_hard/utils.py](./meta_template/math_hard/u
153
154
154
155
1. This Python script only uses [a regular expression "Final Answer: The final answer is(.*?). I hope it is correct."](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/math/utils.py#L192) to get the final answer, because this format is shown in the previous 4-shot example prompts. However, our MATH Hard task uses 0-shot CoT prompts that ask the model to put the final answer into the string format `Therefore, the final answer is: $\\boxed{answer}$. I hope it is correct.`, which cannot be captured by the previous regular expression, so we will use `\\boxed{}` to parse the final answer instead.
155
156
156
-
2. The [is_equiv(x1: str, x2: str)](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/math/utils.py#L144) function failed parse 78 ground truth, as we noticed some error logs like `[utils.py:158] couldn't parse one of [0,1) or [0,1)`, so all those questions will be marked as wrong. We will raise a issue about this problem and will add a string equality check statement before going to is_equiv() function for now as a temporal solution.
157
+
2. The [is_equiv(x1: str, x2: str)](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/math/utils.py#L144) function failed to parse some ground truths, as we noticed error logs like `[utils.py:158] couldn't parse one of [0,1) or [0,1)`, so all those questions will be marked as wrong. We raised [an issue with lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/issues/2212) about this problem and, for now, add a string equality check before calling the is_equiv() function as a temporary solution.
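
The two changes described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the contents of the modified `math_hard/utils.py`: the helper names `last_boxed_content` and `answers_match` are hypothetical, and `is_equiv` is passed in as a parameter rather than imported.

```python
from typing import Callable, Optional


def last_boxed_content(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so nested groups such as \\frac{1}{2}
    are kept intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces


def answers_match(pred: str, gold: str, is_equiv: Callable[[str, str], bool]) -> bool:
    """Check plain string equality first, then fall back to `is_equiv`.

    The early return covers ground truths such as "[0,1)" that a symbolic
    parser may fail to handle.
    """
    if pred.strip() == gold.strip():
        return True
    return is_equiv(pred, gold)


if __name__ == "__main__":
    response = r"Therefore, the final answer is: $\boxed{\frac{1}{2}}$. I hope it is correct."
    print(last_boxed_content(response))
```

Checking string equality first means an exact textual match is accepted even when the symbolic comparison cannot parse either side.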
157
158
158
159
159
160
**NOTE**: For `meta_ifeval` tasks, we have to use the original configs, such as `instruction_id_list` and `kwargs`, from [wis-k/instruction-following-eval](https://huggingface.co/datasets/wis-k/instruction-following-eval) in order to use the [lm-evaluation-harness IFeval evaluation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard/ifeval). We will perform a similar join-back using the `get_ifeval_data` function in [prepare_meta_eval.py](./prepare_meta_eval.py) to get a local parquet dataset file.
@@ -185,4 +186,4 @@ or it is expected as stated in [this comment](https://github.com/vllm-project/vl
185
186
186
187
## Acknowledgement
187
188
188
-
This tutorial is inspired by [leaderboard tasks implementation on the lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard) created by Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) team.
189
+
This tutorial is inspired by the [leaderboard tasks implementation in lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard) created by the Hugging Face 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) team.