
Commit 091f71e

now uses lm_eval cli instead
1 parent fe2b9f0 commit 091f71e

5 files changed: +215 additions, -281 deletions

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 37 additions & 14 deletions
@@ -28,18 +28,33 @@ It is recommended to read the dataset card to understand the meaning of each col

 ### Task Selection

-Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), this tutorial will focus on tasks that overlap with the popular Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), such as BBH and MMLU-Pro for pretrained models, and Math-Hard, IFeval, GPQA, and MMLU-Pro for instruct models. This tutorial serves as an example to demonstrate the reproduction process. The implementation will be based on the Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard) and make necessary modifications to use our eval prompts and reproduce our reported metrics in [Meta Llama website](https://llama.meta.com/).
+Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on the tasks that overlap with the popular Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), as listed below:

+- **Tasks for pretrained models**: BBH and MMLU-Pro
+- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro

-**NOTE**: There are many differences in terms of the eval configurations and prompts between this tutorial implementation and Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). For example, we use Chain-of-Thought(COT) prompts while Huggingface leaderboard does not, so the result numbers can not be apple to apple compared.
+Here, we aim to reproduce the Meta-reported benchmark numbers on the aforementioned tasks using the Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make the necessary modifications to use our eval prompts and reproduce our reported metrics.

-### Create task yaml
+### Differences between our evaluation and the Huggingface leaderboard evaluation
+
+There are 3 major differences in eval configurations and prompts between this tutorial implementation and the Huggingface 🤗 [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard):
+
+- **Prompts**: We use Chain-of-Thought (CoT) prompts, while the Huggingface leaderboard does not.
+- **Task type**: For the MMLU-Pro, BBH and GPQA tasks, we ask the model to generate a response and score the answer parsed from that response, while the Huggingface leaderboard evaluation compares the log likelihood of all label words, such as [(A), (B), (C), (D)].
+- **Inference**: We use an internal LLM inference solution that loads PyTorch checkpoints and does not use padding, while the Huggingface leaderboard uses Huggingface-format models and may use padding depending on the task type and batch size.
+
+Given those differences, our reproduced numbers cannot be compared apples to apples with the numbers in the Huggingface 🤗 [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
+
+### Create task config

 In order to use lm-evaluation-harness, we need to follow the lm-evaluation-harness [new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to create a yaml file. We will use MMLU-Pro as an example to show the steps with detailed explanations:

 **1. Define the config to load datasets**

-We can use our 3.1 evals dataset as the source dataset and the corresponding subset and define the test split to latest. For example, if we want to reproduce the MMLU_Pro metric for 3.1 8B instruct, we should write the following yaml sections in the yaml:
+We can use our 3.1 evals dataset as the source dataset with the corresponding subset, and set the test split to `latest`. For example, if we want to reproduce the MMLU_Pro metric for 3.1 8B instruct, the following configs are needed, as explained below:
+
+**NOTE**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under the [meta_template folder](./meta_template/).

 ```yaml
 task: meta_mmlu_pro_instruct
@@ -48,10 +63,9 @@ dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
 test_split: latest
 ```
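
The `dataset_path` line of this block is not visible in these hunks. For orientation, here is a hedged sketch of the complete dataset-loading section; the `dataset_path` value is an assumption based on the evals dataset named in `eval_config.yaml`, not copied from the file.

```yaml
# Illustrative sketch; the shipped config under the meta_template folder is authoritative.
task: meta_mmlu_pro_instruct
dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals   # assumed Hugging Face hub path of the 3.1 evals dataset
dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
test_split: latest
```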

+**Note**: Remember to change the eval dataset name according to the model type, and DO NOT use a pretrained evals dataset on instruct models or vice versa.

-**Note**:Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
-
-**2.Define the config for preprocessing, prompts and ground truth**
+**2. Configure preprocessing, prompts and ground truth**

 Here is the example yaml snippet in MMLU-Pro that handles dataset preprocessing, prompts and ground truth.
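
The snippet itself is not shown in this hunk. As a reference, here is a minimal sketch of what such a preprocessing, prompt and ground-truth section typically looks like in an lm-evaluation-harness task yaml; the `utils` helper names are illustrative, not the actual MMLU-Pro implementation.

```yaml
# Illustrative sketch; see the real config and helpers under the meta_template folder.
process_docs: !function utils.process_docs   # hypothetical hook that cleans each doc and builds the gold field
doc_to_text: !function utils.doc_to_text     # renders the already-templated eval prompt for each example
doc_to_target: gold                          # ground truth, derived from input_correct_responses[0]
```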
@@ -67,7 +81,7 @@ doc_to_target: gold

 - `doc_to_target` defines the ground truth, which in the MMLU-Pro case is the `gold` field derived from input_correct_responses[0].

-**3.Define task type and parser**
+**3. Configure task type and parser**

 While Open LLM Leaderboard v2 uses the [multiple choice format](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#multiple-choice-format) for the MMLU-Pro, BBH and GPQA tasks, comparing the log likelihood of all label words such as [(A), (B), (C), (D)], we use the generative task option: we ask the model to generate a response in sentences given our carefully designed prompts, then use a regex parser to extract the final answer and score it against the ground truth. Here is an example config in MMLU-Pro that enables the generative task and defines the regex parser:
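
The config itself is not shown in this hunk. Below is a minimal, illustrative sketch of a generative-task section with a regex parser in lm-evaluation-harness yaml; the stop sequences, generation budget and regex pattern are assumptions, not the real `meta_mmlu_pro_instruct` values.

```yaml
# Illustrative sketch; the shipped meta_template config is authoritative.
output_type: generate_until            # generative task: score a parsed answer instead of label log likelihoods
generation_kwargs:
  until: []                            # stop sequences (assumed)
  do_sample: false
  temperature: 0
  max_gen_toks: 1024                   # assumed generation budget
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: 'best answer is ([A-Z])'   # hypothetical parser that grabs the final answer letter
      - function: "take_first"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```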

@@ -115,15 +129,23 @@ By default, for the generative tasks, the `lm-eval --model_args="{...}" --batch_

 **NOTE**: Since our prompts in the evals dataset have already included all the special tokens required by the instruct model, such as `<|start_header_id|>user<|end_header_id|>`, we will not use the `--apply_chat_template` argument for instruct models anymore. However, we need to use the `add_bos_token=True` flag to add the BOS token back during VLLM inference, as the BOS token is removed by default in [this PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465).

-We create [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. Remember to adjust the `tensor_parallel_size` to 2 or more to load the 70B models and change the `data_parallel_size` accordingly so that `tensor_parallel_size X data_parallel_size` is the number of GPUs. Please read the comments inside this yaml for detailed explanations on other parameters. Then we can run a [meta_eval.py](meta_eval.py) that reads the configuration from [eval_config.yaml](./eval_config.yaml), copies everything in the template folder to a working folder `work_dir`, makes modification to those templates accordingly, prepares dataset if needed, run specified tasks and save the eval results to default `eval_results` folder.
+We create [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. Remember to adjust the `tensor_parallel_size` to 2 or more to load the 70B models, and change the `data_parallel_size` accordingly so that `tensor_parallel_size x data_parallel_size` equals the number of GPUs. Please read the comments inside this yaml for detailed explanations of the other parameters. Then we can run [prepare_meta_eval.py](./prepare_meta_eval.py), which reads the configuration from [eval_config.yaml](./eval_config.yaml), copies everything in the template folder to a working folder `work_dir`, modifies those templates accordingly, prepares the dataset if needed, and prints out the CLI command to run with `lm_eval`.
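
As a reference for the `tensor_parallel_size` note above, here is a hedged sketch of how the relevant `eval_config.yaml` fields might look for a 70B model on a single 8-GPU node (2 x 4 = 8); the parallelism key names are assumed to sit in this config alongside the documented `model_name` and `evals_dataset` fields.

```yaml
# Illustrative sketch; the comments inside eval_config.yaml are the authoritative parameter list.
model_name: "meta-llama/Meta-Llama-3.1-70B-Instruct"
evals_dataset: "meta-llama/Meta-Llama-3.1-70B-Instruct-evals"
tensor_parallel_size: 2   # a 70B model needs 2 or more GPUs per model replica
data_parallel_size: 4     # tensor_parallel_size x data_parallel_size = 8 GPUs in total
```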
+
+To run [prepare_meta_eval.py](./prepare_meta_eval.py), we can do:
+
+```
+python prepare_meta_eval.py --config_path ./eval_config.yaml
+```
+
+This will load the default [eval_config.yaml](./eval_config.yaml) config and print out the CLI command to run the `meta_instruct` task group, which includes the `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct` tasks, for the `meta-llama/Meta-Llama-3.1-8B-Instruct` model with the `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset, using `lm_eval`.

-To run the [meta_eval.py](meta_eval.py), we can do:
+An example output from [prepare_meta_eval.py](./prepare_meta_eval.py) looks like this:

 ```
-python meta_eval.py --config_path ./eval_config.yaml
+lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
 ```

-This will load the default [eval_config.yaml](./eval_config.yaml) config and run a `meta_instruct` group tasks that includes `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct` tasks for `meta-llama/Meta-Llama-3.1-8B-Instruct` model using `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset.
+Then copy this command back to your terminal and run it to get our reproduced results, saved to the `eval_results` folder by default.

 **NOTE**: For the `meta_math_hard` tasks, some of our internal math ground truth has been converted to scientific notation, e.g. `6\sqrt{7}` has been converted to `1.59e+1`, which will later be handled by our internal math evaluation functions. As the lm-evaluation-harness [math evaluation utils.py](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/math/utils.py) cannot fully handle those conversions, we will use the original ground truth from the original dataset [lighteval/MATH-Hard](https://huggingface.co/datasets/lighteval/MATH-Hard) by joining the tables on the input questions. The `get_math_data` function in [prepare_datasets.py](./prepare_dataset.py) handles this step and produces a local parquet dataset file.

@@ -143,7 +165,7 @@ Here is the comparison between our reported numbers and the reproduced numbers i
 | Model | MATH_HARD | GPQA_RAW | MMLU_PRO_RAW | IFeval |
 |------------------------------|-----------|----------|--------------|---------|
 | 3.1 8B-Instruct reported | 0.254 | 0.328 | 0.47 | 0.804 |
-| 3.1 8B-Instruct reproduced | 0.2417 | 0.3125 | 0.4675 | 0.7782 |
+| 3.1 8B-Instruct reproduced | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
 | 3.1 70B-Instruct reported | 0.438 | 0.467 | 0.651 | 0.875 |
 | 3.1 70B-Instruct reproduced | 0.4388 | 0.4799 | 0.6475 | 0.848 |

@@ -154,8 +176,9 @@ Here is the comparison between our reported numbers and the reproduced numbers i
 | 3.1 70B reported | 0.816 | 0.52 |
 | 3.1 70B reproduced | 0.8191 | 0.5225 |

-From the table above, we can see that most of our reported results are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).
+From the table above, we can see that most of our reproduced results are very close to our reported numbers in the [Meta Llama website](https://llama.meta.com/).

+**NOTE**: On the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric for the `MMLU-Pro` task, which is the average of all subtasks' average scores, but here we are reproducing the `micro_avg` metric, which is the average score over all individual samples. Those `micro_avg` numbers can be found in [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).

 **NOTE**: The reproduced numbers may be slightly different, as we observed around ±0.01 differences between reproduction runs, because the latest VLLM inference is not fully deterministic even with temperature=0. This behavior may be related to [this issue](https://github.com/vllm-project/vllm/issues/5404), or it may be expected, as stated in [this comment](https://github.com/vllm-project/vllm/issues/4112#issuecomment-2071115725).

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/eval_config.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-model_name: "meta-llama/Meta-Llama-3.1-Instruct-8B" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
+model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub.

 evals_dataset: "meta-llama/Meta-Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Meta Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
 # Must be one of the following ["meta-llama/Meta-Llama-3.1-8B-Instruct-evals","meta-llama/Meta-Llama-3.1-70B-Instruct-evals","meta-llama/Meta-Llama-3.1-405B-Instruct-evals","meta-llama/Meta-Llama-3.1-8B-evals","meta-llama/Meta-Llama-3.1-70B-evals","meta-llama/Meta-Llama-3.1-405B-evals"]
@@ -19,7 +19,7 @@ batch_size: "auto" # Batch size, can be 'auto', 'auto:N', or an integer. It is s
 output_path: "eval_results" # the output folder to store all the eval results and samples.

 #limit: 12 # Limit number of examples per task, set 'null' to run all.
-limit: null # Limit number of examples per task.
+limit: null # Limit number of examples per task, set 'null' to run all.

 verbosity: "INFO" #Logging level: CRITICAL, ERROR, WARNING, INFO, DEBUG.

 (0)