Commit e1b7bc7

remove result section and change meta-llama 3.1 to llama 3.1
1 parent b013d27 commit e1b7bc7

File tree

7 files changed (+38, -58 lines)

tools/benchmarks/llm_eval_harness/meta_eval/README.md

Lines changed: 12 additions & 32 deletions
@@ -1,13 +1,13 @@

# Calculating Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness

-As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to calculate the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
+As Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to calculate the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.

## Disclaimer


-1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Meta Llama evaluation, this may lead to minor differences in the produced numbers.
-2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.
+1. **This recipe is not the official implementation** of Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Llama evaluation, this may lead to minor differences in the produced numbers.
+2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.

## Insights from Our Evaluation Process

@@ -55,10 +55,10 @@ Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugg

1. We created [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. This is the main config file you need to change if you want to eval other models, and a part of eval_config.yaml looks like this:

```yaml
-model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
+model_name: "meta-llama/Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Llama 3 based model name in the HuggingFace model hub."

-evals_dataset: "meta-llama/Meta-Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Meta Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
-# Must be one of the following ["meta-llama/Meta-Llama-3.1-8B-Instruct-evals","meta-llama/Meta-Llama-3.1-70B-Instruct-evals","meta-llama/Meta-Llama-3.1-405B-Instruct-evals","meta-llama/Meta-Llama-3.1-8B-evals","meta-llama/Meta-Llama-3.1-70B-evals","meta-llama/Meta-Llama-3.1-405B-evals"]
+evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
+# Must be one of the following ["meta-llama/Llama-3.1-8B-Instruct-evals","meta-llama/Llama-3.1-70B-Instruct-evals","meta-llama/Llama-3.1-405B-Instruct-evals","meta-llama/Llama-3.1-8B-evals","meta-llama/Llama-3.1-70B-evals","meta-llama/Llama-3.1-405B-evals"]

tasks: "meta_instruct" # Available tasks for instruct model: "meta_math_hard", "meta_gpqa", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
# Available tasks for pretrain model: "meta_bbh", "meta_mmlu_pro_pretrain"; or just use "meta_pretrain" to run all of them.
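
For readers following along outside the diff: eval_config.yaml is plain YAML, so a minimal loading sketch looks like the following (assuming PyYAML is installed; the repo's own `load_config` helper in prepare_meta_eval.py, diffed further down, is the authoritative version and may handle arguments differently).

```python
# Minimal sketch of reading eval_config.yaml; the helper name and error handling
# here are illustrative, not copied from the repo.
import yaml  # PyYAML

def load_eval_config(config_path: str = "./eval_config.yaml") -> dict:
    with open(config_path, "r") as f:
        return yaml.safe_load(f)

config = load_eval_config()
print(config["model_name"])     # e.g. "meta-llama/Llama-3.1-8B-Instruct"
print(config["evals_dataset"])  # e.g. "meta-llama/Llama-3.1-8B-Instruct-evals"
print(config["tasks"])          # e.g. "meta_instruct"
```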
@@ -83,12 +83,12 @@ data_parallel_size: 4 # The VLLM argument that speicify the data parallel size f
python prepare_meta_eval.py --config_path ./eval_config.yaml
```

-This script will load the default [eval_config.yaml](./eval_config.yaml) config and print out a `lm_eval` command to run `meta_instruct` group tasks, which includes `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct`, for `meta-llama/Meta-Llama-3.1-8B-Instruct` model using `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset.
+This script will load the default [eval_config.yaml](./eval_config.yaml) config and print out a `lm_eval` command to run `meta_instruct` group tasks, which includes `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct`, for `meta-llama/Llama-3.1-8B-Instruct` model using `meta-llama/Llama-3.1-8B-Instruct-evals` dataset.

An example output from [prepare_meta_eval.py](./prepare_meta_eval.py) looks like this:

```
-lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
+lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
```

4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get the result, which will be saved into `eval_results` folder by default.
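
To make the mapping from config values to the printed command concrete, here is a hedged sketch of how such a string could be assembled. Only `model_name`, `evals_dataset`, `tasks` and `data_parallel_size` are config keys documented above; the remaining vLLM arguments are copied from the example output, and the real assembly logic in prepare_meta_eval.py may differ.

```python
# Illustrative only: rebuild an lm_eval command in the shape of the example
# above from a loaded eval_config.yaml dict. Not the repo's actual code path.
def build_lm_eval_command(config: dict) -> str:
    model_args = (
        f"pretrained={config['model_name']},tensor_parallel_size=1,dtype=auto,"
        f"gpu_memory_utilization=0.9,data_parallel_size={config['data_parallel_size']},"
        "max_model_len=8192,add_bos_token=True,seed=42"
    )
    return (
        f"lm_eval --model vllm --model_args {model_args} "
        f"--tasks {config['tasks']} --batch_size auto "
        "--output_path eval_results --include_path ./work_dir --seed 42 --log_samples"
    )

# With the default config this reproduces the example command shown above.
```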
@@ -119,14 +119,14 @@ We can use our 3.1 evals dataset as the source dataset and the corresponding sub

```yaml
task: meta_mmlu_pro_instruct
-dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
-dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
+dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
+dataset_name: Llama-3.1-8B-Instruct-evals__mmlu_pro__details
test_split: latest
```

If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current result and the reported results for each sample, is different.

-**Note**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
+**Note**: Config files for Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.

**2.Configure preprocessing, prompts and ground truth**
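
The `dataset_path`, `dataset_name` and `test_split` fields in the task config shown above are standard Hugging Face `datasets` arguments, so the source data, including the `is_correct` column mentioned above, can be inspected directly. A minimal sketch, assuming the `datasets` library and access to the gated meta-llama evals repositories:

```python
# Minimal sketch: load the evals dataset referenced by the task config above.
# Requires `pip install datasets` and Hugging Face access to the gated repo.
from datasets import load_dataset

ds = load_dataset(
    "meta-llama/Llama-3.1-8B-Instruct-evals",               # dataset_path
    name="Llama-3.1-8B-Instruct-evals__mmlu_pro__details",  # dataset_name
    split="latest",                                         # test_split
)
print(ds.column_names)      # expect per-sample fields such as is_correct
print(ds[0]["is_correct"])  # reported correctness for this sample, useful for diffing runs
```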

@@ -180,29 +180,9 @@ Here we set the `num_fewshot` to 0 as our prompts have already been converted to

**NOTE**: While we tried our best to create the template files, those configs and functions are created based on public third-party library and are not exactly the same as our internal implementation, so there is a chance that the eval numbers are slightly different.

-## Results
-
-Here is the comparison between our reported numbers and the eval numbers in this tutorial:
-
-| Model | MATH_HARD | GPQA_RAW | MMLU_PRO_RAW | IFeval |
-|------------------------------|-----------|----------|--------------|---------|
-| 3.1 8B-Instruct(reported) | 0.254 | 0.328 | 0.47 | 0.804 |
-| 3.1 8B-Instruct(this) | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
-| 3.1 70B-Instruct(reported) | 0.438 | 0.467 | 0.651 | 0.875 |
-| 3.1 70B-Instruct(this) | 0.4388 | 0.4799 | 0.6475 | 0.848 |
-
-| Model | BBH_RAW | MMLU_PRO_RAW |
-|------------------------|---------|--------------|
-| 3.1 8B(reported) | 0.642 | 0.356 |
-| 3.1 8B(this) | 0.6515 | 0.3572 |
-| 3.1 70B(reported) | 0.816 | 0.52 |
-| 3.1 70B(this) | 0.8191 | 0.5225 |
-
-From the table above, we can see that most of our results calculated from this recipe are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).
-
**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval` as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters).

-**NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are calculating the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).
+**NOTE**: In the [Llama website](https://llama.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are calculating the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).

**NOTE**: The eval numbers may be slightly different, as we observed around ±0.01 differences between each evaluation run because the latest VLLM inference is not very deterministic even with temperature=0. This behavior maybe related [this issue](https://github.com/vllm-project/vllm/issues/5404).
or it is expected due to 16-bits inference as stated in [this comment](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535) and [this comment](https://github.com/vllm-project/vllm/issues/4112#issuecomment-2071115725).
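
With the Results tables removed by this commit, the two NOTEs above are the remaining guidance on post-processing the numbers; a small sketch of both calculations follows. The IFeval metric keys are the ones quoted in the README, while the accuracy values and the MMLU-Pro subtask counts are placeholders, not reported numbers.

```python
# Sketch of the two NOTEs above; all numeric values below are placeholders.

# 1) IFeval: the final number is the average of the two strict-accuracy metrics
#    produced by lm-evaluation-harness.
ifeval_metrics = {
    "inst_level_strict_acc,none": 0.80,
    "prompt_level_strict_acc,none": 0.74,
}
ifeval_final = (
    ifeval_metrics["inst_level_strict_acc,none"]
    + ifeval_metrics["prompt_level_strict_acc,none"]
) / 2

# 2) MMLU-Pro: macro_avg averages the per-subtask accuracies, while micro_avg
#    averages over all individual samples, so larger subtasks weigh more.
subtask_counts = {  # subtask -> (num_correct, num_samples), illustrative only
    "biology": (60, 100),
    "law": (90, 300),
}
macro_avg = sum(c / n for c, n in subtask_counts.values()) / len(subtask_counts)
micro_avg = sum(c for c, _ in subtask_counts.values()) / sum(n for _, n in subtask_counts.values())
print(ifeval_final, macro_avg, micro_avg)
```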

tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
-model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
+model_name: "meta-llama/Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."

-evals_dataset: "meta-llama/Meta-Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Meta Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
-# Must be one of the following ["meta-llama/Meta-Llama-3.1-8B-Instruct-evals","meta-llama/Meta-Llama-3.1-70B-Instruct-evals","meta-llama/Meta-Llama-3.1-405B-Instruct-evals","meta-llama/Meta-Llama-3.1-8B-evals","meta-llama/Meta-Llama-3.1-70B-evals","meta-llama/Meta-Llama-3.1-405B-evals"]
+evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Meta Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
+# Must be one of the following ["meta-llama/Llama-3.1-8B-Instruct-evals","meta-llama/Llama-3.1-70B-Instruct-evals","meta-llama/Llama-3.1-405B-Instruct-evals","meta-llama/Llama-3.1-8B-evals","meta-llama/Llama-3.1-70B-evals","meta-llama/Llama-3.1-405B-evals"]

tasks: "meta_instruct" # Available tasks for instruct model: "meta_math_hard", "meta_gpqa", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
# Available tasks for pretrain model: "meta_bbh", "meta_mmlu_pro_pretrain"; or just use "meta_pretrain" to run all of them.

tools/benchmarks/llm_eval_harness/meta_eval/meta_template/bbh/bbh_3shot_cot.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-dataset_path: meta-llama/Meta-Llama-3.1-8B-evals
-dataset_name: Meta-Llama-3.1-8B-evals__bbh__details
+dataset_path: meta-llama/Llama-3.1-8B-evals
+dataset_name: Llama-3.1-8B-evals__bbh__details
task: meta_bbh
output_type: generate_until
process_docs: !function utils.process_docs

tools/benchmarks/llm_eval_harness/meta_eval/meta_template/gpqa_cot/gpqa_0shot_cot.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
-dataset_name: Meta-Llama-3.1-8B-Instruct-evals__gpqa__details
+dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
+dataset_name: Llama-3.1-8B-Instruct-evals__gpqa__details
task: meta_gpqa
output_type: generate_until
process_docs: !function utils.process_docs

tools/benchmarks/llm_eval_harness/meta_eval/meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
task: meta_mmlu_pro_instruct
-dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
-dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
+dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
+dataset_name: Llama-3.1-8B-Instruct-evals__mmlu_pro__details
test_split: latest
output_type: generate_until
process_docs: !function utils.process_docs

tools/benchmarks/llm_eval_harness/meta_eval/meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
task: meta_mmlu_pro_pretrain
-dataset_path: meta-llama/Meta-Llama-3.1-8B-evals
-dataset_name: Meta-Llama-3.1-8B-evals__mmlu_pro__details
+dataset_path: meta-llama/Llama-3.1-8B-evals
+dataset_name: Llama-3.1-8B-evals__mmlu_pro__details
test_split: latest
output_type: generate_until
process_docs: !function utils.process_docs

tools/benchmarks/llm_eval_harness/meta_eval/prepare_meta_eval.py

Lines changed: 15 additions & 15 deletions
@@ -16,12 +16,12 @@
def get_ifeval_data(model_name, output_dir):
    print(f"preparing the ifeval data using {model_name}'s evals dataset")
    if model_name not in [
-        "Meta-Llama-3.1-8B-Instruct",
-        "Meta-Llama-3.1-70B-Instruct",
-        "Meta-Llama-3.1-405B-Instruct",
+        "Llama-3.1-8B-Instruct",
+        "Llama-3.1-70B-Instruct",
+        "Llama-3.1-405B-Instruct",
    ]:
        raise ValueError(
-            "Only Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-405B-Instruct models are supported for IFEval"
+            "Only Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct models are supported for IFEval"
        )
    original_dataset_name = "wis-k/instruction-following-eval"
    meta_dataset_name = f"meta-llama/{model_name}-evals"
@@ -59,12 +59,12 @@ def get_ifeval_data(model_name, output_dir):
def get_math_data(model_name, output_dir):
    print(f"preparing the math data using {model_name}'s evals dataset")
    if model_name not in [
-        "Meta-Llama-3.1-8B-Instruct",
-        "Meta-Llama-3.1-70B-Instruct",
-        "Meta-Llama-3.1-405B-Instruct",
+        "Llama-3.1-8B-Instruct",
+        "Llama-3.1-70B-Instruct",
+        "Llama-3.1-405B-Instruct",
    ]:
        raise ValueError(
-            "Only Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-405B-Instruct models are supported for MATH_hard"
+            "Only Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct models are supported for MATH_hard"
        )
    original_dataset_name = "lighteval/MATH-Hard"
    meta_dataset_name = f"meta-llama/{model_name}-evals"
@@ -130,7 +130,7 @@ def change_yaml(args, base_name):
    with open(output_path, "w") as output:
        for line in lines:
            output.write(
-                line.replace("Meta-Llama-3.1-8B", base_name).replace(
+                line.replace("Llama-3.1-8B", base_name).replace(
                    "WORK_DIR", str(yaml_dir)
                )
            )
@@ -208,12 +208,12 @@ def load_config(config_path: str = "./config.yaml"):
    if not os.path.exists(args.template_dir):
        raise ValueError("The template_dir does not exist, please check the path")
    if args.evals_dataset not in [
-        "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
-        "meta-llama/Meta-Llama-3.1-70B-Instruct-evals",
-        "meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
-        "meta-llama/Meta-Llama-3.1-8B-evals",
-        "meta-llama/Meta-Llama-3.1-70B-evals",
-        "meta-llama/Meta-Llama-3.1-405B-evals",
+        "meta-llama/Llama-3.1-8B-Instruct-evals",
+        "meta-llama/Llama-3.1-70B-Instruct-evals",
+        "meta-llama/Llama-3.1-405B-Instruct-evals",
+        "meta-llama/Llama-3.1-8B-evals",
+        "meta-llama/Llama-3.1-70B-evals",
+        "meta-llama/Llama-3.1-405B-evals",
    ]:
        raise ValueError(
            "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 Evals collection"
