
Commit 576e574: update readme
1 parent a7b4492

17 files changed, +26 -26 lines changed


tools/benchmarks/llm_eval_harness/README.md

Lines changed: 2 additions & 2 deletions
@@ -141,9 +141,9 @@ vLLM occasionally differs in output from Huggingface. `lm-evaluation-harness` tr

For more details about `lm-evaluation-harness`, please visit checkout their github repo [README.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/README.md).

-## Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
+## Calculating Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness

-[meta_eval_reproduce](./meta_eval_reproduce/) folder provides a detailed guide on how to reproduce the Meta Llama 3.1 evaluation metrics reported in our [Meta Llama website](https://llama.meta.com/) using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). By following the steps outlined, users can replicate a evaluation process that is similar to Meta's, for specific tasks and compare their results with our reported metrics. While slight variations in results are expected due to differences in implementation and model behavior, we aim to provide a transparent and reproducible method for evaluating Meta Llama 3 models using third party library. Please check the [README.md](./meta_eval_reproduce/README.md) for more details.
+[meta_eval](./meta_eval/) folder provides a detailed guide on how to calculate the Meta Llama 3.1 evaluation metrics reported in our [Meta Llama website](https://llama.meta.com/) using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). By following the steps outlined, users can replicate a evaluation process that is similar to Meta's, for specific tasks and compare their results with our reported metrics. While slight variations in results are expected due to differences in implementation and model behavior, we aim to provide a transparent method for evaluating Meta Llama 3 models using third party library. Please check the [README.md](./meta_eval/README.md) for more details.

## Reproducing HuggingFace Open-LLM-Leaderboard v2

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md renamed to tools/benchmarks/llm_eval_harness/meta_eval/README.md

Lines changed: 24 additions & 24 deletions
@@ -1,12 +1,12 @@

-# Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
+# Calculating Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness

-As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to closely reproduce the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
+As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to calculate the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.

## Disclaimer


-1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Meta Llama evaluation, this may lead to minor differences in the reproduced numbers.
+1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Meta Llama evaluation, this may lead to minor differences in the produced numbers.
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.

## Insights from Our Evaluation Process
@@ -19,19 +19,19 @@ Here are our insights about the differences in terms of the eval configurations
- **Inference**: We use an internal LLM inference solution that does not apply padding, while Hugging Face leaderboard uses padding on the generative tasks (MATH and IFEVAL).
- **Tasks** We run benchmarks on BBH and MMLU-Pro only for pretrained models and Math-Hard, IFeval, GPQA, only for pretrained models.

-Given those differences, our reproduced number can not be compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
+Given those differences, the numbers from this recipe can not be compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.

## Environment setups

Please install lm-evaluation-harness and our llama-recipe repo by following:

```
-pip install lm-eval[math,ifeval,sentencepiece,vllm]==0.4.3
git clone git@github.com:meta-llama/llama-recipes.git
cd llama-recipes
pip install -U pip setuptools
pip install -e .
-cd tools/benchmarks/llm_eval_harness/meta_eval_reproduce
+pip install lm-eval[math,ifeval,sentencepiece,vllm]==0.4.3
+cd tools/benchmarks/llm_eval_harness/meta_eval
```

To access our [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
@@ -47,7 +47,7 @@ Given the extensive number of tasks available (12 for pretrained models and 30 f
- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro

-Here, we aim to reproduce the Meta reported benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and reproduce our reported metrics.
+Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and get more eval metrics.


### Run eval tasks
@@ -91,7 +91,7 @@ python prepare_meta_eval.py --config_path ./eval_config.yaml
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
```

-4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get our reproduced result, which will be saved into `eval_results` folder by default.
+4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get the result, which will be saved into `eval_results` folder by default.

**NOTE**: As for `--model vllm`, here we will use VLLM inference instead of Hugging Face inference because of the padding issue. By default, for the generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command will use Hugging Face inference solution that uses a static batch method with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using EOS_token for Llama models, but our internal evaluation will load python original checkpoints and handle individual generation request asynchronously without any padding. To simulate this, we will use VLLM inference solution to do dynamic batching without any padding.
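Step 4 above saves the scores under the `eval_results` folder; the following is a minimal sketch of inspecting them from Python, assuming lm-eval's usual `results_*.json` output files (the exact file layout can vary between lm-eval versions):

```python
# A minimal sketch, assuming lm-eval writes results_*.json files somewhere under
# eval_results/; the exact file layout can differ between lm-eval versions.
import glob
import json

for path in sorted(glob.glob("eval_results/**/results_*.json", recursive=True)):
    with open(path) as f:
        results = json.load(f)["results"]  # task name -> {metric: value, ...}
    for task, metrics in results.items():
        numeric = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        print(f"{path} :: {task} :: {numeric}")
```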

@@ -115,7 +115,7 @@ Here, we will use MMLU-Pro as a example to show the steps to create a yaml confi

**1.Define the config to load datasets**

-We can use our 3.1 evals dataset as the source dataset and the corresponding subset and define the test split to latest. For example, if we want to reproduce the MMLU_Pro metric for 3.1 8B instruct, the following configs are needed as explained below:
+We can use our 3.1 evals dataset as the source dataset and the corresponding subset and define the test split to latest. For example, if we want to calculate the MMLU_Pro metric for 3.1 8B instruct, the following configs are needed as explained below:

```yaml
task: meta_mmlu_pro_instruct
@@ -124,7 +124,7 @@ dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
test_split: latest
```

-If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current reproduced result and the reported results for each sample, is different.
+If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current result and the reported results for each sample, is different.

**Note**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
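The `dataset_name` and `test_split` fields above map directly onto a `datasets.load_dataset` call. Below is a minimal sketch of pulling the same subset from Python; the repo id used as the `dataset_path` is an assumption, and access to the gated 3.1 evals collection must already be set up:

```python
# A minimal sketch; the repo id passed as dataset_path is an assumption, and access
# to the gated 3.1 evals collection (e.g. via `huggingface-cli login`) is required.
from datasets import load_dataset

ds = load_dataset(
    "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",               # assumed dataset_path
    name="Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details",  # dataset_name from the yaml above
    split="latest",                                              # test_split from the yaml above
)
print(ds.column_names)  # should include the per-sample `is_correct` column discussed above
```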

@@ -138,7 +138,7 @@ doc_to_text: !function utils.doc_to_text
doc_to_target: gold
```

-- `process_docs` : Defines the preprocess function for our datasets. In this case, we uses the `process_docs` python function that is defined in [utils.py](./meta_template/mmlu_pro/utils.py). This function will take the original dataset and output a processed dataset that has a out_doc, which contains `problem` which is the input question, `gold` which is the ground truth. We also renamed the `is_correct` column to `previously_is_correct` to allow detailed comparison for the difference of each sample between previously reported score and the reproduced score. You must use eval dataset and model with same parameters and same model type to get a valid comparison.
+- `process_docs` : Defines the preprocess function for our datasets. In this case, we uses the `process_docs` python function that is defined in [utils.py](./meta_template/mmlu_pro/utils.py). This function will take the original dataset and output a processed dataset that has a out_doc, which contains `problem` which is the input question, `gold` which is the ground truth. We also renamed the `is_correct` column to `previously_is_correct` to allow detailed comparison for the difference of each sample between previously reported score and the current score. You must use eval dataset and model with same parameters and same model type to get a valid comparison.

- `doc_to_text`: Defines the prompts. In the MMLU-Pro case, the `input_final_prompts` column always contains a list of a prompt, so we just use a python function that returns `input_final_prompts[0]`.
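To make the shape of that preprocessing concrete, below is a minimal sketch of a `process_docs` / `doc_to_text` pair matching the description above. It is not the shipped `utils.py`, and the source column names other than `input_final_prompts` and `is_correct` are assumptions:

```python
# A minimal sketch, not the shipped utils.py; source column names other than
# `input_final_prompts` and `is_correct` are assumptions for illustration.
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        return {
            "problem": doc["input_question"],            # assumed column holding the raw question
            "gold": doc["input_correct_responses"][0],   # assumed column holding the ground truth
            "previously_is_correct": doc["is_correct"],  # keep Meta's reported per-sample correctness
            "input_final_prompts": doc["input_final_prompts"],
        }

    return dataset.map(_process_doc)


def doc_to_text(doc) -> str:
    # The 3.1 evals prompts come as a single-element list; the first entry is the full prompt.
    return doc["input_final_prompts"][0]
```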

@@ -178,33 +178,33 @@ metric_list:
```
Here we set the `num_fewshot` to 0 as our prompts have already been converted to 5-shots, and the model generation will only stop if the generated output tokens exceeds 1024, as stated in the [mmlu-pro eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro). We will set the `do_sample` to false and `temperature` to 0 as stated in our `eval_config` column in the dataset. We will use metric `exact_match` for this tasks and calculate the `mean` as our task aggregated number.

-**NOTE**: While we tried our best to create the template files, those configs and functions are created based on public third-party library and are not exactly the same as our internal implementation, so there is a chance that the reproduced numbers are slightly different.
+**NOTE**: While we tried our best to create the template files, those configs and functions are created based on public third-party library and are not exactly the same as our internal implementation, so there is a chance that the eval numbers are slightly different.

## Results

-Here is the comparison between our reported numbers and the reproduced numbers in this tutorial:
+Here is the comparison between our reported numbers and the eval numbers in this tutorial:

| Model | MATH_HARD | GPQA_RAW | MMLU_PRO_RAW | IFeval |
|------------------------------|-----------|----------|--------------|---------|
-| 3.1 8B-Instruct reported | 0.254 | 0.328 | 0.47 | 0.804 |
-| 3.1 8B-Instruct reproduced | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
-| 3.1 70B-Instruct reported | 0.438 | 0.467 | 0.651 | 0.875 |
-| 3.1 70B-Instruct reproduced | 0.4388 | 0.4799 | 0.6475 | 0.848 |
+| 3.1 8B-Instruct(reported) | 0.254 | 0.328 | 0.47 | 0.804 |
+| 3.1 8B-Instruct(this) | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
+| 3.1 70B-Instruct(reported) | 0.438 | 0.467 | 0.651 | 0.875 |
+| 3.1 70B-Instruct(this) | 0.4388 | 0.4799 | 0.6475 | 0.848 |

| Model | BBH_RAW | MMLU_PRO_RAW |
|------------------------|---------|--------------|
-| 3.1 8B reported | 0.642 | 0.356 |
-| 3.1 8B reproduced | 0.6515 | 0.3572 |
-| 3.1 70B reported | 0.816 | 0.52 |
-| 3.1 70B reproduced | 0.8191 | 0.5225 |
+| 3.1 8B(reported) | 0.642 | 0.356 |
+| 3.1 8B(this) | 0.6515 | 0.3572 |
+| 3.1 70B(reported) | 0.816 | 0.52 |
+| 3.1 70B(this) | 0.8191 | 0.5225 |

-From the table above, we can see that most of our reproduced results are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).
+From the table above, we can see that most of our results calculated from this recipe are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).

**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval` as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters).
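That IFeval aggregation is a simple average of the two strict-accuracy metrics; below is a minimal sketch, with a hypothetical results file path and task key:

```python
# A minimal sketch of the IFeval aggregation described in the note above; the
# `meta_ifeval` task key and the results file path are hypothetical placeholders.
import json

with open("eval_results/results.json") as f:
    results = json.load(f)["results"]

ifeval = results["meta_ifeval"]
ifeval_final = (ifeval["inst_level_strict_acc,none"] + ifeval["prompt_level_strict_acc,none"]) / 2
print(round(ifeval_final, 4))  # e.g. 0.7782 for 3.1 8B-Instruct in the table above
```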

-**NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are reproducing the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).
+**NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are calculating the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).

-**NOTE**: The reproduced numbers may be slightly different, as we observed around ±0.01 differences between each reproduce run because the latest VLLM inference is not very deterministic even with temperature=0. This behavior maybe related [this issue](https://github.com/vllm-project/vllm/issues/5404).
+**NOTE**: The eval numbers may be slightly different, as we observed around ±0.01 differences between each evaluation run because the latest VLLM inference is not very deterministic even with temperature=0. This behavior maybe related [this issue](https://github.com/vllm-project/vllm/issues/5404).
or it is expected due to 16-bits inference as stated in [this comment](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535) and [this comment](https://github.com/vllm-project/vllm/issues/4112#issuecomment-2071115725).

## Acknowledgement
