Commit eef8b88

changed readme and added more comments
1 parent 9b8d6aa commit eef8b88

2 files changed

+22
-30
lines changed

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 6 additions & 4 deletions
@@ -6,7 +6,7 @@ As Meta Llama models gain popularity, evaluating these models has become increas
66
## Important Notes
77

88
1. **This tutorial is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, and the implementation may differ slightly from our internal evaluation, leading to minor differences in the reproduced numbers.
9-
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|`. It will not work with models that are not based on Llama 3.
9+
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.
1010

1111

1212
### Hugging Face setups
@@ -39,12 +39,12 @@ Here, we aim to reproduce the Meta reported benchmark numbers on the aforementio
3939

4040
There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
4141

42-
- **Prompts**: We use Chain-of-Thought(COT) prompts while Hugging Face leaderboard does not. The prompts that define the output format are also sometime different.
42+
- **Prompts**: We use Chain-of-Thought (CoT) prompts while the Hugging Face leaderboard does not. The prompts that define the output format are also different.
4343
- **Task type**: For the MMLU-Pro, BBH, and GPQA tasks, we ask the model to generate a response and score the answer parsed from it, while the Hugging Face leaderboard evaluation compares the log likelihood of all label words, such as [ (A),(B),(C),(D) ].
44-
- **Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can be different between ours and Hugging Face leaderboard evaluation, as our prompts that define the model output format are sometime designed differently.
44+
- **Parsers**: For generative tasks, where the final answer needs to be parsed before scoring, the parser functions can be different between ours and Hugging Face leaderboard evaluation, as our prompts that define the model output format are designed differently.
4545
- **Inference**: We use an internal LLM inference solution that loads PyTorch checkpoints and does not use padding, while the Hugging Face leaderboard uses Hugging Face format models and sometimes uses padding, depending on the task type and batch size.
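
The task type and parser differences listed above can be illustrated with a small sketch. It assumes, hypothetically, that the prompt asks the model to finish with a phrase such as "The best answer is (B)". That wording, the regex, and the function names are illustrative only and are not taken from the tutorial's actual parsers.

```python
import re
from typing import Dict, Optional

# Hypothetical answer pattern; the real prompts may define another format.
ANSWER_PATTERN = re.compile(r"(?i:best answer is)\s*\(?([A-J])\b")


def parse_choice(response: str) -> Optional[str]:
    """Return the last answer letter found in a generated response, if any."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def score_generative(response: str, gold: str) -> bool:
    """Generative scoring: parse the answer from the text, then compare."""
    return parse_choice(response) == gold


def score_loglikelihood(label_logprobs: Dict[str, float], gold: str) -> bool:
    """Log-likelihood scoring: pick the label word with the highest score."""
    predicted = max(label_logprobs, key=label_logprobs.get)
    return predicted == gold
```

The two scoring paths can disagree on the same sample, for example when the model reasons its way to a correct letter that is not the most likely label word, which is one reason the resulting numbers are not interchangeable.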
4646

47-
Given those differences, our reproduced number can not be apple to apple compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
47+
Given those differences, our reproduced numbers cannot be directly compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
4848

4949
### Create task config
5050

@@ -61,6 +61,8 @@ dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
6161
test_split: latest
6262
```
6363
64+
If you want to run the evaluation on 70B-Instruct, change `dataset_path` and `dataset_name` from 8B to 70B. Although 70B-Instruct and 8B-Instruct share the same prompts, the `is_correct` column differs between them; this column can be used to compare the current reproduced result with the reported result for each sample.
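
As a rough illustration of that switch, the sketch below rewrites the two fields in an already-loaded config dictionary. The helper name `retarget_config` is hypothetical, and the `dataset_path` value in the usage example is an assumed placeholder, since only `dataset_name` is visible in this hunk.

```python
from typing import Dict


def retarget_config(
    config: Dict[str, str],
    old_model: str = "Meta-Llama-3.1-8B-Instruct",
    new_model: str = "Meta-Llama-3.1-70B-Instruct",
) -> Dict[str, str]:
    """Return a copy of the task config that points at another model's evals dataset."""
    updated = dict(config)
    for key in ("dataset_path", "dataset_name"):
        if key in updated:
            updated[key] = updated[key].replace(old_model, new_model)
    return updated


config_8b = {
    # Assumed placeholder path; check the task YAML for the real value.
    "dataset_path": "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
    "dataset_name": "Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details",
    "test_split": "latest",
}
config_70b = retarget_config(config_8b)
```

The original dictionary is left untouched, so both configs can be kept side by side when comparing per-sample `is_correct` values.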
65+
6466
**Note**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
6567

6668
**2.Configure preprocessing, prompts and ground truth**

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/prepare_meta_eval.py

Lines changed: 16 additions & 26 deletions
@@ -1,22 +1,15 @@
11
# Copyright (c) Meta Platforms, Inc. and affiliates.
2-
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
2+
# This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.
33

44
import argparse
5-
import json
6-
import logging
75
import os
8-
import re
9-
import sys
106
from pathlib import Path
117
import glob
12-
import numpy as np
13-
import lm_eval
14-
from lm_eval import tasks
15-
from lm_eval.utils import make_table
168
import shutil, errno
179
import yaml
1810
from datasets import load_dataset,Dataset
1911

12+
# get the ifeval from the evals dataset and join it with the original ifeval datasets
2013
def get_ifeval_data(model_name,output_dir):
2114
print(f"preparing the ifeval data using {model_name}'s evals dataset")
2215
if model_name not in ["Meta-Llama-3.1-8B-Instruct","Meta-Llama-3.1-70B-Instruct","Meta-Llama-3.1-405B-Instruct"]:
@@ -36,16 +29,16 @@ def get_ifeval_data(model_name,output_dir):
3629
meta_df = meta_data.to_pandas()
3730
ifeval_df = ifeval_data.to_pandas()
3831
ifeval_df = ifeval_df.rename(columns={"prompt": "input_question"})
39-
32+
# join the two datasets on the input_question column
4033
joined = meta_df.join(ifeval_df.set_index('input_question'),on="input_question")
4134
joined = joined.rename(columns={"input_final_prompts": "prompt"})
4235
joined = joined.rename(columns={"is_correct": "previous_is_correct"})
4336
joined = Dataset.from_pandas(joined)
4437
joined = joined.select_columns(["input_question", "prompt", "previous_is_correct","instruction_id_list","kwargs","output_prediction_text","key"])
4538
joined.rename_column("output_prediction_text","previous_output_prediction_text")
46-
for item in joined:
47-
check_sample(item)
4839
joined.to_parquet(output_dir + f"/joined_ifeval.parquet")
40+
41+
# get the math_hard data from the evals dataset and join it with the original math_hard dataset
4942
def get_math_data(model_name,output_dir):
5043
print(f"preparing the math data using {model_name}'s evals dataset")
5144
if model_name not in ["Meta-Llama-3.1-8B-Instruct","Meta-Llama-3.1-70B-Instruct","Meta-Llama-3.1-405B-Instruct"]:
@@ -64,15 +57,16 @@ def get_math_data(model_name,output_dir):
6457
meta_df = meta_data.to_pandas()
6558
math_df = math_data.to_pandas()
6659
math_df = math_df.rename(columns={"problem": "input_question"})
67-
60+
# join the two datasets on the input_question column
6861
joined = meta_df.join(math_df.set_index('input_question'),on="input_question")
6962
joined = Dataset.from_pandas(joined)
7063
joined = joined.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct","solution","output_prediction_text"])
7164
joined = joined.rename_column("is_correct","previous_is_correct")
7265
joined = joined.rename_column("output_prediction_text","previous_output_prediction_text")
73-
for item in joined:
74-
check_sample(item)
66+
7567
joined.to_parquet(output_dir + f"/joined_math.parquet")
68+
69+
# get the question from the ifeval dataset
7670
def get_question(example):
7771
try:
7872
example["input_question"] = eval(example["input_question"].replace("null","None").replace("true","True").replace("false","False"))["dialog"][0]["body"].replace("Is it True that the first song","Is it true that the first song").replace("Is the following True","Is the following true")
@@ -81,15 +75,8 @@ def get_question(example):
8175
except:
8276
print(example["input_question"])
8377
return
84-
def check_sample(example):
85-
if "kwargs" in example and not example["kwargs"]:
86-
print(example)
87-
raise ValueError("This example did not got joined for IFeval")
88-
if "solution" in example and not example["solution"]:
89-
print(example)
90-
raise ValueError("This example did not got joined for MATH_hard")
91-
9278
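
The `check_sample` helper removed in this hunk raised a `ValueError` on the first row whose `kwargs` or `solution` field was empty after the join. If a similar safeguard is still wanted, one possible sketch is to collect every unjoined row index and report them together. The names `is_missing` and `find_unjoined_rows` are hypothetical and are not part of the script; unlike the removed check, this version does not treat `0` or `False` as missing.

```python
import math
from typing import Any, Dict, Iterable, List


def is_missing(value: Any) -> bool:
    """True for None, NaN, or an empty string/container."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, (str, list, tuple, dict)) and len(value) == 0:
        return True
    return False


def find_unjoined_rows(rows: Iterable[Dict[str, Any]], required_column: str) -> List[int]:
    """Return the indices of rows where the joined column is absent or empty."""
    return [
        index
        for index, row in enumerate(rows)
        if is_missing(row.get(required_column))
    ]
```

Calling `find_unjoined_rows(joined, "solution")` before writing the parquet file would list every sample that did not match, rather than stopping at the first one.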

79+
# change the yaml file to use the correct model name
9380
def change_yaml(args, base_name):
9481
for yaml_file in glob.glob(args.template_dir+'**/*/*.yaml', recursive=True):
9582
with open(yaml_file, "r") as sources:
@@ -102,6 +89,7 @@ def change_yaml(args, base_name):
10289
for line in lines:
10390
output.write(line.replace("Meta-Llama-3.1-8B",base_name).replace("WORK_DIR",str(yaml_dir)))
10491

92+
# copy the files and change the yaml file to use the correct model name
10593
def copy_and_prepare(args):
10694
if not os.path.exists(args.work_dir):
10795
# Copy the all files, including yaml files and python files, from template folder to the work folder
@@ -137,14 +125,15 @@ def prepare_datasets(args):
137125
get_ifeval_data(model_name,args.work_dir)
138126
if "meta_math_hard" in task_list:
139127
get_math_data(model_name,args.work_dir)
140-
128+
# copy the files from src to dst
141129
def copy_dir(src, dst):
142130
try:
143131
shutil.copytree(src, dst)
144132
except OSError as exc: # python >2.5
145133
if exc.errno in (errno.ENOTDIR, errno.EINVAL):
146134
shutil.copy(src, dst)
147135
else: raise
136+
# load the config yaml file
148137
def load_config(config_path: str = "./config.yaml"):
149138
# Read the YAML configuration file
150139
with open(config_path, "r") as file:
@@ -163,14 +152,15 @@ def load_config(config_path: str = "./config.yaml"):
163152
args.model_args = f"pretrained={args.model_name},tensor_parallel_size={args.tensor_parallel_size},dtype=auto,gpu_memory_utilization={args.gpu_memory_utilization},data_parallel_size={args.data_parallel_size},max_model_len={args.max_model_len},add_bos_token=True,seed=42"
164153
# Copy the all files from template folder to the work folder
165154
copy_and_prepare(args)
155+
# Prepare the datasets for the IFeval and MATH_Hard tasks as we need to join the original dataset
166156
prepare_datasets(args)
167157
print(f"preparation for the {args.model_name} using {args.evals_dataset} is done, all saved in the work_dir: {args.work_dir}")
168158
command_str = f"lm_eval --model vllm --model_args {args.model_args} --tasks {args.tasks} --batch_size auto --output_path { args.output_path} --include_path {os.path.abspath(args.work_dir)} --seed 42 "
169159
if args.limit:
170160
command_str += f" --limit {args.limit}"
171161
if args.log_samples:
172-
command_str += f" --log_samples "
162+
command_str += " --log_samples "
173163
if args.show_config:
174-
command_str += f" --show_config "
164+
command_str += " --show_config "
175165
print("please use the following command to run the meta reproduce evals:")
176166
print(command_str)
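
The command printed above is assembled by string concatenation, so a work directory or output path containing spaces would need manual quoting. A minimal alternative sketch, assuming Python 3.8+ for `shlex.join`, builds the same flags as a list and quotes them on output. The function name `build_lm_eval_command` is hypothetical; the flags themselves are the ones already used in the script.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    model_args: str,
    tasks: str,
    output_path: str,
    include_path: str,
    limit: Optional[int] = None,
    log_samples: bool = False,
    show_config: bool = False,
) -> str:
    """Build a shell-safe lm_eval command string from its parts."""
    parts: List[str] = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", tasks,
        "--batch_size", "auto",
        "--output_path", output_path,
        "--include_path", include_path,
        "--seed", "42",
    ]
    if limit:
        parts += ["--limit", str(limit)]
    if log_samples:
        parts.append("--log_samples")
    if show_config:
        parts.append("--show_config")
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(parts)
```

Because the arguments are kept as a list until the last step, the same list could also be passed to `subprocess.run` directly if the script were ever changed to launch the evaluation itself.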
