
Commit 19dd9dc
committed: update instruction
1 parent c1390e8 commit 19dd9dc

File tree

  • tools/benchmarks/llm_eval_harness/meta_eval_reproduce

1 file changed: +13 -8 lines

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Lines changed: 13 additions & 8 deletions
@@ -23,13 +23,18 @@ Given those differences, our reproduced number can not be compared to the number

## Environment setups

-Please install our llama-recipe repo and lm-evaluation-harness by following:
+Please install our lm-evaluation-harness and llama-recipe repo by following:

```
-pip install llama-recipes
-git clone https://github.com/EleutherAI/lm-evaluation-harness
+git clone git@github.com:EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .[math,ifeval,sentencepiece,vllm]
+cd ../
+git clone git@github.com:meta-llama/llama-recipes.git
+cd llama-recipes
+pip install -U pip setuptools
+pip install -e .
+cd tools/benchmarks/llm_eval_harness/meta_eval_reproduce
```

To access our [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
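Once that access has been granted, one quick way to confirm it from the eval machine is a minimal check along these lines; this is only a sketch, assuming `huggingface_hub` is available in the environment and using the 8B-Instruct evals dataset id quoted later in this README:

```python
# Sketch only: confirm the gated 3.1 evals collection is reachable before running anything.
# The repo id below is the 8B-Instruct evals dataset mentioned later in this README;
# swap it for whichever evals dataset you plan to use.
from huggingface_hub import login, snapshot_download

login()  # paste a Hugging Face token that has been granted access to the collection

path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
    repo_type="dataset",
)
print(f"Evals dataset cached at: {path}")
```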
@@ -50,7 +55,7 @@ Here, we aim to reproduce the Meta reported benchmark numbers on the aforementio

### Run eval tasks

-1. We create [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. This is the main config file you need to change and part of eval_config.yaml looks like this:
+1. We created [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. This is the main config file you need to change if you want to eval other models, and a part of eval_config.yaml looks like this:

```yaml
model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
@@ -71,25 +76,25 @@ data_parallel_size: 4 # The VLLM argument that speicify the data parallel size f

Change `model_name` to the model name you want to eval on and change the `evals_dataset` according to the model type and parameters. Remember to adjust the `tensor_parallel_size` to 2 or more to load the 70B models and change the `data_parallel_size` accordingly so that `tensor_parallel_size * data_parallel_size` is the number of GPUs you have. Please read the comments inside this yaml for detailed explanations on other parameters.
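For example, the default `tensor_parallel_size=1` with `data_parallel_size=4` (as in the example command further down) expects 4 GPUs, while a 70B model on an 8-GPU node would use `tensor_parallel_size: 2` and `data_parallel_size: 4`. A minimal sanity check along these lines, assuming PyYAML and PyTorch are importable (both come in with the installs above), can catch a mismatch before launching:

```python
# Illustrative sketch, not part of the repo: check that the parallel sizes in
# eval_config.yaml multiply out to the number of GPUs actually visible.
import yaml
import torch

with open("eval_config.yaml") as f:
    cfg = yaml.safe_load(f)

tp = cfg["tensor_parallel_size"]  # 2 or more for 70B models
dp = cfg["data_parallel_size"]
gpus = torch.cuda.device_count()

# tensor_parallel_size * data_parallel_size should equal the number of GPUs you have.
assert tp * dp == gpus, f"tp ({tp}) x dp ({dp}) = {tp * dp}, but {gpus} GPUs are visible"
print(f"{cfg['model_name']}: tensor_parallel_size={tp}, data_parallel_size={dp} on {gpus} GPUs")
```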

-2. We already included all the related eval task yaml and python files in the [meta_template](./meta_template/) folder, which defines all the task implementation. You do not need to change those manually, we will use [prepare_meta_eval.py](./prepare_meta_eval.py) to automatically change them later.
+2. We already included all the related eval task yaml and python files in the [meta_template](./meta_template/) folder, which define all the task implementations. You do not need to change those manually; we will use [prepare_meta_eval.py](./prepare_meta_eval.py) to automatically change them later.

-3. Then we can run a [prepare_meta_eval.py](./prepare_meta_eval.py) that reads the configuration from [eval_config.yaml](./eval_config.yaml), copies everything in the template folder to a working folder `work_dir`, makes modification to those templates accordingly, prepares dataset if needed and print out the CLI command to run the `lm_eval`.
+3. Then we can run [prepare_meta_eval.py](./prepare_meta_eval.py), which reads the configuration from [eval_config.yaml](./eval_config.yaml), copies everything in the template folder to a working folder `work_dir`, makes modifications to those templates accordingly, prepares the dataset if needed, and prints out the CLI command to run `lm_eval`.

To run the [prepare_meta_eval.py](./prepare_meta_eval.py), we can do:

```
python prepare_meta_eval.py --config_path ./eval_config.yaml
```

-By default,this will load the default [eval_config.yaml](./eval_config.yaml) config and print out a CLI command to run `meta_instruct` group tasks, which includes `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct`, for `meta-llama/Meta-Llama-3.1-8B-Instruct` model using `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset using `lm_eval`.
+By default, this will load the default [eval_config.yaml](./eval_config.yaml) config and print out a CLI command to run the `meta_instruct` group tasks, which include `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct`, for the `meta-llama/Meta-Llama-3.1-8B-Instruct` model using the `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset and `lm_eval`.

An example output from [prepare_meta_eval.py](./prepare_meta_eval.py) looks like this:

```
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
```

-4. Then just copy the command printed from [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get our reproduced result, saved into `eval_results` folder by default.
+4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get our reproduced result, which will be saved into the `eval_results` folder by default.
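After the run finishes, the summary metrics can be pulled out of `eval_results` directly. A minimal sketch, assuming the default `--output_path eval_results` from the command above; the exact file layout varies across lm-evaluation-harness versions, so this simply globs for result JSON files:

```python
# Illustrative only: print whatever summary metrics lm_eval wrote under eval_results/.
# The directory layout differs between harness versions, so glob broadly.
import glob
import json

for path in sorted(glob.glob("eval_results/**/*.json", recursive=True)):
    with open(path) as f:
        data = json.load(f)
    if "results" in data:  # summary files carry a top-level "results" mapping
        print(path)
        for task, metrics in data["results"].items():
            print(f"  {task}: {metrics}")
```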

**NOTE**: As for `--model vllm`, here we will use VLLM inference instead of Hugging Face inference because of the padding issue. By default, for the generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command will use Hugging Face inference solution that uses a static batch method with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using EOS_token for Llama models, but our internal evaluation will load python original checkpoints and handle individual generation request asynchronously without any padding. To simulate this, we will use VLLM inference solution to do dynamic batching without any padding.
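To make the padding difference concrete, here is a small illustration (not taken from the harness) of what static batching with left padding using the EOS token looks like for a Llama tokenizer, assuming access to `meta-llama/Meta-Llama-3.1-8B-Instruct`:

```python
# Illustration of HF-style static batching: every prompt in the batch is padded
# on the left with EOS ids up to the length of the longest prompt. vLLM's dynamic
# batching, used by the command above, avoids this padding entirely.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.padding_side = "left"
tok.pad_token = tok.eos_token  # Llama models ship without a dedicated pad token

batch = tok(
    ["Short prompt", "A much longer prompt that sets the width of the whole batch"],
    return_tensors="pt",
    padding=True,
)
print(batch["input_ids"])       # the first row starts with repeated EOS ids
print(batch["attention_mask"])  # zeros mark the padded positions
```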
