```
cd tools/benchmarks/llm_eval_harness/meta_eval_reproduce
```
To access our [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
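1. agree to the license terms on the evals dataset pages in the collection (they are gated), and
2. authenticate your machine with your Hugging Face account, for example with:

```
huggingface-cli login
```

(These are the typical steps for gated Hugging Face datasets and are assumed here; see the collection page for the exact requirements.)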
### Run eval tasks
1. We created [eval_config.yaml](./eval_config.yaml) to store all the arguments and hyperparameters. This is the main config file you need to change if you want to evaluate other models, and part of eval_config.yaml looks like this:
```yaml
model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct"# The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
# ... (other arguments, such as evals_dataset and tensor_parallel_size, are also set here)
data_parallel_size: 4 # The VLLM argument that specifies the data parallel size for the model.
```
Change `model_name` to the model you want to evaluate and change the `evals_dataset` according to the model type and parameters. Remember to set `tensor_parallel_size` to 2 or more to load the 70B models, and change `data_parallel_size` accordingly so that `tensor_parallel_size * data_parallel_size` equals the number of GPUs you have. Please read the comments inside this yaml for detailed explanations of the other parameters.
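For instance, a hypothetical 70B setup on 8 GPUs could look like the snippet below (the 70B model and dataset names are given only as an illustration; double-check them against the Hugging Face hub and the 3.1 evals collection):

```yaml
model_name: "meta-llama/Meta-Llama-3.1-70B-Instruct" # illustrative 70B model name
evals_dataset: "meta-llama/Meta-Llama-3.1-70B-Instruct-evals" # the matching 3.1 evals dataset
tensor_parallel_size: 2 # each model copy is sharded across 2 GPUs
data_parallel_size: 4 # 4 copies run in parallel, so 2 * 4 = 8 GPUs in total
```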
2. We already included all the related eval task yaml and python files in the [meta_template](./meta_template/) folder, which define all the task implementations. You do not need to change those manually; we will use [prepare_meta_eval.py](./prepare_meta_eval.py) to modify them automatically later.
3. Then we can run [prepare_meta_eval.py](./prepare_meta_eval.py), which reads the configuration from [eval_config.yaml](./eval_config.yaml), copies everything in the template folder to a working folder `work_dir`, modifies those templates accordingly, prepares the dataset if needed, and prints out the CLI command to run `lm_eval`.
To run [prepare_meta_eval.py](./prepare_meta_eval.py), we can do something like the following (assuming the script exposes a `--config_path` argument for the config file):
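```
# --config_path is assumed here; point it at your eval_config.yaml
python prepare_meta_eval.py --config_path ./eval_config.yaml
```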
By default, this will load the default [eval_config.yaml](./eval_config.yaml) config and print out a CLI command to run the `meta_instruct` group tasks, which include `meta_ifeval`, `meta_math_hard`, `meta_gpqa` and `meta_mmlu_pro_instruct`, for the `meta-llama/Meta-Llama-3.1-8B-Instruct` model using the `meta-llama/Meta-Llama-3.1-8B-Instruct-evals` dataset and `lm_eval`.
An example `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) looks roughly like this (the exact arguments depend on your [eval_config.yaml](./eval_config.yaml); the command below is only an illustration):
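```
# illustrative only; your actual command is generated by prepare_meta_eval.py from eval_config.yaml
lm_eval --model vllm \
    --model_args "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=4,dtype=auto,gpu_memory_utilization=0.9,add_bos_token=True" \
    --tasks meta_instruct \
    --batch_size auto \
    --output_path eval_results \
    --include_path ./work_dir
```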
4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get our reproduced result, which will be saved into the `eval_results` folder by default.
**NOTE**: As for `--model vllm`, here we use VLLM inference instead of Hugging Face inference because of a padding issue. By default, for generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command uses the Hugging Face inference solution, which applies static batching with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using the EOS token for Llama models, whereas our internal evaluation loads the original Python checkpoints and handles each generation request asynchronously without any padding. To simulate this, we use the VLLM inference solution, which does dynamic batching without any padding.