
Commit 8443489

fix readme
1 parent e1b7bc7 commit 8443489

2 files changed: 13 additions & 13 deletions


tools/benchmarks/llm_eval_harness/README.md

Lines changed: 12 additions & 12 deletions
@@ -40,7 +40,7 @@ pip install -e .
To run evaluation for the Hugging Face `Llama 3.1 8B` model on a single GPU, please run the following:

```bash
-lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B --tasks hellaswag --device cuda:0 --batch_size 8
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B --tasks hellaswag --device cuda:0 --batch_size 8

```
Tasks can be extended by separating them with `,`, for example `--tasks hellaswag,arc`.
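For example, a combined run over two tasks could look like the following sketch (`arc_easy` is used here only as a stand-in for any additional task available in your harness installation):

```bash
# Evaluate two tasks in one run; add more task names separated by commas.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks hellaswag,arc_easy \
    --device cuda:0 \
    --batch_size 8
```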
@@ -52,15 +52,15 @@ To set the number of shots you can use `--num_fewshot` to set the number for few
In case you have fine-tuned your model using PEFT, you can set the path to the PEFT checkpoint via `peft` as part of `model_args`, as shown below:

```bash
-lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B, dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
```

### Limit the number of examples in benchmarks

There has been a study from [IBM on efficient benchmarking of LLMs](https://arxiv.org/pdf/2308.11696.pdf); its main takeaway is that, to identify whether a model is performing poorly, benchmarking on a wider range of tasks matters more than the number of examples per task. This means you can run the evaluation harness with fewer examples per task to get an initial read on whether performance has regressed from the baseline. To limit the number of examples, set the `--limit` flag to the desired number; for a full assessment you would still need to run the complete evaluation. Please read more in the paper linked above.

```bash
-lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
```

### Customized Llama Model
@@ -77,7 +77,7 @@ To perform *data-parallel evaluation* (where each GPU loads a **separate full co

```bash
accelerate launch -m lm_eval --model hf \
-    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B \
+    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
@@ -94,7 +94,7 @@ In this setting, run the library *outside the `accelerate` launcher*, but passin
```
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
-    --model_args pretrained=meta-llama/Meta-Llama-3.1-70B,parallelize=True \
+    --model_args pretrained=meta-llama/Llama-3.1-70B,parallelize=True \
    --batch_size 16
```

@@ -111,7 +111,7 @@ There is also an option to run with tensor parallel and data parallel together.
```
accelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \
    -m lm_eval --model hf \
-    --model_args pretrained=meta-llama/Meta-Llama-3.1-70B \
+    --model_args pretrained=meta-llama/Llama-3.1-70B \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
    --batch_size 16
```
@@ -158,20 +158,20 @@ In the HF leaderboard v2, the [LLMs are evaluated on 6 benchmarks](https://huggi

In order to install the correct lm-evaluation-harness version, please check the Hugging Face 🤗 Open LLM Leaderboard v2 [reproducibility section](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility); see also the install sketch after this diff.

-To run a leaderboard evaluation for `Meta-Llama-3.1-8B`, we can run the following:
+To run a leaderboard evaluation for `Llama-3.1-8B`, we can run the following:

```bash
-accelerate launch -m lm_eval --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4
+accelerate launch -m lm_eval --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4
```

-Similarly to run a leaderboard evaluation for `Meta-Llama-3.1-8B-Instruct`, we can run the following, using `--apply_chat_template --fewshot_as_multiturn`:
+Similarly, to run a leaderboard evaluation for `Llama-3.1-8B-Instruct`, we can run the following, using `--apply_chat_template --fewshot_as_multiturn`:

```bash
-accelerate launch -m lm_eval --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4 --apply_chat_template --fewshot_as_multiturn
+accelerate launch -m lm_eval --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4 --apply_chat_template --fewshot_as_multiturn
```

-As for 70B models, it is required to run tensor parallelism as it can not fit into 1 GPU, therefore we can run the following for `Meta-Llama-3.1-70B-Instruct`:
+For 70B models, tensor parallelism is required because the model cannot fit into a single GPU, so we can run the following for `Llama-3.1-70B-Instruct`:

```bash
-lm_eval --model hf --batch_size 4 --model_args pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,parallelize=True --tasks leaderboard --log_samples --output_path eval_results --apply_chat_template --fewshot_as_multiturn
+lm_eval --model hf --batch_size 4 --model_args pretrained=meta-llama/Llama-3.1-70B-Instruct,parallelize=True --tasks leaderboard --log_samples --output_path eval_results --apply_chat_template --fewshot_as_multiturn
```
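If you only need a recent harness build that already ships the `leaderboard` task group, an install along the following lines is assumed to work (a sketch only; the exact version pinned in the reproducibility section above takes precedence):

```bash
# Sketch: clone the upstream harness and install it with the extras the
# leaderboard tasks typically rely on (math, ifeval, sentencepiece).
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e ".[math,ifeval,sentencepiece]"
```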

tools/benchmarks/llm_eval_harness/meta_eval/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ As Llama models gain popularity, evaluating these models has become increasingly
## Disclaimer


-1. **This recipe is not the official implementation** of Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Llama evaluation, this may lead to minor differences in the produced numbers.
+1. **This recipe is not the official implementation** of Llama evaluation. Since our internal eval repo isn't public, we want to provide this recipe as an aid for anyone who wants to use the datasets we released. It is based on public third-party libraries; because this implementation does not mirror the internal Llama evaluation, it may lead to minor differences in the produced numbers.
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>` (see the prompt sketch below). It will not work with models that are not based on Llama 3.

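For illustration only, prompts in that style roughly follow the Llama 3 chat template; the `{question}` placeholder stands in for the dataset prompt, and the exact prompts shipped with the released evaluation datasets may differ:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```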
## Insights from Our Evaluation Process
