Commit 2f0a006

fix typo and readme
1 parent 4c9841e

File tree

- .github/scripts/spellcheck_conf/wordlist.txt
- tools/benchmarks/README.md
- tools/benchmarks/llm_eval_harness/README.md

3 files changed (+10 lines, -4 lines)

.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 6 additions & 0 deletions

@@ -1451,3 +1451,9 @@ openhathi
 sarvam
 subtask
 acc
+BigBench
+IFEval
+MuSR
+Multistep
+multistep
+algorithmically

tools/benchmarks/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 # Benchmarks
 
 * inference - a folder contains benchmark scripts that apply a throughput analysis for Llama models inference on various backends including on-prem, cloud and on-device.
-* llm_eval_harness - a folder contains a tool to evaluate fine-tuned Llama models including quantized models focusing on quality.
+* llm_eval_harness - a folder that introduces `lm-evaluation-harness`, a tool to evaluate Llama models including quantized models focusing on quality. We also included a recipe that reproduces Meta 3.1 evaluation metrics using `lm-evaluation-harness` and instructions that reproduce HuggingFace Open LLM Leaderboard v2 metrics.

tools/benchmarks/llm_eval_harness/README.md

Lines changed: 3 additions & 3 deletions

@@ -151,7 +151,7 @@ In the HF leaderboard v2, the [LLMs are evaluated on 6 benchmarks](https://huggi
 
 - **IFEval**: [IFEval](https://arxiv.org/abs/2311.07911) is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.
 - **BBH (Big Bench Hard)**: [BBH](https://arxiv.org/abs/2210.09261) is a subset of 23 challenging tasks from the BigBench dataset to evaluate language models. The tasks use objective metrics, are highly difficult, and have sufficient sample sizes for statistical significance. They include multistep arithmetic, algorithmic reasoning (e.g., boolean expressions, SVG shapes), language understanding (e.g., sarcasm detection, name disambiguation), and world knowledge. BBH performance correlates well with human preferences, providing valuable insights into model capabilities.
-- **MATH**: [MATH](https://arxiv.org/abs/2103.03874) is a compilation of high-school level competition problems gathered from several sources, formatted consistently using Latex for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only level 5 MATH questions and call it MATH Lvl 5.
+- **MATH**: [MATH](https://arxiv.org/abs/2103.03874) is a compilation of high-school level competition problems gathered from several sources, formatted consistently using Latex for equations and asymptote for figures. Generations must fit a very specific output format. We keep only level 5 MATH questions and call it MATH Level 5.
 - **GPQA (Graduate-Level Google-Proof Q&A Benchmark)**: [GPQA](https://arxiv.org/abs/2311.12022) is a highly challenging knowledge dataset with questions crafted by PhD-level domain experts in fields like biology, physics, and chemistry. These questions are designed to be difficult for laypersons but relatively easy for experts. The dataset has undergone multiple rounds of validation to ensure both difficulty and factual accuracy. Access to GPQA is restricted through gating mechanisms to minimize the risk of data contamination. Consequently, we do not provide plain text examples from this dataset, as requested by the authors.
 - **MuSR (Multistep Soft Reasoning)**: [MuSR](https://arxiv.org/abs/2310.16049) is a new dataset consisting of algorithmically generated complex problems, each around 1,000 words in length. The problems include murder mysteries, object placement questions, and team allocation optimizations. Solving these problems requires models to integrate reasoning with long-range context parsing. Few models achieve better than random performance on this dataset.
 - **MMLU-PRO (Massive Multitask Language Understanding - Professional)**: [MMLU-Pro](https://arxiv.org/abs/2406.01574) is a refined version of the MMLU dataset, which has been a standard for multiple-choice knowledge assessment. Recent research identified issues with the original MMLU, such as noisy data (some unanswerable questions) and decreasing difficulty due to advances in model capabilities and increased data contamination. MMLU-Pro addresses these issues by presenting models with 10 choices instead of 4, requiring reasoning on more questions, and undergoing expert review to reduce noise. As a result, MMLU-Pro is of higher quality and currently more challenging than the original.
@@ -164,13 +164,13 @@ To run a leaderboard evaluation for `Meta-Llama-3.1-8B`, we can run the followin
 accelerate launch -m lm_eval --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4
 ```
 
-Similarily to run a leaderboard evaluation for `Meta-Llama-3.1-8B-Instruct`, we can run the following, using `--apply_chat_template --fewshot_as_multiturn`:
+Similarly to run a leaderboard evaluation for `Meta-Llama-3.1-8B-Instruct`, we can run the following, using `--apply_chat_template --fewshot_as_multiturn`:
 
 ```bash
 accelerate launch -m lm_eval --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4 --apply_chat_template --fewshot_as_multiturn
 ```
 
-As for 70B models, it is required to run tensor parallellism as it can not fit into 1 GPU, therefore we can run the following for `Meta-Llama-3.1-70B-Instruct`:
+As for 70B models, it is required to run tensor parallelism as it can not fit into 1 GPU, therefore we can run the following for `Meta-Llama-3.1-70B-Instruct`:
 
 ```bash
 lm_eval --model hf --batch_size 4 --model_args pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,parallelize=True --tasks leaderboard --log_samples --output_path eval_results --apply_chat_template --fewshot_as_multiturn
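
Beyond running the full `leaderboard` task group as in the commands above, it can be useful to iterate on a single benchmark at a time. The sketch below assumes the per-benchmark task names exposed by recent `lm-evaluation-harness` releases (for example `leaderboard_ifeval`); these names are not stated in this commit, so confirm them with `lm_eval --tasks list` before relying on them.

```bash
# Sketch only (not part of this commit): score just the IFEval slice of the
# leaderboard group. `leaderboard_ifeval` is an assumed task name; verify it
# against the output of `lm_eval --tasks list` for your installed version.
accelerate launch -m lm_eval \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks leaderboard_ifeval \
  --batch_size 4 \
  --log_samples --output_path eval_results \
  --apply_chat_template --fewshot_as_multiturn
```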
