Commit ebf51a2

Refine accuracy benchmarking documentation
Updated the text for clarity and consistency, including changes to the title and various sections to improve readability and precision in instructions.
1 parent 093b889 commit ebf51a2

1 file changed: 47 additions, 75 deletions

---
title: Evaluate accuracy with LM Evaluation Harness
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Why accuracy benchmarking

The LM Evaluation Harness (lm-eval-harness) is a widely used open-source framework for evaluating the accuracy of large language models on standardized academic benchmarks such as MMLU, HellaSwag, and GSM8K. It provides a consistent interface across runtimes, including Hugging Face Transformers, vLLM, and llama.cpp, using the same datasets, few-shot templates, and scoring metrics. In this Learning Path, you'll run accuracy tests for BF16 (non-quantized) and INT4 (quantized) deployments of your model served by vLLM on Arm-based servers.

You will:
* Install lm-eval-harness with vLLM support
* Run benchmarks on a BF16 model and an INT4 (weight-quantized) model
* Interpret key metrics and compare quality across precisions

{{% notice Note %}}
Results vary with CPU, dataset version, and model choice. For a fair comparison between BF16 and INT4, use the same tasks, few-shot settings, and evaluation batch size across runs.
{{% /notice %}}

## Prerequisites

Before you start:
* Complete the optimized build in “Overview and Optimized Build” and validate your vLLM installation.
* Optionally, quantize a model using the “Quantize an LLM to INT4 for Arm Platform” module. The quantized output directory from that step is used as the input for INT4 evaluation.

If you haven't quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy.

## Install lm-eval-harness

Install the harness with vLLM backend support in your active Python environment so that you can evaluate models served by vLLM directly:

```bash
pip install "lm_eval[vllm]"
pip install ray
```

{{% notice Tip %}}
If your benchmarks include gated models or restricted datasets, run `huggingface-cli login` first so the harness can authenticate with Hugging Face and download the protected resources it needs.
{{% /notice %}}
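
As a quick, optional sanity check, confirm that the harness CLI and the vLLM backend are visible in your environment before you start a long evaluation run (exact version numbers will differ):

```bash
# Optional sanity check: confirm the packages are installed and the CLI is on PATH
pip show lm_eval vllm | grep -E '^(Name|Version)'
lm_eval --help | head -n 5
```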

## Recommended runtime settings for Arm CPU

Before running accuracy benchmarks, export the same performance-tuned environment variables you used for serving. These settings ensure vLLM runs with Arm-optimized kernels (through oneDNN and the Arm Compute Library) and consistent thread affinity across all CPU cores during evaluation:

```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```

Explanation of settings:

| Variable | Purpose |
| --- | --- |
| **`VLLM_TARGET_DEVICE=cpu`** | Forces vLLM to run entirely on CPU, ensuring evaluation results use Arm-optimized oneDNN kernels. |
| **`VLLM_CPU_KVCACHE_SPACE=32`** | Reserves 32 GB for key/value caches used in attention. Adjust if evaluating with longer contexts or larger batches. |
| **`VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"`** | Pins OpenMP worker threads to physical cores 0 to N-1 to minimize OS thread migration and improve cache locality. |
| **`VLLM_MLA_DISABLE=1`** | Disables GPU/MLA probing for faster initialization in CPU-only mode. |
| **`ONEDNN_DEFAULT_FPMATH_MODE=BF16`** | Enables bfloat16 math mode, using reduced-precision operations for faster compute while maintaining numerical stability. |
| **`OMP_NUM_THREADS="$(nproc)"`** | Uses all available CPU cores to parallelize matrix multiplications and attention layers. |
| **`LD_PRELOAD`** | Preloads tcmalloc (Thread-Caching Malloc) to reduce memory-allocator contention under high concurrency. |

{{% notice Note %}}
`LD_PRELOAD` uses tcmalloc to reduce allocator overhead when running multiple evaluation tasks in parallel. If it isn't installed, add it with `sudo apt-get install -y libtcmalloc-minimal4`.
{{% /notice %}}
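
Optionally, verify the environment before launching an evaluation. These checks assume the tcmalloc path used in the `LD_PRELOAD` line above; the location may differ on other distributions:

```bash
# Optional checks before running evaluations
ls -l /usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4   # confirm tcmalloc is installed
nproc                                                       # core count used for OMP_NUM_THREADS
echo "$VLLM_CPU_OMP_THREADS_BIND"                           # confirm the thread-binding range is set
```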

## Accuracy benchmarking: Meta‑Llama‑3.1‑8B‑Instruct BF16 model

To establish a baseline accuracy reference, evaluate the non-quantized BF16 model served through vLLM. This run measures how the original model performs under Arm-optimized BF16 inference before you apply INT4 quantization. Replace the model ID and `--model_args` if you are using a different model variant or checkpoint.

```bash
# Adjust --model_args if your serving configuration differs
lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
  --tasks mmlu,hellaswag \
  --batch_size auto \
  --output_path results
```

After the run completes, review the results directory for accuracy metrics (for example, `acc` and `acc_norm`) and record them as your BF16 baseline; you'll compare the INT4 results against these numbers next.
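
The harness writes a JSON results file under the directory passed to `--output_path`; the exact file name and layout depend on your lm-eval-harness version. One quick way to locate the file and spot the accuracy fields:

```bash
# Locate the results file and peek at the accuracy fields (file names vary by harness version)
ls -R results | head
grep -Rn '"acc' results | head
```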

## Benchmark INT4 quantized model accuracy

Use the INT4 quantization recipe and script from the previous module to quantize the `meta-llama/Meta-Llama-3.1-8B-Instruct` model, then benchmark the quantized model with the same tasks and settings as the BF16 baseline. This comparison shows how much accuracy is preserved after compression. Replace the model path with your quantized output directory, for example `Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise`.

Channelwise INT4 (MSE):

```bash
lm_eval \
  --model vllm \
  --model_args pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \
  --tasks mmlu,hellaswag \
  --batch_size auto \
  --output_path results
```

The output includes per-task accuracy metrics. Compare these results with your BF16 baseline to evaluate how much quality INT4 quantization preserves.

## Interpret the results

After running evaluations, the harness prints per-task and aggregate metrics such as `acc`, `acc_norm`, and `exact_match`. These represent model accuracy across different datasets and question formats; higher values indicate better performance. Key metrics include:

* `acc`: standard accuracy, the fraction of correct predictions.
* `acc_norm`: normalized accuracy, which adjusts for multiple-choice imbalance.
* `exact_match`: strict string-level match, typically used for reasoning or QA tasks.

Compare BF16 and INT4 results on identical tasks to assess the accuracy-efficiency trade-off introduced by quantization.

Practical tips:
* Use identical tasks, few-shot settings, and seeds across runs to ensure fair comparisons.
* For quick validation during tuning, add `--limit 200` to restrict each task to 200 samples; see the example after this list.
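
For example, a smoke test of the INT4 model on a single task might look like the following; the `results-smoke` output directory is just an illustrative name:

```bash
# Quick validation run: one task, 200 samples per task (illustrative)
lm_eval \
  --model vllm \
  --model_args pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \
  --tasks hellaswag \
  --limit 200 \
  --batch_size auto \
  --output_path results-smoke
```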

## Explore example results for Meta‑Llama‑3.1‑8B‑Instruct model

The following results are illustrative reference points; your actual scores may differ with hardware, dataset versions, and lm-eval-harness releases. Higher values indicate better accuracy.

| Variant | MMLU (acc ± err) | HellaSwag (acc ± err) |
|---------|------------------|-----------------------|
| BF16 | 0.5897 ± 0.0049 | 0.7916 ± 0.0041 |
| INT4 Groupwise minmax (G=32) | 0.5831 ± 0.0049 | 0.7819 ± 0.0041 |
| INT4 Channelwise MSE | 0.5712 ± 0.0049 | 0.7633 ± 0.0042 |

How to interpret these results:

* BF16 baseline: represents near-FP32 accuracy and serves as your quality reference.
* INT4 Groupwise minmax (G=32): retains almost all of the baseline accuracy while reducing model size by roughly 4x and substantially improving throughput.
* INT4 Channelwise MSE: slightly lower accuracy, typically within 2-3 percentage points of BF16, and still competitive for most production use cases.

Use these numbers as ballpark expectations to check whether your runs are in a reasonable range, not as official targets.
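
As a rough worked example using the table above, the MMLU gap between BF16 and INT4 Channelwise MSE is under two percentage points:

```bash
# Accuracy delta in percentage points, using the example MMLU scores above
awk 'BEGIN { printf "%.2f percentage points\n", (0.5897 - 0.5712) * 100 }'
```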

## Next steps

Now that you've completed accuracy benchmarking for both BF16 and INT4 models on Arm-based servers, you can broaden your evaluation and tune for your specific use case:

- Add tasks that reflect your real-world workloads, such as `gsm8k` (arithmetic and logical reasoning, which is sensitive to quantization), `winogrande` (commonsense and pronoun disambiguation), and `arc_easy`/`arc_challenge` (science and multi-step reasoning). Running multiple benchmarks gives a more comprehensive picture of model robustness; see the example after this list.
- Sweep quantization recipes (minmax or MSE, channelwise or groupwise, group size) to find the best accuracy-throughput trade-off for your hardware.
- Record both throughput and accuracy to choose the best configuration for your workload.
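
For example, the broader task set can be passed to the same command used earlier. The sketch below reuses the INT4 model path from this Learning Path; substitute your own model or checkpoint as needed:

```bash
# Broaden the evaluation to reasoning and commonsense tasks (adapt the model path to your setup)
lm_eval \
  --model vllm \
  --model_args pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \
  --tasks gsm8k,winogrande,arc_easy,arc_challenge \
  --batch_size auto \
  --output_path results
```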

You've learned how to set up lm-eval-harness, run benchmarks for BF16 and INT4 models, and interpret key accuracy metrics on Arm platforms. By iterating on these steps, you can build a performance and accuracy profile for your Arm deployment and select the quantization strategy and runtime configuration that best fit your target workload.
