`content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md`
You should have:

The quantized model directory (for example, `DeepSeek-V2-Lite-w4a8dyn-mse-channelwise`) will be used as input for INT4 evaluation.

If you haven't quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy.

## Install LM Evaluation Harness

You will install the LM Evaluation Harness with vLLM backend support, allowing direct evaluation of your model through the vLLM backend.

Install it inside your active Python environment:

```bash
pip install "lm_eval[vllm]"
pip install ray
```
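
To confirm the installation, you can list the tasks the harness knows about. This is an optional sanity check; the output format varies between lm-eval releases, and any non-error output means the install worked.

```bash
# Optional sanity check: confirm the package is installed and list available tasks.
pip show lm_eval
lm_eval --tasks list | head -n 20
```
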

{{% notice Tip %}}
If your benchmarks include gated models or restricted datasets, run `huggingface-cli login` first.
This ensures the harness can authenticate with Hugging Face and download any protected resources needed for evaluation.
{{% /notice %}}

## Recommended Runtime Settings for Arm CPU

Before running accuracy benchmarks, export the same performance-tuned environment variables you used for serving.
These settings ensure vLLM runs with Arm-optimized kernels (via oneDNN and the Arm Compute Library) and consistent thread affinity across all CPU cores during evaluation.
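
As a reference, the full set of exports might look like the sketch below; the tcmalloc library path is an assumption based on a standard Ubuntu arm64 installation of `libtcmalloc-minimal4`, so adjust it for your system.

```bash
# Sketch of the runtime environment, based on the variables described in the table below.
# The tcmalloc path assumes a standard Ubuntu arm64 install of libtcmalloc-minimal4.
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```

The table below describes what each variable does: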

| Variable | Purpose |
|----------|---------|
| **`VLLM_TARGET_DEVICE=cpu`** | Forces vLLM to run entirely on CPU, ensuring evaluation results use Arm-optimized oneDNN kernels. |
| **`VLLM_CPU_KVCACHE_SPACE=32`** | Reserves 32 GB for the key/value caches used in attention. Adjust if evaluating with longer contexts or larger batches. |
| **`VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"`** | Pins OpenMP worker threads to physical cores (0 to N-1) to minimize OS thread migration and improve cache locality. |
| **`VLLM_MLA_DISABLE=1`** | Disables GPU/MLA probing for faster initialization in CPU-only mode. |
| **`ONEDNN_DEFAULT_FPMATH_MODE=BF16`** | Enables **bfloat16** math mode, using reduced-precision operations for faster compute while maintaining numerical stability. |
| **`OMP_NUM_THREADS="$(nproc)"`** | Uses all available CPU cores to parallelize matrix multiplications and attention layers. |
| **`LD_PRELOAD`** | Preloads **tcmalloc** (Thread-Caching Malloc) to reduce memory allocator contention under high concurrency. |

{{% notice Note %}}
tcmalloc helps reduce allocator overhead when running multiple evaluation tasks in parallel.
If it's not installed, add it with `sudo apt-get install -y libtcmalloc-minimal4`.
{{% /notice %}}

## Accuracy Benchmarking: Meta-Llama-3.1-8B-Instruct BF16 model

To establish a baseline accuracy reference, evaluate a non-quantized BF16 model served through vLLM.
This run measures how the original model performs under Arm-optimized BF16 inference before applying INT4 quantization.
Replace the model ID if you are using a different model variant or checkpoint.

```bash
lm_eval \
  ...
  --batch_size auto \
  --output_path results
```
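
The block above abbreviates the model and task arguments. For reference, a complete BF16 run with the vLLM backend looks roughly like the sketch below; the task list (`hellaswag`, `mmlu`) and the `max_model_len` value are illustrative assumptions, so match them to the tasks and context length used in your setup.

```bash
# Illustrative BF16 baseline run; the tasks and model_args values are assumptions.
lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096 \
  --tasks hellaswag,mmlu \
  --batch_size auto \
  --output_path results
```
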

After completing this test, review the results directory for accuracy metrics (for example, `acc_norm` and `acc`) and record them as your BF16 baseline.
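
If you want to inspect the raw output, the harness writes JSON summaries under the directory passed to `--output_path`. A minimal way to peek at one of these files (the exact layout under `results/` varies between harness releases) is:

```bash
# List the results tree and pretty-print the last results file.
# lm-eval typically embeds a timestamp in the filename, so the lexicographically
# last file is usually the newest run.
ls -R results
python -m json.tool "$(find results -name '*.json' | sort | tail -n 1)" | head -n 40
```
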

Next, you'll run the same benchmarks on the INT4 quantized model to compare accuracy across precisions.

## Accuracy Benchmarking: INT4 quantized model

Now that you've quantized your model using the INT4 recipe and script from the previous module, you can benchmark its accuracy using the same evaluation harness and task set.
This test compares quantized (INT4) performance against your BF16 baseline, revealing how much accuracy is preserved after compression.
Use the quantized directory generated earlier (for example, the channelwise INT4 MSE output):
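
A minimal sketch of the INT4 run, assuming a quantized output directory named `Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise` (substitute the directory your quantization script actually produced) and the same task list as the BF16 baseline:

```bash
# Illustrative INT4 evaluation run; the directory name and task list are assumptions.
lm_eval \
  --model vllm \
  --model_args pretrained=./Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,max_model_len=4096 \
  --tasks hellaswag,mmlu \
  --batch_size auto \
  --output_path results
```
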
After this evaluation, compare the metrics from both runs to quantify how much accuracy the INT4 model retains.

## Interpreting results

After running evaluations, the LM Evaluation Harness prints per-task and aggregate metrics such as `acc`, `acc_norm`, and `exact_match`.
These represent model accuracy across various datasets and question formats; higher values indicate better performance.

Key metrics include:

* `acc` – Standard accuracy (the fraction of correct predictions).
* `acc_norm` – Normalized accuracy, which adjusts for multiple-choice imbalance.
* `exact_match` – Strict string-level match, typically used for reasoning or QA tasks.

Compare BF16 and INT4 results on identical tasks to assess the accuracy-efficiency trade-off introduced by quantization.

Practical tips:

* Always use identical tasks, few-shot settings, and seeds across runs to ensure fair comparisons.
* Add `--limit 200` for quick validation runs during tuning; this limits each task to 200 samples for faster iteration (see the sketch after this list).
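
For example, a quick validation pass during tuning might look like the following; the task and model are illustrative, and `--seed` pins the random seeds so repeated runs stay comparable:

```bash
# Fast iteration run: 200 samples per task and a fixed seed for repeatable sampling.
# The task and model are illustrative; use the tasks from your full benchmark.
lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks hellaswag \
  --limit 200 \
  --seed 1234 \
  --batch_size auto
```
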

## Example results for Meta-Llama-3.1-8B-Instruct model

The following results are illustrative and serve as reference points.
Your actual scores may differ based on hardware, dataset version, or lm-eval-harness release.
Use these as ballpark expectations to check whether your runs are in a reasonable range, not as official targets.

How to interpret:

* BF16 baseline – Represents near-FP32 accuracy; serves as your quality reference.
* INT4 Groupwise minmax – Retains almost all performance while reducing model size ~4× and improving throughput substantially.
* INT4 Channelwise MSE – Slightly lower accuracy, often within 2–3 percentage points of BF16, still competitive for most production use cases.

## Next steps

* Broaden accuracy testing to cover reasoning, math, and commonsense tasks that reflect your real-world use cases (see the sketch after this list):
  - GSM8K (`gsm8k`) – Arithmetic and logical reasoning (sensitive to quantization).
  - Winogrande (`winogrande`) – Commonsense and pronoun disambiguation.
  - ARC-Easy / ARC-Challenge (`arc_easy`, `arc_challenge`) – Science and multi-step reasoning questions.

  Running multiple benchmarks gives a more comprehensive picture of model robustness under different workloads.
* Experiment with different quantization configurations (minmax vs MSE, channelwise vs groupwise, and different group sizes) to find the best accuracy-throughput trade-off for your hardware.
* Record both throughput and accuracy to choose the best configuration for your workload.
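
A sketch of a broader sweep over these tasks, using the task IDs noted above; point `pretrained=` at either the BF16 model ID or your INT4 output directory, depending on which configuration you are profiling:

```bash
# Broader accuracy sweep across reasoning, math, and commonsense tasks.
# Replace pretrained= with the BF16 model ID or your INT4 quantized directory.
lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks gsm8k,winogrande,arc_easy,arc_challenge \
  --batch_size auto \
  --output_path results
```
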
By iterating on these steps, you will build a custom performance and accuracy profile for your Arm deployment, helping you select the optimal quantization strategy and runtime configuration for your target workload.