Commit 9422f94

Update 4-accuracy-benchmarking.md
1 parent 01b3d02 commit 9422f94

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md

Lines changed: 60 additions & 18 deletions
@@ -30,22 +30,26 @@ You should have:
The quantized model directory (for example, DeepSeek-V2-Lite-w4a8dyn-mse-channelwise) will be used as input for INT4 evaluation.
If you haven’t quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy.

-## Install lm-eval-harness
+## Install LM Evaluation Harness

-Install the harness with vLLM extras in your active Python environment:
+You will install the LM Evaluation Harness with vLLM backend support, so the harness can run evaluations directly through vLLM.
+
+Install it inside your active Python environment:

```bash
pip install "lm_eval[vllm]"
pip install ray
```
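
As an optional sanity check, the following sketch (assuming both packages installed into the currently active environment) confirms that the harness and Ray import cleanly before you start a long evaluation run:

```bash
# Optional: confirm the harness and Ray are importable in the active environment
python -c "import lm_eval, ray; print('lm_eval and ray are available')"

# The install also provides the lm_eval CLI entry point used in the commands below
lm_eval --help | head -n 5
```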

{{% notice Tip %}}
-If your benchmarks include gated models or datasets, run `huggingface-cli login` first so the harness can download what it needs.
+If your benchmarks include gated models or restricted datasets, run `huggingface-cli login` first.
+This ensures the harness can authenticate with Hugging Face and download any protected resources needed for evaluation.
{{% /notice %}}
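
If you prefer a non-interactive setup, for example in a scripted run, a minimal sketch (the `HF_TOKEN` value is a placeholder for your own Hugging Face access token):

```bash
# Log in without the interactive prompt; the token value is a placeholder
export HF_TOKEN="<your-hf-access-token>"
huggingface-cli login --token "$HF_TOKEN"
```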

-## Recommended runtime settings for Arm CPU
+## Recommended Runtime Settings for Arm CPU

-Export the same performance-oriented environment variables used for serving. These enable Arm-optimized kernels through oneDNN+ACL and consistent thread pinning:
+Before running accuracy benchmarks, export the same performance-tuned environment variables you used for serving.
+These settings ensure vLLM runs with Arm-optimized kernels (via oneDNN + Arm Compute Library) and consistent thread affinity across all CPU cores during evaluation.

```bash
export VLLM_TARGET_DEVICE=cpu
@@ -57,13 +61,28 @@ export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```

+Explanation of settings:
+
+| Variable | Purpose |
+| --- | --- |
+| **`VLLM_TARGET_DEVICE=cpu`** | Forces vLLM to run entirely on CPU, ensuring evaluation results use Arm-optimized oneDNN kernels. |
+| **`VLLM_CPU_KVCACHE_SPACE=32`** | Reserves 32 GB for the key/value cache used by attention. Adjust if evaluating with longer contexts or larger batches. |
+| **`VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"`** | Pins OpenMP worker threads to cores 0 through N-1 to minimize OS thread migration and improve cache locality. |
+| **`VLLM_MLA_DISABLE=1`** | Disables the Multi-head Latent Attention (MLA) path so models that support it fall back to standard attention, which is currently the better-optimized option on CPU. |
+| **`ONEDNN_DEFAULT_FPMATH_MODE=BF16`** | Enables **bfloat16** math mode, using reduced-precision operations for faster compute while maintaining numerical stability. |
+| **`OMP_NUM_THREADS="$(nproc)"`** | Uses all available CPU cores to parallelize matrix multiplications and attention layers. |
+| **`LD_PRELOAD`** | Preloads **tcmalloc** (Thread-Caching Malloc) to reduce memory allocator contention under high concurrency. |
+
{{% notice Note %}}
-`LD_PRELOAD` uses tcmalloc to reduce allocator contention. Install it via `sudo apt-get install -y libtcmalloc-minimal4` if you haven’t already.
+tcmalloc helps reduce allocator overhead when running multiple evaluation tasks in parallel.
+If it’s not installed, add it with `sudo apt-get install -y libtcmalloc-minimal4`.
{{% /notice %}}
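
To keep the evaluation shell consistent with the serving shell, one option is to store the exports in a small file and source it in both sessions; a minimal sketch (the file name `arm-vllm-env.sh` is just an example):

```bash
# Write the runtime settings once, then source them before serving and before evaluating
cat > arm-vllm-env.sh <<'EOF'
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
EOF

# The quoted heredoc keeps $(nproc) unexpanded, so it is evaluated when the file is sourced
source arm-vllm-env.sh
```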

-## Accuracy Benchmarking Meta‑Llama‑3.1‑8B‑Instruct BF16 model
+## Accuracy Benchmarking Meta‑Llama‑3.1‑8B‑Instruct (BF16 Model)

-Run with a non-quantized model. Replace the model ID as needed.
+To establish a baseline accuracy reference, evaluate a non-quantized BF16 model served through vLLM.
+This run measures how the original model performs under Arm-optimized BF16 inference before applying INT4 quantization.
+Replace the model ID if you are using a different model variant or checkpoint.

```bash
lm_eval \
@@ -74,12 +93,16 @@ lm_eval \
--batch_size auto \
--output_path results
```
+After completing this test, review the results directory for accuracy metrics (for example, `acc` and `acc_norm`) and record them as your BF16 baseline.
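
To pull the headline numbers out of the harness output, a minimal sketch (assuming a recent lm-eval release that writes a `results_*.json` file somewhere under the `--output_path` directory; the exact layout varies between releases):

```bash
# Find the newest results file under results/ and print its per-task metrics
# (file layout differs across lm-eval versions; adjust the path if needed)
RESULTS_FILE=$(find results -name 'results_*.json' | sort | tail -n 1)

python - "$RESULTS_FILE" <<'EOF'
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)

# The top-level "results" mapping holds one entry per task
for task, metrics in data.get("results", {}).items():
    print(task, metrics)
EOF
```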

-## Accuracy Benchmarking INT4 quantized model
+Next, you’ll run the same benchmarks on the INT4 quantized model to compare accuracy across precisions.

-Use the INT4 quantization recipe & script from previous steps to quantize `meta-llama/Meta-Llama-3.1-8B-Instruct` model
+## Accuracy Benchmarking: INT4 quantized model

-Channelwise INT4 (MSE):
+Now that you’ve quantized your model using the INT4 recipe and script from the previous module, you can benchmark its accuracy using the same evaluation harness and task set.
+This test compares quantized (INT4) performance against your BF16 baseline, revealing how much accuracy is preserved after compression.
+Use the quantized model directory generated earlier, for example `Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise`.

```bash
lm_eval \
@@ -90,29 +113,48 @@ lm_eval \
--batch_size auto \
--output_path results
```
+After this evaluation, compare the metrics from both runs against your BF16 baseline.
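
To put the two runs side by side, a sketch along the same lines (the `bf16.json` and `int4.json` names are placeholders for the results files produced by the BF16 and INT4 runs; the `acc,none`-style metric keys follow recent lm-eval releases and may differ in older ones):

```bash
# Print per-task accuracy for both runs and the INT4-minus-BF16 delta
python - bf16.json int4.json <<'EOF'
import json, sys

def load(path):
    with open(path) as f:
        return json.load(f)["results"]

bf16, int4 = load(sys.argv[1]), load(sys.argv[2])
for task in sorted(set(bf16) & set(int4)):
    for metric in ("acc,none", "acc_norm,none"):
        if metric in bf16[task] and metric in int4[task]:
            b, q = bf16[task][metric], int4[task][metric]
            print(f"{task:25s} {metric:15s} BF16={b:.4f} INT4={q:.4f} delta={q - b:+.4f}")
EOF
```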

## Interpreting results

-The harness prints per-task and aggregate scores (for example, `acc`, `acc_norm`, `exact_match`). Higher is generally better. Compare BF16 vs INT4 on the same tasks to assess quality impact.
+After running evaluations, the LM Evaluation Harness prints per-task and aggregate metrics such as `acc`, `acc_norm`, and `exact_match`.
+These represent model accuracy across various datasets and question formats; higher values indicate better performance.
+Key metrics include:
+* `acc` – Standard accuracy (fraction of correct predictions).
+* `acc_norm` – Normalized accuracy; adjusts for multiple-choice imbalance.
+* `exact_match` – Strict string-level match, typically used for reasoning or QA tasks.

+Compare BF16 and INT4 results on identical tasks to assess the accuracy–efficiency trade-off introduced by quantization.
Practical tips:
-* Use the same tasks and few-shot settings across runs.
-* For quick iteration, you can add `--limit 200` to run on a subset.
+* Always use identical tasks, few-shot settings, and seeds across runs to ensure fair comparisons.
+* Add `--limit 200` for quick validation runs during tuning. This limits each task to 200 samples for faster iteration (see the sketch after this list).
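
For example, a quick smoke-test invocation might look like the sketch below (the model path and task list are placeholders, and the flags reflect a recent lm-eval release; keeping every flag identical between the BF16 and INT4 runs is what makes the comparison fair):

```bash
# Quick validation run: fixed seed and a 200-sample cap per task for fast iteration
# <model-or-quantized-dir> is a placeholder for the BF16 model ID or the INT4 directory
lm_eval \
  --model vllm \
  --model_args pretrained=<model-or-quantized-dir> \
  --tasks mmlu,hellaswag \
  --batch_size auto \
  --limit 200 \
  --seed 1234 \
  --output_path results
```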

## Example results for Meta‑Llama‑3.1‑8B‑Instruct model

-These illustrative results are representative; actual scores may vary across hardware, dataset versions, and harness releases. Higher values indicate better accuracy.
+The following results are illustrative and serve as reference points.
+Your actual scores may differ based on hardware, dataset version, or lm-eval-harness release.

| Variant | MMLU (acc±err) | HellaSwag (acc±err) |
|---------------------------------|-------------------|---------------------|
| BF16 | 0.5897 ± 0.0049 | 0.7916 ± 0.0041 |
| INT4 Groupwise minmax (G=32) | 0.5831 ± 0.0049 | 0.7819 ± 0.0041 |
| INT4 Channelwise MSE | 0.5712 ± 0.0049 | 0.7633 ± 0.0042 |

-Use these as ballpark expectations to check whether your runs are in a reasonable range, not as official targets.
+How to interpret:
+
+* BF16 baseline – Represents near-FP32 accuracy; serves as your quality reference.
+* INT4 Groupwise minmax – Retains almost all of the BF16 accuracy while reducing model size ~4× and improving throughput substantially.
+* INT4 Channelwise MSE – Slightly lower accuracy, often within 2–3 percentage points of BF16 (see the check after this list); still competitive for most production use cases.
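
As a quick check of that claim against the table above, the MMLU gap between BF16 and INT4 Channelwise MSE works out to under two percentage points:

```bash
# Difference between the BF16 and INT4 Channelwise MSE MMLU scores from the table,
# expressed in percentage points
python -c "print(f'{(0.5897 - 0.5712) * 100:.2f} points')"   # prints 1.85 points
```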

## Next steps

-* Try additional tasks to match your usecase: `gsm8k`, `winogrande`, `arc_easy`, `arc_challenge`.
-* Sweep quantization recipes (minmax vs mse; channelwise vs groupwise, group size) to find a better accuracy/performance balance.
+* Broaden accuracy testing to cover reasoning, math, and commonsense tasks that reflect your real-world use cases (see the sketch after this list):
+  * GSM8K (`gsm8k`) – Arithmetic and logical reasoning (sensitive to quantization).
+  * Winogrande (`winogrande`) – Commonsense and pronoun disambiguation.
+  * ARC-Easy / ARC-Challenge (`arc_easy`, `arc_challenge`) – Science and multi-step reasoning questions.
+  Running multiple benchmarks gives a more comprehensive picture of model robustness under different workloads.
+
+* Experiment with different quantization configurations to find the best accuracy–throughput trade-off for your hardware.
* Record both throughput and accuracy to choose the best configuration for your workload.
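
A sketch of how those additional tasks might be run in a single pass (the model path is a placeholder, and the backend flags assume the same vLLM invocation used in the earlier commands; the task IDs match the harness names listed above):

```bash
# Evaluate reasoning, math, and commonsense tasks in one invocation
# <model-or-quantized-dir> is a placeholder for the BF16 model ID or the INT4 directory
lm_eval \
  --model vllm \
  --model_args pretrained=<model-or-quantized-dir> \
  --tasks gsm8k,winogrande,arc_easy,arc_challenge \
  --batch_size auto \
  --output_path results
```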
+
+By iterating on these steps, you will build a custom performance and accuracy profile for your Arm deployment, helping you select the optimal quantization strategy and runtime configuration for your target workload.
