Commit 7d7fc97

[feat]: add vllm accuracy benchmarking flow
Signed-off-by: Nikhil Gupta <[email protected]>
1 parent c7cf14a commit 7d7fc97

2 files changed: +120 -1 lines changed

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: Evaluate Accuracy with LM Evaluation Harness
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Why accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example, MMLU, HellaSwag, GSM8K) and runtimes (Hugging Face, vLLM, llama.cpp, and others). In this module, you will run accuracy tests for both BF16 and INT4 deployments of your model served by vLLM on Arm-based servers.

You will:
* Install lm-eval-harness with vLLM support
* Run benchmarks on a BF16 model and an INT4 (weight-quantized) model
* Interpret key metrics and compare quality across precisions

{{% notice Note %}}
Results depend on CPU, dataset versions, and model choice. Use the same tasks and few-shot settings when comparing BF16 and INT4 to ensure a fair comparison.
{{% /notice %}}

## Prerequisites

Before you start:
* Complete the optimized build in “Overview and Optimized Build” and validate your vLLM install (a quick check follows this list).
* Optionally quantize a model using the “Quantize an LLM to INT4 for Arm Platform” module. The output directory name from that step is referenced later in this module.
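If you want a quick way to confirm the vLLM install before benchmarking, a minimal check is shown below; the version string printed will depend on your build.

```bash
# Minimal sanity check that the optimized vLLM build imports correctly.
python -c "import vllm; print(vllm.__version__)"
```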
## Install lm-eval-harness

Install the harness with vLLM extras in your active Python environment:

```bash
pip install "lm_eval[vllm]"
pip install ray
```

{{% notice Tip %}}
If your benchmarks include gated models or datasets, run `huggingface-cli login` first so the harness can download what it needs.
{{% /notice %}}
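Before launching long benchmarks, you can optionally check that the `lm_eval` CLI is on your PATH and confirm the installed package version. This is a lightweight sketch, not a required step.

```bash
# Confirm the lm_eval CLI is available and show the installed package version.
lm_eval --help | head -n 5
pip show lm_eval
```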
## Recommended runtime settings for Arm CPU

Export the same performance-oriented environment variables used for serving. These enable Arm-optimized kernels through oneDNN+ACL and consistent thread pinning:

```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```

{{% notice Note %}}
`LD_PRELOAD` uses tcmalloc to reduce allocator contention. Install it via `sudo apt-get install -y libtcmalloc-minimal4` if you haven’t already.
{{% /notice %}}
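The tcmalloc library path can differ across distributions and package versions, so it is worth confirming the file exists before exporting `LD_PRELOAD`. A quick check, assuming the Ubuntu package name used above:

```bash
# Verify the tcmalloc shared library is present at the expected path;
# if it is installed elsewhere, point LD_PRELOAD at that location instead.
ls -l /usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4 \
  || dpkg -L libtcmalloc-minimal4 | grep libtcmalloc_minimal
```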
## Benchmark the BF16 Meta‑Llama‑3.1‑8B‑Instruct model

Run with a non-quantized model. Replace the model ID as needed.

```bash
lm_eval \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
    --tasks mmlu,hellaswag \
    --batch_size auto \
    --output_path results
```
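The full MMLU plus HellaSwag run can take a long time on CPU. As a sketch of a quick pipeline check before committing to the full run, you can evaluate a small subset first; the subset size here is arbitrary and the resulting scores are not meaningful for comparison.

```bash
# Optional smoke test: evaluate only 50 samples to confirm the model loads
# and the harness runs end to end (scores will be noisy).
lm_eval \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
    --tasks hellaswag \
    --limit 50 \
    --batch_size auto \
    --output_path results-smoke
```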
## Benchmark the INT4 quantized model

Use the INT4 quantization recipe and script from the previous steps to quantize the `meta-llama/Meta-Llama-3.1-8B-Instruct` model, then point `pretrained=` at the resulting output directory.

Channelwise INT4 (MSE):

```bash
lm_eval \
    --model vllm \
    --model_args pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \
    --tasks mmlu,hellaswag \
    --batch_size auto \
    --output_path results
```
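If you also produced a groupwise minmax INT4 model (as in the example results table below), you can evaluate it the same way. The directory name below is an assumption based on the naming pattern above; replace it with whatever your quantization step actually produced.

```bash
# Groupwise INT4 (minmax, group size 32); the directory name is a hypothetical
# example. Use the output directory from your own quantization run.
lm_eval \
    --model vllm \
    --model_args pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-minmax-groupwise,dtype=float32,max_model_len=4096,enforce_eager=True \
    --tasks mmlu,hellaswag \
    --batch_size auto \
    --output_path results
```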
## Interpreting results

The harness prints per-task and aggregate scores (for example, `acc`, `acc_norm`, `exact_match`). Higher is generally better. Compare BF16 and INT4 on the same tasks to assess the quality impact of quantization.

Practical tips:
* Use the same tasks and few-shot settings across runs (see the example after this list).
* For quick iteration, you can add `--limit 200` to run on a subset.
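As an illustration of keeping settings consistent, the sketch below evaluates both variants with identical tasks and an explicitly pinned few-shot count via `--num_fewshot`; the value of 5 and the model/dtype pairs are examples, not recommendations.

```bash
# Evaluate both variants under identical task and few-shot settings so the
# scores are directly comparable (model/dtype pairs are examples).
for SPEC in \
    "meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16" \
    "Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32"; do
  MODEL="${SPEC%%,*}"
  DTYPE="${SPEC#*,}"
  lm_eval \
      --model vllm \
      --model_args pretrained=${MODEL},${DTYPE},max_model_len=4096,enforce_eager=True \
      --tasks mmlu,hellaswag \
      --num_fewshot 5 \
      --batch_size auto \
      --output_path results
done
```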
## Example results for Meta‑Llama‑3.1‑8B‑Instruct model

These illustrative results are representative; actual scores may vary across hardware, dataset versions, and harness releases. Higher values indicate better accuracy.

| Variant                      | MMLU (acc ± err) | HellaSwag (acc ± err) |
|------------------------------|------------------|-----------------------|
| BF16                         | 0.5897 ± 0.0049  | 0.7916 ± 0.0041       |
| INT4 Groupwise minmax (G=32) | 0.5831 ± 0.0049  | 0.7819 ± 0.0041       |
| INT4 Channelwise MSE         | 0.5712 ± 0.0049  | 0.7633 ± 0.0042       |

Use these as ballpark expectations to check whether your runs are in a reasonable range, not as official targets.
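If you prefer to pull the headline numbers out of the saved files rather than the console output, the harness writes JSON summaries under the directory passed to `--output_path`. The exact file layout and key names can vary between harness versions, so treat the snippet below as a sketch.

```bash
# Locate the JSON summaries and print the per-task results section.
# Assumes a top-level "results" key, which may differ across harness versions.
find results -name "*.json" -print
python -c "
import glob, json
for path in sorted(glob.glob('results/**/*.json', recursive=True)):
    data = json.load(open(path))
    print(path, json.dumps(data.get('results', {}), indent=2))
"
```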
## Next steps

* Try additional tasks that match your use case, such as `gsm8k`, `winogrande`, `arc_easy`, and `arc_challenge` (a gsm8k example follows this list).
* Sweep quantization recipes (minmax vs MSE, channelwise vs groupwise, and group size) to find a better accuracy/performance balance.
* Record both throughput and accuracy to choose the best configuration for your workload.
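For example, a math-reasoning run on `gsm8k` might look like the sketch below. The 5-shot setting is typical for this task and `exact_match` is its headline metric, but check the task defaults in your harness version before comparing against published numbers.

```bash
# Example gsm8k run; exact_match is the headline metric for this task.
lm_eval \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto \
    --output_path results
```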

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md

Lines changed: 7 additions & 1 deletion
@@ -7,14 +7,15 @@ cascade:

minutes_to_complete: 60

-who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, and serve them efficiently through an OpenAI-compatible API.
+who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, serve them efficiently through an OpenAI-compatible API, and benchmark model accuracy using the LM Evaluation Harness.

learning_objectives:
- Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library(ACL).
- Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries.
- Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision.
- Run and serve both quantized and BF16 (non-quantized) variants using vLLM.
- Use OpenAI‑compatible endpoints and understand sequence and batch limits.
+- Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM.

prerequisites:
- An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space.
@@ -32,6 +33,7 @@ operatingsystems:
- Linux
tools_software_languages:
- vLLM
+- LM Evaluation Harness
- LLM
- Generative AI
- Python
@@ -54,6 +56,10 @@ further_reading:
    title: Build and Run vLLM on Arm Servers
    link: /learning-paths/servers-and-cloud-computing/vllm/
    type: website
+  - resource:
+    title: LM Evaluation Harness (GitHub)
+    link: https://github.com/EleutherAI/lm-evaluation-harness
+    type: github