content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md
## Why accuracy benchmarking

The LM Evaluation Harness (lm-eval-harness) is a widely used open-source framework for evaluating the accuracy of large language models on standardized academic benchmarks such as MMLU, HellaSwag, and GSM8K.

It provides a consistent interface for evaluating models served through various runtimes, such as Hugging Face Transformers, vLLM, or llama.cpp, using the same datasets, few-shot templates, and scoring metrics.

In this module, you will measure how quantization impacts model quality by comparing BF16 (non-quantized) and INT4 (quantized) versions of your model running on Arm-based servers.

You will:
* Install lm-eval-harness with vLLM backend support.
* Run benchmark tasks on both BF16 and INT4 model deployments.
* Analyze and interpret accuracy differences between the two precisions.

{{% notice Note %}}
Accuracy results can vary depending on CPU, dataset versions, and model choice. Use the same tasks, few-shot settings, and evaluation batch size when comparing BF16 and INT4 results to ensure a fair comparison.
{{% /notice %}}
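Once you have accuracy numbers from both runs, the comparison itself is simple arithmetic. The sketch below computes the absolute and relative accuracy drop per task; the scores shown are hypothetical placeholders, not measured results, so substitute the values lm-eval reports for your model:

```python
# Hypothetical per-task accuracy scores from two lm-eval runs.
# Replace these with the numbers reported for your own model.
bf16_scores = {"mmlu": 0.58, "hellaswag": 0.76, "gsm8k": 0.41}
int4_scores = {"mmlu": 0.56, "hellaswag": 0.75, "gsm8k": 0.38}

def accuracy_drop(bf16: dict, int4: dict) -> dict:
    """Absolute and relative accuracy change per task (INT4 vs. BF16)."""
    report = {}
    for task, base in bf16.items():
        quant = int4[task]
        report[task] = {
            "abs_drop": round(base - quant, 4),
            "rel_drop_pct": round(100.0 * (base - quant) / base, 2),
        }
    return report

for task, stats in accuracy_drop(bf16_scores, int4_scores).items():
    print(f"{task}: -{stats['abs_drop']} absolute ({stats['rel_drop_pct']}% relative)")
```

A small relative drop (a few percent) is typical for well-tuned INT4 weight quantization; a large drop on a specific task is a signal to revisit the quantization recipe for that model.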
## Prerequisites

Before you begin, make sure your environment is ready for evaluation.

You should have:

* Completed the optimized build from the “Overview and Optimized Build” section and successfully validated your vLLM installation.
* (Optional) Quantized a model using the “Quantize an LLM to INT4 for Arm Platform” module.

The quantized model directory (for example, DeepSeek-V2-Lite-w4a8dyn-mse-channelwise) will be used as input for INT4 evaluation.

If you haven’t quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy.
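As a quick sanity check, you can select which model to evaluate based on whether the quantized output directory from the previous module exists, falling back to the BF16 baseline otherwise. This is a minimal sketch; the BF16 model identifier shown is an illustrative assumption, so substitute the model you actually used:

```shell
# Directory name produced by the INT4 quantization module.
QUANT_DIR="DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"

if [ -d "$QUANT_DIR" ]; then
  # Quantized weights are present: evaluate the INT4 model.
  MODEL="$QUANT_DIR"
else
  # No quantized model yet: fall back to the BF16 baseline
  # (model id is an illustrative assumption).
  MODEL="deepseek-ai/DeepSeek-V2-Lite"
fi

echo "Evaluating: $MODEL"
```

Either way, running the BF16 baseline first gives you the reference accuracy against which the INT4 results are judged.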