Commit 01b3d02 (1 parent: a620450)

Update 4-accuracy-benchmarking.md

File tree: 1 file changed, 13 additions (+), 8 deletions (-)


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md

```diff
@@ -8,22 +8,27 @@ layout: learningpathall
 
 ## Why accuracy benchmarking
 
-The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example, MMLU, HellaSwag, GSM8K) and runtimes (Hugging Face, vLLM, llama.cpp, etc.). In this module, you will run accuracy tests for both BF16 and INT4 deployments of your model served by vLLM on Arm-based servers.
+The LM Evaluation Harness (lm-eval-harness) is a widely used open-source framework for evaluating the accuracy of large language models on standardized academic benchmarks such as MMLU, HellaSwag, and GSM8K.
+It provides a consistent interface for evaluating models served through various runtimes (such as Hugging Face Transformers, vLLM, or llama.cpp) using the same datasets, few-shot templates, and scoring metrics.
+In this module, you will measure how quantization impacts model quality by comparing BF16 (non-quantized) and INT4 (quantized) versions of your model running on Arm-based servers.
 
 You will:
-* Install lm-eval-harness with vLLM support
-* Run benchmarks on a BF16 model and an INT4 (weight-quantized) model
-* Interpret key metrics and compare quality across precisions
+* Install lm-eval-harness with vLLM backend support.
+* Run benchmark tasks on both BF16 and INT4 model deployments.
+* Analyze and interpret accuracy differences between the two precisions.
 
 {{% notice Note %}}
-Results depend on CPU, dataset versions, and model choice. Use the same tasks and few-shot settings when comparing BF16 and INT4 to ensure a fair comparison.
+Accuracy results can vary depending on CPU, dataset versions, and model choice. Use the same tasks, few-shot settings, and evaluation batch size when comparing BF16 and INT4 results to ensure a fair comparison.
 {{% /notice %}}
 
 ## Prerequisites
 
-Before you start:
-* Complete the optimized build in “Overview and Optimized Build” and validate your vLLM install.
-* Optionally quantize a model using the “Quantize an LLM to INT4 for Arm Platform” module. We’ll reference the output directory name from that step.
+Before you begin, make sure your environment is ready for evaluation.
+You should have:
+* Completed the optimized build from the “Overview and Optimized Build” section and successfully validated your vLLM installation.
+* (Optional) Quantized a model using the “Quantize an LLM to INT4 for Arm Platform” module.
+The quantized model directory (for example, DeepSeek-V2-Lite-w4a8dyn-mse-channelwise) will be used as input for INT4 evaluation.
+If you haven’t quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy.
 
 ## Install lm-eval-harness
 
```
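Once both evaluation runs finish, the BF16-vs-INT4 comparison described in the updated text can be sketched as a small script. The helper and the per-task scores below are purely illustrative placeholders, not measured results; substitute the accuracy values that lm_eval reports for your own model and tasks.

```python
# Sketch: comparing lm-eval accuracy between BF16 and INT4 runs.
# All scores here are hypothetical placeholders for illustration only.

def accuracy_delta(bf16_score: float, int4_score: float) -> float:
    """Relative accuracy drop (percent) going from BF16 to INT4."""
    return (bf16_score - int4_score) / bf16_score * 100.0

# Hypothetical per-task accuracies (0-1) from two lm_eval runs.
bf16 = {"mmlu": 0.58, "hellaswag": 0.74, "gsm8k": 0.41}
int4 = {"mmlu": 0.56, "hellaswag": 0.73, "gsm8k": 0.38}

for task in bf16:
    drop = accuracy_delta(bf16[task], int4[task])
    print(f"{task}: BF16={bf16[task]:.2f} INT4={int4[task]:.2f} drop={drop:.1f}%")
```

A small relative drop (a few percent) is typical for weight-only INT4 quantization; a large drop on a specific task is a signal to revisit the quantization recipe for that model.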
