Commit 617b726

Revise quantization guide for LLMs on Arm platform
Updated the title and section headings for clarity. Revised the quantization parameter tuning section for better guidance on model optimization.
Parent: 927b1f5


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md

Lines changed: 12 additions & 8 deletions
@@ -1,11 +1,11 @@
 ---
-title: Quantize an LLM to INT4 for Arm Platform
+title: Quantize an LLM to INT4
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Accelerating LLMs with 4-bit Quantization
+## Accelerate LLMs with 4-bit quantization
 
 You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the deepseek-ai/DeepSeek-V2-Lite model to 4-bit integer (INT4) weights.
 The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
@@ -35,7 +35,7 @@ If the model you plan to quantize is gated on Hugging Face (e.g., DeepSeek or pr
 huggingface-cli login
 ```
 
-## INT4 Quantization Recipe
+## Apply the INT4 quantization recipe
 
 Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:
 
@@ -134,12 +134,16 @@ This script creates a Arm KleidiAI INT4 quantized copy of the vLLM model and sav
 
 ## Quantize DeepSeek‑V2‑Lite model
 
-### Quantization parameter tuning
-Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs.
+Quantizing your model to INT4 format significantly reduces memory usage and improves inference speed on Arm CPUs. In this section, you'll apply the quantization script to the DeepSeek‑V2‑Lite model, tuning key parameters for optimal performance and accuracy. This process prepares your model for efficient deployment with vLLM on Arm-based servers.
 
-1. You can choose `minmax` (faster model quantization) or `mse` (more accurate but slower model quantization) method.
-2. `channelwise` is a good default for most models.
-3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
+## Tune quantization parameters
+Quantization parameters control how the model’s floating-point weights and activations are converted to lower-precision integer formats. The right settings help you balance accuracy, memory usage, and performance on Arm CPUs.
+
+- Use `minmax` for faster quantization, or `mse` for higher accuracy (but slower)
+- Choose `channelwise` for most models; it’s a reliable default
+- Try `groupwise` for potentially better accuracy; `--groupsize 32` is a common choice
+
+Pick the combination that fits your accuracy and speed needs.
 
 Execute the following command to quantize the DeepSeek-V2-Lite model:
 
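The revised text names the tuning knobs (`minmax`/`mse`, `channelwise`/`groupwise`, `--groupsize 32`) without showing their numeric effect. The sketch below is illustrative only; it is independent of the commit and of `quantize_vllm_models.py`, and every function and variable name is invented for the example. It applies symmetric INT4 quantization to a toy weight matrix so you can compare `minmax` against `mse` scale selection and `channelwise` against `groupwise` granularity directly:

```python
# Illustrative sketch of the quantization choices described above.
# Not part of the commit; names here are made up for the example.
import numpy as np

QMAX = 7  # symmetric signed INT4 uses integer codes in [-8, 7]


def minmax_scale(block):
    """Fast scale choice: the largest-magnitude weight maps to +/-QMAX."""
    return max(float(np.abs(block).max()) / QMAX, 1e-12)


def mse_scale(block, steps=20):
    """Slower scale choice: shrink the minmax scale and keep the candidate
    that minimizes INT4 reconstruction error."""
    base = minmax_scale(block)
    best_scale, best_err = base, float("inf")
    for ratio in np.linspace(1.0, 0.5, steps):
        s = base * ratio
        q = np.clip(np.round(block / s), -8, QMAX)
        err = float(np.mean((block - q * s) ** 2))
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale


def quantize_int4(w, method="minmax", scheme="channelwise", groupsize=32):
    """Quantize a [out_channels, in_features] weight matrix to symmetric INT4.

    channelwise: one scale per output channel (row).
    groupwise:   one scale per `groupsize` consecutive inputs (more scales,
                 usually better accuracy, slightly more metadata to store).
    Returns the INT4 codes, the scales, and a dequantized copy for error checks.
    """
    pick = minmax_scale if method == "minmax" else mse_scale
    codes = np.empty_like(w, dtype=np.int8)
    deq = np.empty_like(w)
    scales = []
    step = w.shape[1] if scheme == "channelwise" else groupsize
    for r in range(w.shape[0]):
        for c in range(0, w.shape[1], step):
            block = w[r, c:c + step]
            s = pick(block)
            q = np.clip(np.round(block / s), -8, QMAX)
            codes[r, c:c + step] = q
            deq[r, c:c + step] = q * s
            scales.append(s)
    return codes, np.array(scales), deq


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(8, 128)).astype(np.float32)  # toy weight matrix
    for method in ("minmax", "mse"):
        for scheme in ("channelwise", "groupwise"):
            _, scales, deq = quantize_int4(w, method, scheme, groupsize=32)
            err = float(np.mean((w - deq) ** 2))
            print(f"{method:>6} / {scheme:<11} scales={scales.size:3d}  recon-mse={err:.3e}")
```

Running it shows the trade-off the bullets describe: `mse` and `groupwise` settings lower reconstruction error at the cost of more compute during quantization and more scale metadata, while `minmax` with `channelwise` is the fastest and leanest option.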
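As a pointer toward what the quantized checkpoint is for, here is a minimal vLLM offline-inference sketch. It is not part of the commit: the model path is a placeholder for whatever output directory `quantize_vllm_models.py` writes, and any Arm/KleidiAI-specific serving settings given by the learning path itself take precedence over this example.

```python
# Minimal vLLM offline-inference sketch (illustrative only, not from the commit).
# "./DeepSeek-V2-Lite-w4int4" is a placeholder for the directory the
# quantization script actually writes on your machine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./DeepSeek-V2-Lite-w4int4",  # placeholder path to the INT4 checkpoint
    trust_remote_code=True,             # DeepSeek-V2 models use custom modeling code
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain 4-bit weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```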