---
title: Quantize an LLM to INT4
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Accelerate LLMs with 4-bit quantization

You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the deepseek-ai/DeepSeek-V2-Lite model to 4-bit integer (INT4) weights.
The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.

If the model you plan to quantize is gated on Hugging Face (e.g., DeepSeek), log in with the Hugging Face CLI first:

```bash
huggingface-cli login
```
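If you prefer to authenticate from Python rather than the CLI (for example, inside a setup script), the `huggingface_hub` package offers an equivalent login call. This is an optional alternative, not a required step, and the token below is only a placeholder:

```python
# Optional alternative to `huggingface-cli login`: authenticate from Python.
# Replace the placeholder with your own Hugging Face access token, or call
# login() with no arguments to be prompted interactively.
from huggingface_hub import login

login(token="hf_your_token_here")
```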

## Apply the INT4 quantization recipe

Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:

This script creates an Arm KleidiAI INT4 quantized copy of the vLLM model and saves it to disk.

## Quantize DeepSeek‑V2‑Lite model

Quantizing your model to INT4 format significantly reduces memory usage and improves inference speed on Arm CPUs. In this section, you'll apply the quantization script to the DeepSeek‑V2‑Lite model, tuning key parameters for optimal performance and accuracy. This process prepares your model for efficient deployment with vLLM on Arm-based servers.
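
To get a rough sense of the savings, the back-of-the-envelope estimate below compares 16-bit and 4-bit weight storage. The parameter count and the per-group scale overhead are illustrative assumptions (based on DeepSeek-V2-Lite's published size of roughly 15.7B total parameters), not measurements of the quantized checkpoint:

```python
# Rough, illustrative estimate of weight-memory savings from INT4 quantization.
# Assumptions: ~15.7B total parameters, every weight quantized, one BF16 scale
# stored per group of 32 weights. Real checkpoints will differ somewhat.
params = 15.7e9
bf16_bytes = params * 2          # 16-bit weights: 2 bytes per parameter
int4_bytes = params * 0.5        # 4-bit weights: 0.5 bytes per parameter
scale_bytes = (params / 32) * 2  # per-group BF16 scales for --groupsize 32

print(f"BF16 weights: {bf16_bytes / 1e9:.1f} GB")
print(f"INT4 weights: {(int4_bytes + scale_bytes) / 1e9:.1f} GB")
```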

## Tune quantization parameters

Quantization parameters control how the model’s floating-point weights and activations are converted to lower-precision integer formats. The right settings help you balance accuracy, memory usage, and performance on Arm CPUs:

- Use `minmax` for faster quantization, or `mse` for higher accuracy (but slower)
- Choose `channelwise` for most models; it’s a reliable default
- Try `groupwise` for potentially better accuracy; `--groupsize 32` is a common choice

Pick the combination that fits your accuracy and speed needs. The short sketch below illustrates how channelwise and groupwise scaling differ.

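To build intuition for the `channelwise` and `groupwise` options, the sketch below computes symmetric max-abs scales (a common minmax variant) both ways for a random weight matrix. It is a standalone illustration, not part of `quantize_vllm_models.py`; groupwise scaling tracks local weight ranges more closely, which is why it can recover accuracy on layers with outlier weights.

```python
# Standalone illustration of channelwise vs. groupwise scale computation for
# symmetric INT4 quantization (integer range [-8, 7], scales map max|w| to 7).
# Not taken from quantize_vllm_models.py.
import numpy as np

def channelwise_scales(weights: np.ndarray) -> np.ndarray:
    """One scale per output channel (row of the weight matrix)."""
    return np.abs(weights).max(axis=1) / 7.0

def groupwise_scales(weights: np.ndarray, group_size: int = 32) -> np.ndarray:
    """One scale per group of `group_size` consecutive weights within each row."""
    rows, cols = weights.shape
    grouped = weights.reshape(rows, cols // group_size, group_size)
    return np.abs(grouped).max(axis=2) / 7.0

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 128)).astype(np.float32)

print("channelwise scales:", channelwise_scales(w).shape)     # one scale per row: (4,)
print("groupwise scales:  ", groupwise_scales(w, 32).shape)   # one scale per 32-weight group: (4, 4)
```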
Execute the following command to quantize the DeepSeek-V2-Lite model: