---
layout: post
title: "Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor [Draft]"
author: "Intel Neural Compressor Team"
image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png
---

## TL;DR

We’re excited to announce that **[AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf)**—Intel’s state‑of‑the‑art tuning‑based post‑training quantization (PTQ) algorithm—is now integrated into **[LLM Compressor](https://github.com/vllm-project/llm-compressor)**. This collaboration delivers:

- Higher accuracy for low bit-width quantization
- Lightweight tuning (hundreds of steps, not thousands)
- Zero additional inference overhead
- Seamless compatibility with `compressed-tensors` and direct serving in [vLLM](https://github.com/vllm-project/vllm)

Broader quantization schemes and model coverage are coming next—try it now and help shape what we build.

## What Is AutoRound?

**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor: `V` (a rounding offset/adjustment), plus `α` and `β` (learned clipping-range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error.
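
To make the roles of these parameters concrete, here is a minimal, illustrative sketch of a uniform asymmetric quantizer with AutoRound-style knobs. It is not the actual `intel/auto-round` implementation (which works per weight group, tunes the parameters with signed gradient descent, and handles many more details); the function name and defaults below are ours:

```Python
import torch

def quantize_sketch(w, bits=4, V=None, alpha=1.0, beta=1.0):
    """Toy asymmetric weight quantizer with AutoRound-style tunable knobs."""
    qmax = 2 ** bits - 1
    w_max = w.max() * alpha                       # alpha adjusts the upper clipping bound
    w_min = w.min() * beta                        # beta adjusts the lower clipping bound
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    if V is None:                                 # V is the learned rounding offset (roughly in [-0.5, 0.5])
        V = torch.zeros_like(w)
    q = torch.clamp(torch.round(w / scale + V) + zero_point, 0, qmax)
    return (q - zero_point) * scale               # dequantized weights used in the block-wise loss
```

AutoRound tunes `V`, `α`, and `β` so that each decoder block's output, computed with the dequantized weights, stays close to the original block's output.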

Core strengths:

- **Superior accuracy**, especially at very low bit‑widths
- **Support for multiple data types:** W4A16, MXFP8, MXFP4, FP8, NVFP4, with more on the way
- **Mixed‑bit**, layer‑wise precision search for flexible accuracy–efficiency trade‑offs
- Applicability across both **LLMs** and **VLMs**

AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, and **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices).

Looking forward, as Intel’s next‑generation GPUs—**including Intel® Crescent Island**—add native support for **FP8, MXFP8, and MXFP4** formats, models optimized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment.

For more details, please refer to the paper [AutoRound (EMNLP 2024)](https://aclanthology.org/2024.findings-emnlp.662.pdf) and the GitHub repository [intel/auto-round](https://github.com/intel/auto-round).

## Why Integrate Into LLM Compressor?

**LLM Compressor** already provides a unified, modular system for compression primitives such as quantization, pruning, and distillation. Integrating AutoRound into this ecosystem:

- Aligns with the existing modifier architecture (e.g., `GPTQModifier`)
- Reuses the sequential calibration and layer‑onloading infrastructure
- Enables future interoperability with richer multi‑modifier recipes
- Produces quantized models that are ready for vLLM serving, enabling a clean workflow from compression to deployment

## Integration Overview

We completed the first stage of integration by introducing the new `AutoRoundModifier` into LLM Compressor, enabling production of `wNa16` (e.g., W4A16) compressed models that seamlessly load in vLLM, as implemented in [PR #1994](https://github.com/vllm-project/llm-compressor/pull/1994). With a straightforward configuration—just specify your model and calibration data—you can quickly generate high‑quality low‑bit checkpoints. This initial stage supports quantizing a range of dense LLMs, including the **Llama** and **Qwen** model families, and demonstrates robust compatibility for practical deployment.

## Try It Now (Quickstart)

### 1. Install

```Bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

### 2. Load Model & Tokenizer

```Python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 3. Prepare Calibration Data

```Python
from auto_round.calib_dataset import get_dataset

NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048

ds = get_dataset(tokenizer=tokenizer,
                 seqlen=MAX_SEQUENCE_LENGTH,
                 nsamples=NUM_CALIBRATION_SAMPLES)
```

### 4. Run Quantization using AutoRound

AutoRound quantization can run on a variety of devices, including CPUs and GPUs, and quantization and serving do not need to happen on the same device. For example, you can quantize on a GPU workstation and later deploy the result on an AI PC.

```Python
from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier

# Quantize every Linear layer to W4A16, leaving the LM head unquantized.
recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    iters=200,
    enable_torch_compile=False,
    batch_size=2,
)

# Apply the recipe using the calibration set prepared above.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    shuffle_calibration_samples=False,
)

# Save the compressed checkpoint in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16-G128-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

In practice, **128 calibration samples + ~200 iterations** often reach stable convergence. Increase the number of samples or iterations if you are targeting extremely low bit-widths or need tighter accuracy.
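
As an illustration (the specific numbers below are placeholders, not tuned recommendations), a higher-effort run might simply scale up both knobs relative to the quickstart above:

```Python
# Illustrative only: more calibration data and more tuning iterations.
ds = get_dataset(tokenizer=tokenizer,
                 seqlen=MAX_SEQUENCE_LENGTH,
                 nsamples=512)

recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    iters=1000,
    enable_torch_compile=False,
    batch_size=2,
)
```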

### 5. Serve in vLLM

Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel® Arc™ Pro B60 GPU**:

```Bash
vllm serve Qwen3-8B-W4A16-G128-AutoRound \
    --dtype=bfloat16 \
    --enforce-eager \
    --gpu-memory-utilization=0.8 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens=8192
```
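
Once the server is up (vLLM's OpenAI-compatible server listens on port 8000 by default; see the vLLM version note below), a quick request is enough to confirm that the compressed checkpoint loads and generates. The prompt here is just an example:

```Bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-8B-W4A16-G128-AutoRound",
          "prompt": "Briefly explain 4-bit weight-only quantization:",
          "max_tokens": 64
        }'
```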

Note: please install vLLM from [PR #29484](https://github.com/vllm-project/vllm/pull/29484), which is still under review.
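
If you build vLLM from source, one way to pick up that PR before it merges is GitHub's pull-request refs (a sketch; the local branch name is arbitrary, and the build itself follows vLLM's usual from-source instructions for your hardware):

```Bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/29484/head:autoround-serving   # fetch the PR head into a local branch
git checkout autoround-serving
pip install -e .                                      # standard from-source install
```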

### 6. Evaluate (Example: GSM8K with `lm_eval`)

```Bash
lm_eval --model vllm \
    --model_args "pretrained=./Qwen3-8B-W4A16-G128-AutoRound,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 1000 \
    --batch_size 'auto'
```

|Tasks|Version|Filter          |n-shot|Metric     |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.908|±  |0.0091|
|     |       |strict-match    |     5|exact_match|   |0.907|±  |0.0092|
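
For reference, the same harness invocation can be pointed at the original BF16 checkpoint to obtain a baseline score to compare against (baseline numbers are not reproduced here):

```Bash
lm_eval --model vllm \
    --model_args "pretrained=Qwen/Qwen3-8B,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 1000 \
    --batch_size 'auto'
```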

## Conclusion & Future Plans

With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen are covered. The setup is robust, streamlined, and ready for practical deployment.

Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor, so that AutoRound can be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.

If you’d like to influence which formats, models, and workflows we prioritize next, please join the discussion in [RFC #1968](https://github.com/vllm-project/llm-compressor/issues/1968) and share your benchmarks or deployment requirements, or bring your feedback to the Intel Community so we can align the roadmap with real‑world needs.

### Acknowledgements

We’d like to thank the **vLLM / LLM Compressor** community for extensive early discussions on the proposal and for their thoughtful reviews of the pull requests.

#### Related RFCs and PRs

RFC: https://github.com/vllm-project/llm-compressor/issues/1968

PRs:

- https://github.com/vllm-project/llm-compressor/pull/1994
- https://github.com/vllm-project/llm-compressor/pull/2055
- https://github.com/vllm-project/llm-compressor/pull/2062 (Under Review)
- https://github.com/vllm-project/vllm/pull/29484/ (Under Review)
