Commit 10b5c72

correct
Signed-off-by: yiliu30 <[email protected]>
1 parent 266e1eb commit 10b5c72

File tree

1 file changed: +12 -12 lines changed

_posts/2025-11-27-intel-autoround-llmc.md

Lines changed: 12 additions & 12 deletions
@@ -19,7 +19,7 @@ Broader quantization schemes and model coverage are coming next—try it now and

 ## What Is AutoRound?

-**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor:`V` (rounding offset/adjustment),`α` and `β` (learned clipping range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error.
+**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor: `V` (rounding offset/adjustment), `α` and `β` (learned clipping range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error.

 Core strengths:

@@ -28,7 +28,7 @@ Core strengths:
 - **Mixed‑bit**, layer‑wise precision search for flexible accuracy–efficiency trade‑offs
 - Applicability across both **LLMs** and **VLMs**

-AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel®** **Xeon****® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices).
+AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices).

 Looking forward, as Intel’s next‑generation GPUs—**including Intel® Crescent Island**—add native support for **FP8, MXFP8, and MXFP4** formats, models optimized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment.

@@ -51,15 +51,15 @@ We completed the first stage of integration by introducing the new `AutoRoundMod

 ### 1. Install

-```Bash
+```bash
 git clone https://github.com/vllm-project/llm-compressor.git
 cd llm-compressor
 pip install -e .
 ```

 ### 2. Load Model & Tokenizer

-```Python
+```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 MODEL_ID = "Qwen/Qwen3-8B"
 model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
@@ -68,7 +68,7 @@ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

 ### 3. Prepare Calibration Data

-```Python
+```python
 from auto_round.calib_dataset import get_dataset
 NUM_CALIBRATION_SAMPLES = 128
 MAX_SEQUENCE_LENGTH = 2048
@@ -81,7 +81,7 @@ ds = get_dataset(tokenizer=tokenizer,

 The AutoRound quantization can run on a variety of devices, including CPUs and GPUs. Quantization and serving may not happen on the same device. For example, you can quantize on a workstation with GPU and later deploy on AIPC.

-```Python
+```python
 from llmcompressor import oneshot
 from llmcompressor.modifiers.autoround import AutoRoundModifier

@@ -112,9 +112,9 @@ In practice, **128 calibration samples + ~200 iterations** often reach stable co

 ### 5. Serve in vLLM

-Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel®** **Arc****™ Pro B60** **GPU**:
+Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel® Arc™ Pro B60 GPU**:

-```Bash
+```bash
 vllm serve Qwen3-8B-W4A16-G128-AutoRound \
   --dtype=bfloat16 \
   --enforce-eager \
@@ -127,9 +127,9 @@ Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull

 ### 6. Evaluate (Example: GSM8K with `lm_eval`)

-```Bash
+```bash
 lm_eval --model vllm \
-  --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=truemax_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
+  --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=true,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
   --tasks gsm8k \
   --num_fewshot 5 \
   --limit 1000 \
@@ -142,9 +142,9 @@ lm_eval --model vllm \

 ## Conclusion & Future Plans

-With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen. The setup is robust, streamlined, and ready for practical deployment.
+With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen are supported. The setup is robust, streamlined, and ready for practical deployment.

-Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor. So AutoRound can be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.
+Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor, which will allow AutoRound to be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.

 If you’d like to influence which formats, models, and workflows we prioritize next, please join the discussion in [RFC #1968](https://github.com/vllm-project/llm-compressor/issues/1968) and share your benchmarks or deployment requirements, or bring your feedback to the Intel Community so we can align the roadmap with real‑world needs.


0 commit comments
