From 266e1eb4f25f974e84b032850f446fa16a172256 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 27 Nov 2025 01:32:03 +0000 Subject: [PATCH 01/17] add draft Signed-off-by: yiliu30 --- _posts/2025-11-27-intel-autoround-llmc.md | 164 ++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100755 _posts/2025-11-27-intel-autoround-llmc.md diff --git a/_posts/2025-11-27-intel-autoround-llmc.md b/_posts/2025-11-27-intel-autoround-llmc.md new file mode 100755 index 0000000..58f06e9 --- /dev/null +++ b/_posts/2025-11-27-intel-autoround-llmc.md @@ -0,0 +1,164 @@ +--- +layout: post +title: "Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor [Draft]" +author: "Intel Neural Compressor Team" +image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png +--- + + +## TL;DR + +We’re excited to announce that **[AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf)**—Intel’s state‑of‑the‑art tuning‑based post‑training quantization (PTQ) algorithm—is now integrated into **[LLM Compressor](https://github.com/vllm-project/llm-compressor)**. This collaboration delivers: + +- Higher accuracy for low bit-width quantization +- Lightweight tuning (hundreds of steps, not thousands) +- Zero additional inference overhead +- Seamless compatibility with `compressed-tensors` and direct serving in [vLLM](https://github.com/vllm-project/vllm) + +Broader quantization schemes and model coverage are coming next—try it now and help shape what we build. + +## What Is AutoRound? + +**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor:`V` (rounding offset/adjustment),`α` and `β` (learned clipping range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error. + +Core strengths: + +- **Superior accuracy**, especially at very low bit‑widths +- **Support multiple data types:** W4A16, MXFP8, MXFP4, FP8, NVFP4, with more on the way +- **Mixed‑bit**, layer‑wise precision search for flexible accuracy–efficiency trade‑offs +- Applicability across both **LLMs** and **VLMs** + +AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel®** **Xeon****® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). + +Looking forward, as Intel’s next‑generation GPUs—**including Intel® Crescent Island**—add native support for **FP8, MXFP8, and MXFP4** formats, models optimized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment. + +For more details, please refer to the paper [AutoRound (EMNLP 2024)](https://aclanthology.org/2024.findings-emnlp.662.pdf) and the GitHub repository [intel/auto-round](https://github.com/intel/auto-round). + +## Why Integrate Into LLM Compressor? + +**LLM** **Compressor** already provides a unified, modular system for compression primitives such as quantization, pruning, and distillation. 
Integrating AutoRound into this ecosystem: + +- Aligns with the existing modifier architecture (e.g., `GPTQModifier`) +- Reuses the sequential calibration and layer‑onloading infrastructure +- Enables future interoperability with richer multi‑modifier recipes +- Produces quantized models that are ready for vLLM serving, enabling a clean workflow from compression to deployment + +## Integration Overview + +We completed the first stage of integration by introducing the new `AutoRoundModifier` into LLM Compressor, enabling production of `wNa16` (e.g., W4A16) compressed models that seamlessly load in vLLM, as implemented in [PR #1994](https://github.com/vllm-project/llm-compressor/pull/1994). With a straightforward configuration—just specify your model and calibration data—you can quickly generate high‑quality low‑bit checkpoints. This initial stage supports quantizing a range of dense LLMs, including the **Llama** and **Qwen** model families, and demonstrates robust compatibility for practical deployment. + +## Try It Now (Quickstart) + +### 1. Install + +```Bash +git clone https://github.com/vllm-project/llm-compressor.git +cd llm-compressor +pip install -e . +``` + +### 2. Load Model & Tokenizer + +```Python +from transformers import AutoModelForCausalLM, AutoTokenizer +MODEL_ID = "Qwen/Qwen3-8B" +model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) +``` + +### 3. Prepare Calibration Data + +```Python +from auto_round.calib_dataset import get_dataset +NUM_CALIBRATION_SAMPLES = 128 +MAX_SEQUENCE_LENGTH = 2048 +ds = get_dataset(tokenizer=tokenizer, + seqlen=MAX_SEQUENCE_LENGTH, + nsamples=NUM_CALIBRATION_SAMPLES) +``` + +### 4. Run Quantization using AutoRound + +The AutoRound quantization can run on a variety of devices, including CPUs and GPUs. Quantization and serving may not happen on the same device. For example, you can quantize on a workstation with GPU and later deploy on AIPC. + +```Python +from llmcompressor import oneshot +from llmcompressor.modifiers.autoround import AutoRoundModifier + +recipe = AutoRoundModifier( + targets="Linear", + scheme="W4A16", + ignore=["lm_head"], + iters=200, + enable_torch_compile=False, + batch_size=2, +) + +oneshot( + model=model, + dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, + shuffle_calibration_samples=False, +) + +SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16-G128-AutoRound" +model.save_pretrained(SAVE_DIR, save_compressed=True) +tokenizer.save_pretrained(SAVE_DIR) +``` + +In practice, **128 calibration samples + ~200 iterations** often reach stable convergence. Increase the number of samples or iterations if you are targeting extremely low bits or tighter accuracy targets. + +### 5. Serve in vLLM + +Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel®** **Arc****™ Pro B60** **GPU**: + +```Bash +vllm serve Qwen3-8B-W4A16-G128-AutoRound \ + --dtype=bfloat16 \ + --enforce-eager \ + --gpu-memory-util=0.8 \ + --no-enable-prefix-caching \ + --max-num-batched-tokens=8192 +``` + +Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull/29484/ + +### 6. 
Evaluate (Example: GSM8K with `lm_eval`) + +```Bash +lm_eval --model vllm \ + --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=truemax_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \ + --tasks gsm8k \ + --num_fewshot 5 \ + --limit 1000 \ + --batch_size 'auto' +|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| +|-----|------:|----------------|-----:|-----------|---|----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.908|± |0.0091| +| | |strict-match | 5|exact_match|↑ |0.907|± |0.0092| +``` + +## Conclusion & Future Plans + +With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen. The setup is robust, streamlined, and ready for practical deployment. + +Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor. So AutoRound can be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads. + +If you’d like to influence which formats, models, and workflows we prioritize next, please join the discussion in [RFC #1968](https://github.com/vllm-project/llm-compressor/issues/1968) and share your benchmarks or deployment requirements, or bring your feedback to the Intel Community so we can align the roadmap with real‑world needs. + +### Acknowledgements + +We’d like to thank the **vLLM / LLM Compressor** community for extensive early discussions on the proposal and for their thoughtful reviews of the pull requests. + +#### Related RFCs and PRs + +RFC: https://github.com/vllm-project/llm-compressor/issues/1968 + +PRs: + +- https://github.com/vllm-project/llm-compressor/pull/1994 +- https://github.com/vllm-project/llm-compressor/pull/2055 +- https://github.com/vllm-project/llm-compressor/pull/2062 (Under Review) +- https://github.com/vllm-project/vllm/pull/29484/ (Under Review) From 10b5c721b20868f41c7e0c19cf5184c4c0dd46d4 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 27 Nov 2025 01:40:50 +0000 Subject: [PATCH 02/17] correct Signed-off-by: yiliu30 --- _posts/2025-11-27-intel-autoround-llmc.md | 24 +++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/_posts/2025-11-27-intel-autoround-llmc.md b/_posts/2025-11-27-intel-autoround-llmc.md index 58f06e9..00907b8 100755 --- a/_posts/2025-11-27-intel-autoround-llmc.md +++ b/_posts/2025-11-27-intel-autoround-llmc.md @@ -19,7 +19,7 @@ Broader quantization schemes and model coverage are coming next—try it now and ## What Is AutoRound? -**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor:`V` (rounding offset/adjustment),`α` and `β` (learned clipping range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error. 
+**AutoRound** is an advanced post-training quantization (PTQ) algorithm designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It introduces three trainable parameters per quantized tensor: `V` (rounding offset/adjustment), `α` and `β` (learned clipping range controls). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error. Core strengths: @@ -28,7 +28,7 @@ Core strengths: - **Mixed‑bit**, layer‑wise precision search for flexible accuracy–efficiency trade‑offs - Applicability across both **LLMs** and **VLMs** -AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel®** **Xeon****® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). +AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon ® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). Looking forward, as Intel’s next‑generation GPUs—**including Intel® Crescent Island**—add native support for **FP8, MXFP8, and MXFP4** formats, models optimized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment. @@ -51,7 +51,7 @@ We completed the first stage of integration by introducing the new `AutoRoundMod ### 1. Install -```Bash +```bash git clone https://github.com/vllm-project/llm-compressor.git cd llm-compressor pip install -e . @@ -59,7 +59,7 @@ pip install -e . ### 2. Load Model & Tokenizer -```Python +```python from transformers import AutoModelForCausalLM, AutoTokenizer MODEL_ID = "Qwen/Qwen3-8B" model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") @@ -68,7 +68,7 @@ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) ### 3. Prepare Calibration Data -```Python +```python from auto_round.calib_dataset import get_dataset NUM_CALIBRATION_SAMPLES = 128 MAX_SEQUENCE_LENGTH = 2048 @@ -81,7 +81,7 @@ ds = get_dataset(tokenizer=tokenizer, The AutoRound quantization can run on a variety of devices, including CPUs and GPUs. Quantization and serving may not happen on the same device. For example, you can quantize on a workstation with GPU and later deploy on AIPC. -```Python +```python from llmcompressor import oneshot from llmcompressor.modifiers.autoround import AutoRoundModifier @@ -112,9 +112,9 @@ In practice, **128 calibration samples + ~200 iterations** often reach stable co ### 5. Serve in vLLM -Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel®** **Arc****™ Pro B60** **GPU**: +Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. 
For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single **Intel® Arc™ Pro B60 GPU**:

-```Bash
+```bash
 vllm serve Qwen3-8B-W4A16-G128-AutoRound \
     --dtype=bfloat16 \
     --enforce-eager \
@@ -127,9 +127,9 @@ Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull

 ### 6. Evaluate (Example: GSM8K with `lm_eval`)

-```Bash
+```bash
 lm_eval --model vllm \
-  --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=truemax_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
+  --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=true,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
   --tasks gsm8k \
   --num_fewshot 5 \
   --limit 1000 \
@@ -142,9 +142,9 @@ lm_eval --model vllm \

 ## Conclusion & Future Plans

-With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen. The setup is robust, streamlined, and ready for practical deployment.
+With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen are supported. The setup is robust, streamlined, and ready for practical deployment.

-Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor. So AutoRound can be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.
+Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor, which will allow AutoRound to be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.

 If you’d like to influence which formats, models, and workflows we prioritize next, please join the discussion in [RFC #1968](https://github.com/vllm-project/llm-compressor/issues/1968) and share your benchmarks or deployment requirements, or bring your feedback to the Intel Community so we can align the roadmap with real‑world needs.
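The description of `V`, `α`, and `β` above is compact, so a small sketch helps make the tuning idea concrete: fake-quantize one Linear layer's weights with a learnable rounding offset and learnable clip scales, then update those parameters with signed gradient descent against the layer's output reconstruction error. This is an illustrative simplification, not the intel/auto-round implementation: the real algorithm tunes whole decoder blocks, uses group-wise scales (e.g., group size 128 for W4A16), and adds considerably more machinery, and every name and hyperparameter below is an assumption made for the example.

```python
import torch
import torch.nn.functional as F


def tune_linear_autoround_style(layer, calib_inputs, iters=200, lr=5e-3, bits=4):
    """Fake-quantize one nn.Linear with a learnable rounding offset and learnable
    clip scales, tuned by signed gradient descent on output reconstruction error."""
    w = layer.weight.detach()
    bias = None if layer.bias is None else layer.bias.detach()
    qmax = 2 ** bits - 1
    v = torch.zeros_like(w, requires_grad=True)  # per-weight rounding offset
    alpha = torch.ones(w.shape[0], 1, dtype=w.dtype, device=w.device,
                       requires_grad=True)       # learned upper-clip scale
    beta = torch.ones(w.shape[0], 1, dtype=w.dtype, device=w.device,
                      requires_grad=True)        # learned lower-clip scale
    with torch.no_grad():
        ref = [F.linear(x, w, bias) for x in calib_inputs]  # FP outputs to reconstruct

    for _ in range(iters):
        w_max = w.amax(dim=1, keepdim=True) * alpha         # learned clipping range
        w_min = w.amin(dim=1, keepdim=True) * beta
        scale = (w_max - w_min).clamp_min(1e-5) / qmax
        zero = torch.round(-w_min / scale)
        q = w / scale + v + zero
        # Straight-through estimator: round/clamp in forward, identity in backward.
        q = q + (torch.clamp(torch.round(q), 0, qmax) - q).detach()
        w_q = (q - zero) * scale                             # fake-quantized weight
        loss = sum(F.mse_loss(F.linear(x, w_q, bias), r)
                   for x, r in zip(calib_inputs, ref))
        loss.backward()
        with torch.no_grad():
            for p in (v, alpha, beta):
                p -= lr * p.grad.sign()                      # signed gradient descent
                p.grad = None
    return w_q.detach()
```

In the integrated flow, `AutoRoundModifier` drives this kind of optimization block by block through LLM Compressor's sequential calibration and layer-onloading infrastructure, so users configure a recipe rather than writing the loop themselves.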
From dbeed266caa61630ba82ede35032aabb3a811b6a Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 27 Nov 2025 01:44:51 +0000 Subject: [PATCH 03/17] update Signed-off-by: yiliu30 --- _posts/2025-11-27-intel-autoround-llmc.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/_posts/2025-11-27-intel-autoround-llmc.md b/_posts/2025-11-27-intel-autoround-llmc.md index 00907b8..d4364f0 100755 --- a/_posts/2025-11-27-intel-autoround-llmc.md +++ b/_posts/2025-11-27-intel-autoround-llmc.md @@ -134,6 +134,8 @@ lm_eval --model vllm \ --num_fewshot 5 \ --limit 1000 \ --batch_size 'auto' +``` +```bash |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |-----|------:|----------------|-----:|-----------|---|----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.908|± |0.0091| From e823e9284e0fe492e3a0b9183f2990e1c76895fc Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 3 Dec 2025 07:51:01 +0000 Subject: [PATCH 04/17] update Signed-off-by: yiliu30 --- ...lmc.md => 2025-12-03-intel-autoround-llmc.md} | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) rename _posts/{2025-11-27-intel-autoround-llmc.md => 2025-12-03-intel-autoround-llmc.md} (89%) diff --git a/_posts/2025-11-27-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md similarity index 89% rename from _posts/2025-11-27-intel-autoround-llmc.md rename to _posts/2025-12-03-intel-autoround-llmc.md index d4364f0..61a0786 100755 --- a/_posts/2025-11-27-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -1,8 +1,7 @@ --- layout: post -title: "Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor [Draft]" -author: "Intel Neural Compressor Team" -image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png +title: "Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor" +author: "Intel Neural Compressor Team, Red Hat AI Model Optimization Team" --- @@ -30,13 +29,13 @@ Core strengths: AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon ® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). -Looking forward, as Intel’s next‑generation GPUs—**including Intel® Crescent Island**—add native support for **FP8, MXFP8, and MXFP4** formats, models optimized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment. +Looking forward, Intel is adding native support for FP8, MXFP8, and MXFP4 formats to its next-generation **Data Center GPUs, codenamed Crescent Island**. Models quantized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment. For more details, please refer to the paper [AutoRound (EMNLP 2024)](https://aclanthology.org/2024.findings-emnlp.662.pdf) and the GitHub repository [intel/auto-round](https://github.com/intel/auto-round). ## Why Integrate Into LLM Compressor? -**LLM** **Compressor** already provides a unified, modular system for compression primitives such as quantization, pruning, and distillation. Integrating AutoRound into this ecosystem: +**LLM** **Compressor** already provides a unified, modular system for compression primitives such as quantization and pruning. 
Integrating AutoRound into this ecosystem: - Aligns with the existing modifier architecture (e.g., `GPTQModifier`) - Reuses the sequential calibration and layer‑onloading infrastructure @@ -90,8 +89,6 @@ recipe = AutoRoundModifier( scheme="W4A16", ignore=["lm_head"], iters=200, - enable_torch_compile=False, - batch_size=2, ) oneshot( @@ -141,6 +138,7 @@ lm_eval --model vllm \ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.908|± |0.0091| | | |strict-match | 5|exact_match|↑ |0.907|± |0.0092| ``` +Note: The results may fluctuate due to non-determinism. ## Conclusion & Future Plans @@ -152,7 +150,7 @@ If you’d like to influence which formats, models, and workflows we prioritize ### Acknowledgements -We’d like to thank the **vLLM / LLM Compressor** community for extensive early discussions on the proposal and for their thoughtful reviews of the pull requests. +We wish to acknowledge the contributions of the LLM Compressor community. Specifically, we thank Kyle Sayers, Dipika Sikka, Brian Dellabetta, Charles Hernandez, and Robert Shaw for their invaluable feedback on the early proposal and their diligent review of the pull requests. #### Related RFCs and PRs @@ -162,5 +160,5 @@ PRs: - https://github.com/vllm-project/llm-compressor/pull/1994 - https://github.com/vllm-project/llm-compressor/pull/2055 -- https://github.com/vllm-project/llm-compressor/pull/2062 (Under Review) +- https://github.com/vllm-project/llm-compressor/pull/2062 - https://github.com/vllm-project/vllm/pull/29484/ (Under Review) From 0d3dcfce6839e49b8914710d0b2f301605bc45a4 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 3 Dec 2025 07:59:14 +0000 Subject: [PATCH 05/17] fix Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 61a0786..66277eb 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -126,13 +126,12 @@ Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull ```bash lm_eval --model vllm \ - --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,add_bos_token=true,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \ + --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \ --tasks gsm8k \ --num_fewshot 5 \ --limit 1000 \ - --batch_size 'auto' -``` -```bash + --batch_size 128 + |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |-----|------:|----------------|-----:|-----------|---|----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.908|± |0.0091| From 39a57ea0730e7785bf809254b9f4ac9a4b768842 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 3 Dec 2025 08:06:43 +0000 Subject: [PATCH 06/17] fix Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 66277eb..8972323 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -27,7 +27,7 @@ Core strengths: - **Mixed‑bit**, layer‑wise precision search for flexible 
accuracy–efficiency trade‑offs - Applicability across both **LLMs** and **VLMs** -AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon ® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). +AutoRound enables quantized models in a range of low‑bit formats that are designed to accelerate inference on **Intel® Xeon® processors**, **Intel® Gaudi® AI accelerators**, **Intel® Data Center GPUs**, **Intel® Arc™ B‑Series Graphics**, as well as other GPUs (e.g., CUDA‑based devices). Looking forward, Intel is adding native support for FP8, MXFP8, and MXFP4 formats to its next-generation **Data Center GPUs, codenamed Crescent Island**. Models quantized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real‑world deployment. From c54e0d326bb2c3b70c41fe621f83d720dcc303ba Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 3 Dec 2025 08:24:56 +0000 Subject: [PATCH 07/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 8972323..daee1e5 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -134,8 +134,8 @@ lm_eval --model vllm \ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |-----|------:|----------------|-----:|-----------|---|----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.908|± |0.0091| -| | |strict-match | 5|exact_match|↑ |0.907|± |0.0092| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.911|± | 0.009| +| | |strict-match | 5|exact_match|↑ |0.911|± | 0.009| ``` Note: The results may fluctuate due to non-determinism. From 93af89fd4cf2b17b50ce571f956367ae1728b8d6 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 3 Dec 2025 08:29:56 +0000 Subject: [PATCH 08/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index daee1e5..1adef25 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -153,11 +153,4 @@ We wish to acknowledge the contributions of the LLM Compressor community. Specif #### Related RFCs and PRs -RFC: https://github.com/vllm-project/llm-compressor/issues/1968 - -PRs: - -- https://github.com/vllm-project/llm-compressor/pull/1994 -- https://github.com/vllm-project/llm-compressor/pull/2055 -- https://github.com/vllm-project/llm-compressor/pull/2062 -- https://github.com/vllm-project/vllm/pull/29484/ (Under Review) +[llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [vllm#29484](https://github.com/vllm-project/vllm/pull/29484) (Under Review). 
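The GSM8K table covers only the quantized checkpoint. To judge how much accuracy W4A16 costs, it helps to run the identical protocol on the unquantized model as well. The command below mirrors the flags used in the post; the baseline score itself is not reported here, so treat this as a reader-side sanity check.

```bash
# Baseline (BF16) run with the same evaluation settings as the quantized model.
lm_eval --model vllm \
  --model_args pretrained="Qwen/Qwen3-8B,max_model_len=8192,gpu_memory_utilization=0.8,dtype=bfloat16" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 1000 \
  --batch_size 128
```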
From 27e5a9e07beb1c562c59a906b32de88f77b4aec6 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 4 Dec 2025 01:22:15 +0000 Subject: [PATCH 09/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 1adef25..2132f43 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -4,6 +4,7 @@ title: "Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor" author: "Intel Neural Compressor Team, Red Hat AI Model Optimization Team" --- +**Achieve faster, more efficient LLM serving without sacrificing accuracy!** ## TL;DR @@ -13,6 +14,7 @@ We’re excited to announce that **[AutoRound](https://aclanthology.org/2024.fin - Lightweight tuning (hundreds of steps, not thousands) - Zero additional inference overhead - Seamless compatibility with `compressed-tensors` and direct serving in [vLLM](https://github.com/vllm-project/vllm) +- Streamlined workflow: quantize and serve models with just a few lines of code Broader quantization schemes and model coverage are coming next—try it now and help shape what we build. From b1fc7e626b5537fc5229ba0cf8ed1ce5974d30e0 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 4 Dec 2025 01:22:51 +0000 Subject: [PATCH 10/17] format Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 2132f43..4913eb2 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -39,9 +39,9 @@ For more details, please refer to the paper [AutoRound (EMNLP 2024)](https://acl **LLM** **Compressor** already provides a unified, modular system for compression primitives such as quantization and pruning. Integrating AutoRound into this ecosystem: -- Aligns with the existing modifier architecture (e.g., `GPTQModifier`) -- Reuses the sequential calibration and layer‑onloading infrastructure -- Enables future interoperability with richer multi‑modifier recipes +- Aligns with the existing modifier architecture (e.g., `GPTQModifier`) +- Reuses the sequential calibration and layer‑onloading infrastructure +- Enables future interoperability with richer multi‑modifier recipes - Produces quantized models that are ready for vLLM serving, enabling a clean workflow from compression to deployment ## Integration Overview From 7ed2f21978ba1fe01d2b0250226800ef06c64779 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 4 Dec 2025 01:43:19 +0000 Subject: [PATCH 11/17] fix Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 4913eb2..d145c06 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -155,4 +155,5 @@ We wish to acknowledge the contributions of the LLM Compressor community. 
Specif #### Related RFCs and PRs -[llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [vllm#29484](https://github.com/vllm-project/vllm/pull/29484) (Under Review). +[llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [auto-round#993](https://github.com/intel/auto-round/pull/993), [auto-round#1053](https://github.com/intel/auto-round/pull/1053), [auto-round#1055](https://github.com/intel/auto-round/pull/1055), [auto-round#1072](https://github.com/intel/auto-round/pull/1072), +[vllm#29484](https://github.com/vllm-project/vllm/pull/29484) (Under Review). From e66caf6657229d9b370cac631a2d6b8446b1213d Mon Sep 17 00:00:00 2001 From: Yi Liu Date: Thu, 4 Dec 2025 21:29:26 +0800 Subject: [PATCH 12/17] Update _posts/2025-12-03-intel-autoround-llmc.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Yi Liu --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index d145c06..56a3421 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -155,5 +155,5 @@ We wish to acknowledge the contributions of the LLM Compressor community. Specif #### Related RFCs and PRs -[llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [auto-round#993](https://github.com/intel/auto-round/pull/993), [auto-round#1053](https://github.com/intel/auto-round/pull/1053), [auto-round#1055](https://github.com/intel/auto-round/pull/1055), [auto-round#1072](https://github.com/intel/auto-round/pull/1072), +[llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [auto-round#993](https://github.com/intel/auto-round/pull/993), [auto-round#1053](https://github.com/intel/auto-round/pull/1053), [auto-round#1055](https://github.com/intel/auto-round/pull/1055), [auto-round#1072](https://github.com/intel/auto-round/pull/1072), [vllm#29484](https://github.com/vllm-project/vllm/pull/29484) (Under Review). 
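The "quantize and serve models with just a few lines of code" claim is easy to check with vLLM's offline API before standing up a server. The snippet below is a sketch: it assumes the checkpoint saved as `SAVE_DIR` earlier and a vLLM build that includes the W4A16/AutoRound loading support referenced in this series; the prompt and sampling settings are arbitrary.

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint produced by the oneshot recipe above.
llm = LLM(model="./Qwen3-8B-W4A16-G128-AutoRound",
          dtype="bfloat16",
          gpu_memory_utilization=0.8)

# Greedy decoding is enough for a quick correctness smoke test.
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```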
From de67e9fbdbba743b0e34a1b6a4dac3d1c6d2f95a Mon Sep 17 00:00:00 2001 From: Yi Liu Date: Thu, 4 Dec 2025 21:34:36 +0800 Subject: [PATCH 13/17] Update _posts/2025-12-03-intel-autoround-llmc.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Yi Liu --- _posts/2025-12-03-intel-autoround-llmc.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 56a3421..bf3a509 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -116,10 +116,8 @@ Once quantization is complete, the same compressed model can be served on differ ```bash vllm serve Qwen3-8B-W4A16-G128-AutoRound \ --dtype=bfloat16 \ - --enforce-eager \ - --gpu-memory-util=0.8 \ - --no-enable-prefix-caching \ - --max-num-batched-tokens=8192 + --gpu-memory-utilization 0.8 \ + --max-num-batched-tokens 8192 ``` Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull/29484/ From c64f54893b4c4398ff7479edc6fe25013e012d39 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 4 Dec 2025 13:50:02 +0000 Subject: [PATCH 14/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index bf3a509..9b336b2 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -120,7 +120,7 @@ vllm serve Qwen3-8B-W4A16-G128-AutoRound \ --max-num-batched-tokens 8192 ``` -Note: please install vLLM from this PR https://github.com/vllm-project/vllm/pull/29484/ +Note: Please install vLLM from PR #29484. When serving on XPU, you must run vLLM with the --enforce-eager flag. ### 6. Evaluate (Example: GSM8K with `lm_eval`) From fcddf20325e19bfefb410e5dc5ffc7a12ed68180 Mon Sep 17 00:00:00 2001 From: Yi Liu Date: Thu, 4 Dec 2025 21:37:07 +0800 Subject: [PATCH 15/17] Update _posts/2025-12-03-intel-autoround-llmc.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Yi Liu --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index 9b336b2..adfb6c5 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -46,7 +46,7 @@ For more details, please refer to the paper [AutoRound (EMNLP 2024)](https://acl ## Integration Overview -We completed the first stage of integration by introducing the new `AutoRoundModifier` into LLM Compressor, enabling production of `wNa16` (e.g., W4A16) compressed models that seamlessly load in vLLM, as implemented in [PR #1994](https://github.com/vllm-project/llm-compressor/pull/1994). With a straightforward configuration—just specify your model and calibration data—you can quickly generate high‑quality low‑bit checkpoints. This initial stage supports quantizing a range of dense LLMs, including the **Llama** and **Qwen** model families, and demonstrates robust compatibility for practical deployment. +We completed the first stage of integration by introducing the new `AutoRoundModifier` into LLM Compressor, enabling production of `W{n}A16` (e.g., W4A16) compressed models that seamlessly load in vLLM, as implemented in [PR #1994](https://github.com/vllm-project/llm-compressor/pull/1994). 
With a straightforward configuration—just specify your model and calibration data—you can quickly generate high‑quality low‑bit checkpoints. This initial stage supports quantizing a range of dense LLMs, including the **Llama** and **Qwen** model families, and demonstrates robust compatibility for practical deployment. ## Try It Now (Quickstart) From 992ad6753bf3591c1066e63d527f0aeae9bd9065 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 4 Dec 2025 13:51:14 +0000 Subject: [PATCH 16/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index adfb6c5..effda08 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -120,7 +120,7 @@ vllm serve Qwen3-8B-W4A16-G128-AutoRound \ --max-num-batched-tokens 8192 ``` -Note: Please install vLLM from PR #29484. When serving on XPU, you must run vLLM with the --enforce-eager flag. +Note: Please install vLLM from PR #29484. When serving on XPU, you must run vLLM with the `--enforce-eager` flag. ### 6. Evaluate (Example: GSM8K with `lm_eval`) From 662d2488e8eb69bd71daf643f14a45c684aedd52 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Sat, 6 Dec 2025 04:50:54 +0000 Subject: [PATCH 17/17] update Signed-off-by: yiliu30 --- _posts/2025-12-03-intel-autoround-llmc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-12-03-intel-autoround-llmc.md b/_posts/2025-12-03-intel-autoround-llmc.md index effda08..996cfc3 100755 --- a/_posts/2025-12-03-intel-autoround-llmc.md +++ b/_posts/2025-12-03-intel-autoround-llmc.md @@ -154,4 +154,4 @@ We wish to acknowledge the contributions of the LLM Compressor community. Specif #### Related RFCs and PRs [llm-compressor#1968](https://github.com/vllm-project/llm-compressor/issues/1968), [llm-compressor#1994](https://github.com/vllm-project/llm-compressor/pull/1994), [llm-compressor#2055](https://github.com/vllm-project/llm-compressor/pull/2055), [llm-compressor#2062](https://github.com/vllm-project/llm-compressor/pull/2062), [auto-round#993](https://github.com/intel/auto-round/pull/993), [auto-round#1053](https://github.com/intel/auto-round/pull/1053), [auto-round#1055](https://github.com/intel/auto-round/pull/1055), [auto-round#1072](https://github.com/intel/auto-round/pull/1072), -[vllm#29484](https://github.com/vllm-project/vllm/pull/29484) (Under Review). +[vllm#29484](https://github.com/vllm-project/vllm/pull/29484).
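For readers who want the serving note in concrete form: one way to try the vLLM changes from PR #29484 before they land is to check out the PR head and build from source, then serve with eager mode on Intel XPU. The branch name and build invocation below are assumptions; follow vLLM's source-install documentation for your platform if they differ.

```bash
# Fetch and build the PR branch (assumed workflow; adjust to your environment).
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/29484/head:autoround-wna16
git checkout autoround-wna16
pip install -e .

# On Intel XPU, eager mode is currently required, per the note above.
vllm serve Qwen3-8B-W4A16-G128-AutoRound \
    --dtype=bfloat16 \
    --gpu-memory-utilization 0.8 \
    --max-num-batched-tokens 8192 \
    --enforce-eager
```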