Description
Hi, this is the INC team from Intel. Thank you for developing this amazing project!
Our team has developed AutoRound, a novel tuning-based quantization algorithm that delivers state-of-the-art accuracy with low tuning cost and no extra inference overhead. The key features are listed below:
- Superior Accuracy: Achieves leading performance across various scenarios, including very low bits (2-3 bits).
- Multiple Data Types: Effective for a variety of data types (W4A16, MXFP8, MXFP4, FP8, NVFP4, etc.).
- Broad Model Support: Validated on both Large Language Models (LLMs) and Vision-Language Models (VLMs).
- Advanced Mixed-Precision: Supports layer-wise mixed-bit quantization and automatic bit-width search.
- Flexible Tuning: Offers a configurable tuning space to balance tuning cost and accuracy goals.
For more detailed information, please refer to our paper and GitHub repository.
We propose integrating AutoRound into LLM Compressor to provide users with a simple, high-accuracy quantization option.
The Key Idea of AutoRound
AutoRound quantizes a given tensor by introducing three trainable parameters (V, α, and β) that adjust the rounding values and the clipping range. For a given transformers model, AutoRound quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.
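To make this concrete, below is a minimal PyTorch sketch of the quantizer, paraphrased from the paper. The function name and the exact scale/zero-point bookkeeping are ours for illustration, not the auto-round API; in the actual algorithm, V, α, and β are optimized against the block-wise reconstruction loss.

```python
import torch

def autoround_fake_quant(w, v, alpha, beta, bits=4):
    """Illustrative AutoRound-style fake quantization of a weight tensor.

    Names and layout are ours, not the auto-round API. v has one entry per
    weight element; alpha and beta are per-tensor scalars.
    """
    # alpha and beta rescale the clipping range used to derive the scale.
    w_max = w.max() * alpha
    w_min = w.min() * beta
    scale = (w_max - w_min) / (2**bits - 1)
    zero_point = torch.round(-w_min / scale)
    # v (in [-0.5, 0.5]) nudges each element's rounding decision up or down.
    # The real implementation backpropagates through round() with a
    # straight-through estimator and trains v, alpha, beta to minimize the
    # block-wise output reconstruction error.
    q = torch.clamp(torch.round(w / scale + v) + zero_point, 0, 2**bits - 1)
    return scale * (q - zero_point)
```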
Integration Proposal
AutoRound is already integrated into popular LLM ecosystems such as Transformers, vLLM, SGLang, and TorchAO. For LLM Compressor, we propose the two integration options below:
Option 1. A New Modifier for AutoRound (Recommended)
This approach integrates AutoRound as a new modifier within LLM Compressor. Specifically, we propose creating an AutoRoundModifier that wraps the AutoRound algorithm, delegating the actual tuning logic to the intel/auto-round library.
This method provides a lightweight, maintainable integration that directly leverages the AutoRound library while aligning perfectly with LLM Compressor's modifier architecture.

(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)
The initial design and potential code changes are as follows:
```python
# llm-compressor/src/llmcompressor/modifiers/quantization/autoround/core.py
class AutoRoundModifier(Modifier, QuantizationMixin):
    def on_event(self, state: State, event: Event, **kwargs):
        if event.type_ == EventType.CALIBRATION_EPOCH_START:
            if not self.started_:
                self.on_start(state, None)

        if event.type_ == EventType.SEQUENTIAL_EPOCH_END:
            self.compress_modules()

        if event.type_ == EventType.CALIBRATION_EPOCH_END:
            self.compress_modules()
            if not self.ended_:
                self.on_end(state, None)

    def compress_modules(self):
        """
        Apply AutoRound tuning to a single decoder layer.

        This method delegates the actual quantization to AutoRound
        and updates the layer's weights, scales, and zero points accordingly.
        """
        # TODO: Integrate with external tuning API.
        # Example (pseudo, TBD):
        # tuned_layer = autoround.quant_block(decoding_layer, config=self.config)
        # self._update_layer(decoding_layer, tuned_layer)
        pass
```
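If the modifier lands as proposed, user-facing usage could mirror the existing quantization modifiers (e.g. GPTQModifier). The snippet below is hypothetical: AutoRoundModifier and its arguments are part of this proposal, not an existing API.

```python
from llmcompressor import oneshot

# Hypothetical import: AutoRoundModifier does not exist yet; its arguments
# here mirror the existing quantization modifiers such as GPTQModifier.
from llmcompressor.modifiers.quantization import AutoRoundModifier

recipe = AutoRoundModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # any HF causal LM
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Option 2. Introduce a New Pipeline Wrapping Intel Neural Compressor (INC)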
AutoRound is already implemented in Intel/neural-compressor. INC supports popular model compression techniques across mainstream deep learning frameworks, including PyTorch, TensorFlow, and JAX (planned for the near future).
In this option, we propose introducing a new pipeline to wrap INC. This approach would allow us to leverage INC's framework abstraction infrastructure, paving the way for supporting more frameworks in the future.

(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)
```python
# llm-compressor/src/llmcompressor/pipelines/inc/pipeline.py
from typing import TYPE_CHECKING, Union

import torch
import tqdm
from torch.utils.data import DataLoader

# Internal llm-compressor helpers; import paths shown here are indicative.
from llmcompressor.pipelines.registry import CalibrationPipeline
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.utils import get_execution_device

if TYPE_CHECKING:
    from llmcompressor.args import DatasetArguments


@CalibrationPipeline.register("inc")
class INCPipeline(CalibrationPipeline):
    @staticmethod
    def __call__(
        model: torch.nn.Module,
        dataloader: DataLoader,
        dataset_args: Union["DatasetArguments", None],
    ):
        """
        Applies the INC calibration procedure.
        """
        dispatch_for_generation(model)  # basic dispatch is identical to generation
        model_device = get_execution_device(model)

        from neural_compressor.torch.quantization import convert, prepare

        # Pseudo helper (TBD): maps the recipe to an INC AutoRound config.
        inc_config = prepare_autoround_config(...)
        model = prepare(model, inc_config)

        # apply calibration
        for batch in tqdm.tqdm(dataloader, desc="Calibrating"):
            model(**batch)

        model = convert(model)
```
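Assuming the new pipeline is selectable the same way as the existing ones (i.e. via the `pipeline` argument that llm-compressor already exposes), invocation might look like the sketch below. This is a sketch under that assumption, not a confirmed interface.

```python
from llmcompressor import oneshot

# Hypothetical invocation: routes calibration through the INC-backed
# pipeline registered above as "inc".
oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",
    dataset="open_platypus",
    recipe=recipe,  # quantization recipe as sketched in Option 1
    pipeline="inc",
    num_calibration_samples=512,
)
```

Comparison Between Two Options: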
| Item | Option 1: AutoRound Modifier | Option 2: INC Pipeline |
|---|---|---|
| Architectural Alignment | Native fit within LLM Compressor's modifier paradigm. | Introduces a new pipeline abstraction. |
| Dependencies | Only requires the `auto-round` library. | Entire INC toolkit and its dependencies. |
| Implementation Effort | Utilizes existing APIs with localized modifications. | Requires designing a new pipeline and mapping configurations. |
| Model Scope | Focused on LLMs and VLMs. | Broader: LLMs, VLMs, and models from other domains such as CV. |
| Framework Support | PyTorch-focused. | Multi-framework: inherits INC's PyTorch/TF/JAX support. |
Based on this comparison, we recommend going with Option 1 to add AutoRound support. Please feel free to comment on the flow described above or suggest additional approaches :) Thank you in advance!
