
RFC: Add Intel AutoRound Quantization Algorithm Support #1968


Description

@yiliu30

Hi, this is the INC team from Intel. Thank you for developing this amazing project!

Our team has developed AutoRound, a novel tuning-based quantization algorithm that delivers state-of-the-art accuracy with low tuning cost and no extra inference overhead. The key features are listed below:

  • Superior Accuracy: Achieves leading performance across various scenarios, including very low bits (2-3 bits).
  • Multiple Data Types: Effective for a variety of data types (W4A16, MXFP8, MXFP4, FP8, NVFP4, etc.).
  • Broad Model Support: Validated on both Large Language Models (LLMs) and Vision-Language Models (VLMs).
  • Advanced Mixed-Precision: Supports layer-wise mixed-bit quantization and automatic bit-width search.
  • Flexible Tuning: Offers a configurable tuning space to balance tuning cost and accuracy goals.

For more detailed information, please refer to our paper and GitHub repository.

We propose integrating AutoRound into LLM Compressor to provide users with a simple, high-accuracy quantization option.

The Key Idea of AutoRound

AutoRound quantizes a given tensor by introducing three trainable parameters (V, α, and β) that adjust the rounding value and the clipping range. For a given Transformers model, AutoRound quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.

(Figure: AutoRound overview)
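
To make the mechanics concrete, below is a minimal PyTorch sketch of the trainable rounding step. This is not the actual intel/auto-round implementation; the function names, the asymmetric per-channel min/max scheme, and the shapes are illustrative assumptions.

```python
# Minimal sketch of AutoRound-style trainable rounding (illustrative only;
# see intel/auto-round for the real implementation).
import torch


def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: forward = round(x), backward = identity,
    # so gradients can reach the trainable parameters V, alpha, and beta.
    return x + (torch.round(x) - x).detach()


def autoround_fake_quant(w, v, alpha, beta, num_bits=4):
    # alpha and beta rescale the clipping range used to derive the
    # scale and zero point (asymmetric per-row quantization assumed here).
    w_max = w.amax(dim=-1, keepdim=True) * alpha
    w_min = w.amin(dim=-1, keepdim=True) * beta
    scale = (w_max - w_min) / (2**num_bits - 1)
    zp = ste_round(-w_min / scale)
    # V perturbs each weight's rounding decision before clamping.
    q = torch.clamp(ste_round(w / scale + v) + zp, 0, 2**num_bits - 1)
    return (q - zp) * scale
```

During tuning, V, α, and β are optimized per decoder block to minimize the reconstruction error between the original block's output and the quantized block's output on calibration data; the tuned rounding is then frozen, so inference incurs no extra overhead.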

Integration Proposal

AutoRound is already integrated into popular LLM ecosystems such as Transformers, vLLM, SGLang, and TorchAO. For LLM Compressor, we propose the two integration options below:

Option 1. A New Modifier for AutoRound (Recommended)

This approach integrates AutoRound as a new modifier within LLM Compressor. Specifically, we propose creating an AutoRoundModifier that wraps the AutoRound algorithm, delegating the actual tuning logic to the intel/auto-round library.
This method provides a lightweight, maintainable integration that directly leverages the AutoRound library while aligning perfectly with LLM Compressor's modifier architecture.

(Figure: Option 1 integration flow)
(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)

The initial design and potential code changes are as follows:

```python
# LLM Compressor/src/llmcompressor/modifiers/quantization/autoround/core.py

# NOTE: import paths follow the current LLM Compressor layout and may shift
# during review.
from llmcompressor.core import Event, EventType, State
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization import QuantizationMixin


class AutoRoundModifier(Modifier, QuantizationMixin):

    def on_event(self, state: State, event: Event, **kwargs):
        if event.type_ == EventType.CALIBRATION_EPOCH_START:
            if not self.started_:
                self.on_start(state, None)

        if event.type_ == EventType.SEQUENTIAL_EPOCH_END:
            self.compress_modules()

        if event.type_ == EventType.CALIBRATION_EPOCH_END:
            self.compress_modules()

            if not self.ended_:
                self.on_end(state, None)

    def compress_modules(self):
        """
        Apply AutoRound tuning to a single decoder block.
        This method delegates the actual quantization to the AutoRound
        library and updates the block's weights, scales, and zero points
        accordingly.
        """
        # TODO: Integrate with external tuning API.
        # Example (pseudo, TBD):
        #     tuned_layer = autoround.quant_block(decoding_layer, config=self.config)
        #     self._update_layer(decoding_layer, tuned_layer)
        pass
```
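
For reference, this is how an end user might drive the proposed modifier. The flow mirrors how existing modifiers such as GPTQModifier are used with oneshot today; the AutoRoundModifier arguments shown (targets, scheme, ignore) are assumptions borrowed from QuantizationMixin, not a final API.

```python
# Hypothetical end-user flow for Option 1 (AutoRoundModifier is the class
# proposed in this RFC; argument names are assumptions, not a final API).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import AutoRoundModifier  # proposed

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",
    dataset="open_platypus",
    recipe=AutoRoundModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
)
```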

Option 2. Introduce a New Pipeline Wrapping Intel Neural Compressor (INC)

AutoRound is already implemented in Intel/neural-compressor. INC supports popular model compression techniques across mainstream deep learning frameworks, including PyTorch, TensorFlow, and JAX (planned for the near future).
In this option, we propose introducing a new pipeline that wraps INC. This approach would allow us to leverage INC's framework-abstraction infrastructure, paving the way for supporting more frameworks in the future.
(Figure: Option 2 integration flow)
(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)

```python
# llm-compressor/src/llmcompressor/pipelines/inc/pipeline.py

from typing import TYPE_CHECKING, Union

import torch
import tqdm
from torch.utils.data import DataLoader

# NOTE: helper import paths follow the current LLM Compressor layout and may
# shift during review.
from compressed_tensors.utils import get_execution_device
from llmcompressor.pipelines.registry import CalibrationPipeline
from llmcompressor.utils.dev import dispatch_for_generation

if TYPE_CHECKING:
    from llmcompressor.args import DatasetArguments


@CalibrationPipeline.register("inc")
class INCPipeline(CalibrationPipeline):
    @staticmethod
    def __call__(
        model: torch.nn.Module,
        dataloader: DataLoader,
        dataset_args: Union["DatasetArguments", None],
    ):
        """
        Applies the INC calibration procedure.
        """
        dispatch_for_generation(model)  # basic dispatch is identical to generation
        model_device = get_execution_device(model)

        from neural_compressor.torch.quantization import convert, prepare

        inc_config = prepare_autoround_config(...)  # config mapping TBD
        model = prepare(model, inc_config)
        # apply calibration by running batches through the prepared model
        for batch in tqdm.tqdm(dataloader, desc="Calibrating"):
            model(**batch)
        model = convert(model)
```
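
The prepare_autoround_config helper above is left as a placeholder. One possible shape for it, assuming INC's existing AutoRoundConfig entry point, is sketched below; the exact keyword arguments would need to be mapped from LLM Compressor's recipe and dataset arguments.

```python
# Possible shape of the prepare_autoround_config placeholder above.
# AutoRoundConfig is INC's existing config class for AutoRound; the kwargs
# shown are a subset, and the mapping from LLM Compressor args is TBD.
from neural_compressor.torch.quantization import AutoRoundConfig


def prepare_autoround_config(bits: int = 4, group_size: int = 128):
    return AutoRoundConfig(bits=bits, group_size=group_size)
```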

Comparison Between the Two Options:

| Item | Option 1: AutoRound Modifier | Option 2: INC Pipeline |
| --- | --- | --- |
| Architectural Alignment | Native fit within LLM Compressor’s modifier paradigm. | A new pipeline. |
| Dependencies | Only requires the auto-round library. | Entire INC toolkit and its dependencies. |
| Implementation Effort | Utilizes existing APIs with localized modifications. | Requires designing a new pipeline and mapping configurations. |
| Model Scope | Focus on LLMs and VLMs. | Broader, including LLMs, VLMs, and models from other domains like CV. |
| Framework Support | Focus on PyTorch. | Multi-framework: inherits INC’s PyTorch/TF/JAX support. |

Based on the comparison, we recommend going with Option 1 to add AutoRound support. Please feel free to comment on the flow above or suggest additional approaches. :) Thank you in advance!

cc @hshen14 @thuang6 @wenhuach21

Labels

enhancement (New feature or request), nvfp4 (For any PR / issue related to NVFP4 support), wNa16 (Anything related to weight-only int-quantized support)
