
RFC: Add Intel AutoRound Quantization Algorithm Support #1968


Description

@yiliu30

Hi, this is the INC team from Intel. Thank you for developing this amazing project!

Our team has developed AutoRound, a novel tuning-based quantization algorithm that delivers state-of-the-art accuracy with low tuning cost and no extra inference overhead. The key features are listed below:

  • Superior Accuracy: Achieves leading performance across various scenarios, including very low bits (2-3 bits).
  • Multiple Data Types: Effective for a variety of data types (W4A16, MXFP8, MXFP4, FP8, NVFP4, etc.).
  • Broad Model Support: Validated on both Large Language Models (LLMs) and Vision-Language Models (VLMs).
  • Advanced Mixed-Precision: Supports layer-wise mixed-bit quantization and automatic bit-width search.
  • Flexible Tuning: Offers a configurable tuning space to balance tuning cost and accuracy goals.

For more detailed information, please refer to our paper and GitHub repository.

We propose integrating AutoRound into LLM Compressor to provide users with a simple, high-accuracy quantization option.

The Key Idea of AutoRound

AutoRound quantizes a given tensor by introducing three trainable parameters (V, α, and β) that adjust the rounding value and the clipping range. For a given Transformers model, AutoRound quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.

(Figure: AutoRound overview)
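
To make the mechanics concrete, below is a minimal PyTorch sketch of the trainable rounding step. This is not the actual intel/auto-round implementation; the function names, the asymmetric per-channel min/max scheme, and the shapes are illustrative assumptions.

```python
# Minimal sketch of AutoRound-style trainable rounding (illustrative only;
# see intel/auto-round for the real implementation).
import torch


def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: forward = round(x), backward = identity,
    # so gradients can reach the trainable parameters V, alpha, and beta.
    return x + (torch.round(x) - x).detach()


def autoround_fake_quant(w, v, alpha, beta, num_bits=4):
    # alpha and beta rescale the clipping range used to derive the
    # scale and zero point (asymmetric per-row quantization assumed here).
    w_max = w.amax(dim=-1, keepdim=True) * alpha
    w_min = w.amin(dim=-1, keepdim=True) * beta
    scale = (w_max - w_min) / (2**num_bits - 1)
    zp = ste_round(-w_min / scale)
    # V perturbs each weight's rounding decision before clamping.
    q = torch.clamp(ste_round(w / scale + v) + zp, 0, 2**num_bits - 1)
    return (q - zp) * scale
```

During tuning, V, α, and β are optimized per decoder block to minimize the reconstruction error between the original block's output and the quantized block's output on calibration data; the tuned rounding is then frozen, so inference incurs no extra overhead.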

Integration Proposal

AutoRound is already integrated into popular LLM ecosystems such as Transformers, vLLM, SGLang, and TorchAO. For LLM Compressor, we propose the two integration options below:

Option 1. A New Modifier for AutoRound (Recommended)

This approach integrates AutoRound as a new modifier within LLM Compressor. Specifically, we propose creating an AutoRoundModifier that wraps the AutoRound algorithm, delegating the actual tuning logic to the intel/auto-round library.
This method provides a lightweight, maintainable integration that directly leverages the AutoRound library while aligning perfectly with LLM Compressor's modifier architecture.

(Figure: Option 1 integration flow)
(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)

The initial design and potential code changes are as follows:

```python
# LLM Compressor/src/llmcompressor/modifiers/quantization/autoround/core.py

# NOTE: import paths follow the current LLM Compressor layout and may shift
# during review.
from llmcompressor.core import Event, EventType, State
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization import QuantizationMixin


class AutoRoundModifier(Modifier, QuantizationMixin):

    def on_event(self, state: State, event: Event, **kwargs):
        if event.type_ == EventType.CALIBRATION_EPOCH_START:
            if not self.started_:
                self.on_start(state, None)

        if event.type_ == EventType.SEQUENTIAL_EPOCH_END:
            self.compress_modules()

        if event.type_ == EventType.CALIBRATION_EPOCH_END:
            self.compress_modules()

            if not self.ended_:
                self.on_end(state, None)

    def compress_modules(self):
        """
        Apply AutoRound tuning to a single decoder block.
        This method delegates the actual quantization to the AutoRound
        library and updates the block's weights, scales, and zero points
        accordingly.
        """
        # TODO: Integrate with external tuning API.
        # Example (pseudo, TBD):
        #     tuned_layer = autoround.quant_block(decoding_layer, config=self.config)
        #     self._update_layer(decoding_layer, tuned_layer)
        pass
```
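
For reference, this is how an end user might drive the proposed modifier. The flow mirrors how existing modifiers such as GPTQModifier are used with oneshot today; the AutoRoundModifier arguments shown (targets, scheme, ignore) are assumptions borrowed from QuantizationMixin, not a final API.

```python
# Hypothetical end-user flow for Option 1 (AutoRoundModifier is the class
# proposed in this RFC; argument names are assumptions, not a final API).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import AutoRoundModifier  # proposed

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",
    dataset="open_platypus",
    recipe=AutoRoundModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
)
```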

Option 2. Introduce a New Pipeline Wrapping Intel Neural Compressor (INC)

AutoRound is already implemented in Intel/neural-compressor. INC supports popular model compression techniques across mainstream deep learning frameworks, including PyTorch, TensorFlow, and JAX (planned for the near future).
In this option, we propose introducing a new pipeline that wraps INC. This approach would allow us to leverage INC's framework-abstraction infrastructure, paving the way for supporting more frameworks in the future.
(Figure: Option 2 integration flow)
(Modified from deepwiki https://deepwiki.com/vllm-project/llm-compressor/)

```python
# llm-compressor/src/llmcompressor/pipelines/inc/pipeline.py

from typing import TYPE_CHECKING, Union

import torch
import tqdm
from torch.utils.data import DataLoader

# NOTE: helper import paths follow the current LLM Compressor layout and may
# shift during review.
from compressed_tensors.utils import get_execution_device
from llmcompressor.pipelines.registry import CalibrationPipeline
from llmcompressor.utils.dev import dispatch_for_generation

if TYPE_CHECKING:
    from llmcompressor.args import DatasetArguments


@CalibrationPipeline.register("inc")
class INCPipeline(CalibrationPipeline):
    @staticmethod
    def __call__(
        model: torch.nn.Module,
        dataloader: DataLoader,
        dataset_args: Union["DatasetArguments", None],
    ):
        """
        Applies the INC calibration procedure.
        """
        dispatch_for_generation(model)  # basic dispatch is identical to generation
        model_device = get_execution_device(model)

        from neural_compressor.torch.quantization import convert, prepare

        inc_config = prepare_autoround_config(...)  # config mapping TBD
        model = prepare(model, inc_config)
        # apply calibration by running batches through the prepared model
        for batch in tqdm.tqdm(dataloader, desc="Calibrating"):
            model(**batch)
        model = convert(model)
```
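
The prepare_autoround_config helper above is left as a placeholder. One possible shape for it, assuming INC's existing AutoRoundConfig entry point, is sketched below; the exact keyword arguments would need to be mapped from LLM Compressor's recipe and dataset arguments.

```python
# Possible shape of the prepare_autoround_config placeholder above.
# AutoRoundConfig is INC's existing config class for AutoRound; the kwargs
# shown are a subset, and the mapping from LLM Compressor args is TBD.
from neural_compressor.torch.quantization import AutoRoundConfig


def prepare_autoround_config(bits: int = 4, group_size: int = 128):
    return AutoRoundConfig(bits=bits, group_size=group_size)
```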

Comparison Between the Two Options:

| Item | Option 1: AutoRound Modifier | Option 2: INC Pipeline |
| --- | --- | --- |
| Architectural Alignment | Native fit within LLM Compressor’s modifier paradigm. | A new pipeline. |
| Dependencies | Only requires the auto-round library. | Entire INC toolkit and its dependencies. |
| Implementation Effort | Utilizes existing APIs with localized modifications. | Requires designing a new pipeline and mapping configurations. |
| Model Scope | Focus on LLMs and VLMs. | Broader, including LLMs, VLMs, and models from other domains like CV. |
| Framework Support | Focus on PyTorch. | Multi-framework: inherits INC’s PyTorch/TF/JAX support. |

Based on the comparison, we recommend going with Option 1 to add AutoRound support. Please feel free to comment on the flow above or suggest additional approaches. :) Thank you in advance!

cc @hshen14 @thuang6 @wenhuach21

Labels

enhancement (New feature or request), nvfp4 (For any PR / issue related to NVFP4 support), wNa16 (Anything related to weight-only int-quantized support)
