llm-compressor supports AutoRound, an advanced quantization technique that delivers high-accuracy, low-bit quantization. The quantized results are fully compatible with compressed-tensors and can be served directly with vLLM.
AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
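To make the mechanism concrete, here is a minimal NumPy sketch of the core idea — not the actual implementation. A per-weight rounding offset `V` (standing in for AutoRound's trainable parameters) is tuned to minimize the block's output reconstruction error; a naive coordinate search is used here in place of gradient-based tuning:

```python
import numpy as np

# Illustrative sketch of AutoRound's core idea (not the real implementation):
# a trainable offset V shifts each weight's rounding decision and is tuned to
# minimize the block's output reconstruction error.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))       # a tiny "layer" weight
X = rng.normal(size=(8, 16))      # calibration activations
scale = np.abs(W).max() / 7.0     # symmetric 4-bit scale, range [-8, 7]

def quantize(W, V):
    # Round-to-nearest, shifted by the learnable offset V in [-0.5, 0.5]
    return np.clip(np.round(W / scale + V), -8, 7) * scale

def block_error(V):
    # Output reconstruction error of the quantized block
    return np.mean((W @ X - quantize(W, V) @ X) ** 2)

# Greedy coordinate search stands in for the gradient-based tuning AutoRound
# actually performs on V (and on the clipping parameters alpha and beta).
V = np.zeros_like(W)
candidates = np.linspace(-0.5, 0.5, 21)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        errs = []
        for c in candidates:
            V[i, j] = c
            errs.append(block_error(V))
        V[i, j] = candidates[int(np.argmin(errs))]

rtn_err = block_error(np.zeros_like(W))   # plain round-to-nearest baseline
tuned_err = block_error(V)                # tuned rounding offsets
```

Because the candidate set includes the round-to-nearest choice (`V = 0`), the tuned error can only match or improve on the RTN baseline; in the real method, the same objective is minimized with gradients, one decoder block at a time.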
To get started, install llm-compressor from source:

```shell
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

In summary, AutoRound demonstrates leading or on-par performance at 4-bit precision, with clear advantages at sub-4-bit precision, as reported in SignRoundV1 (paper), SignRoundV2 (paper), and the Intel Low-Bit Open LLM Leaderboard (link).
- **INT4 for Large Models (≈30B and above):** AutoRound achieves performance comparable to other PTQ methods, as the accuracy drop for these large models is generally minimal.
- **INT4 for Small-to-Medium LLMs:** AutoRound is likely to deliver higher accuracy than existing PTQ methods, making it particularly effective for smaller models. See SignRoundV1 and the Low-Bit Open LLM Leaderboard for accuracy data.
- **Sub-4-Bit Quantization (INT2/INT3):** As the bit-width decreases, AutoRound shows increasing benefits, achieving 10–20% absolute accuracy improvements over PTQ methods while matching QAT performance at 1–2 orders of magnitude lower tuning cost. See SignRoundV2 for details.
- **New Data Types (MXFP4/NVFP4):** For emerging floating-point formats, AutoRound consistently outperforms RTN in accuracy, demonstrating strong forward compatibility with evolving quantization standards. See SignRoundV2 for details.
- `scheme`: Quantization scheme (e.g., `W4A16`, `W8A16`; more schemes will be supported soon)
- `iters`: Number of tuning iterations per block. Default: 200
- `batch_size`: Batch size for calibration. Default: 8
- `lr`: Learning rate for tuning. If `None`, auto-set to `1.0/iters`. Default: `None`
- `NUM_CALIBRATION_SAMPLES`: Number of calibration samples. Default: 128
- `MAX_SEQUENCE_LENGTH`: Sequence length of calibration samples. Default: 2048
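These knobs could appear in an llm-compressor-style YAML recipe roughly as follows. This is a hedged sketch: the modifier name and exact field names are assumptions here, so consult the examples linked below for the authoritative form.

```yaml
# Hypothetical recipe fragment -- modifier/field names are assumptions
quant_stage:
  quant_modifiers:
    AutoRoundModifier:
      scheme: W4A16
      iters: 200
      batch_size: 8
      lr: null        # auto-set to 1.0 / iters when unspecified
```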
The accuracy of the quantized model depends on these tuning-related parameters. AutoRound provides four recommended configurations to balance accuracy and quantization speed:
| Mode | Batch Size | Iterations | Sequence Length | Calibration Samples | Learning Rate | Quantization Speed | Memory Usage | Accuracy |
|---|---|---|---|---|---|---|---|---|
| `default` | 8 | 200 | 2048 | 128 | Auto | 🚀🚀 | 🟡 Medium | 🎯🎯 Good |
| `best` | 8 | 1000 | 2048 | 512 | Auto | 🚀 | 🔴 High | 🏆 Best |
| `light` | 8 | 50 | 2048 | 128 | 5e-3 | 🚀🚀🚀 | 🟡 Medium | 🎯🎯 (slight drop in some cases) |
| `fast` | 4 | 200 | 512 | 128 | Auto | 🚀🚀🚀 | 🟢 Low | 🎯 |
Tip
- Use `best` for production models where accuracy is critical
- Use `light` for rapid iteration during development (2-3× speedup)
- Use `fast` when GPU memory is limited or for quick evaluation
- The `default` recipe provides a good balance for most use cases
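To make the trade-offs concrete, the table above can be expressed as a small lookup helper. Note that `tuning_params` is a hypothetical convenience function, not part of llm-compressor; the learning-rate handling follows the `1.0/iters` auto-set rule described earlier:

```python
# Hypothetical helper (not part of llm-compressor) mapping the recommended
# modes above to AutoRound tuning parameters.
RECIPES = {
    "default": dict(batch_size=8, iters=200, seq_len=2048, samples=128, lr=None),
    "best":    dict(batch_size=8, iters=1000, seq_len=2048, samples=512, lr=None),
    "light":   dict(batch_size=8, iters=50, seq_len=2048, samples=128, lr=5e-3),
    "fast":    dict(batch_size=4, iters=200, seq_len=512, samples=128, lr=None),
}

def tuning_params(mode: str) -> dict:
    """Return the tuning parameters for a mode, resolving the auto lr."""
    params = dict(RECIPES[mode])
    if params["lr"] is None:       # lr auto-set to 1.0 / iters when unspecified
        params["lr"] = 1.0 / params["iters"]
    return params
```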
Note
These configurations are based on our experiments and may vary depending on the model architecture.
| Scheme | Examples | Note |
|---|---|---|
| `wNa16` | llama3_example | |
| `wNa16` | qwen3_example | Multiple cards for Qwen3-235B-A22B |
| `wNa16` + FP8KV | llama3_example | |
| W8A8-FP8 Static | llama4_example | |
| W8A8-FP8 Dynamic | llama4_example | |
| NVFP4 | llama3.1_example | |
| MXFP4 | qwen3_example | |
Currently, llm-compressor supports applying AutoRound only with the WNA16, NVFP4, and W8A8-FP8 quantization schemes. Support for additional schemes is planned; you can follow progress in the RFC.
If you run into problems or have questions, please open an issue on vllm-project/llm-compressor or intel/auto-round.