Draft
Changes from 3 commits
58 changes: 58 additions & 0 deletions docs/steps/choosing-algo.md
@@ -24,6 +24,64 @@ Weight and activation quantization is best for maximum throughput on modern hard
!!! note
AWQ and GPTQ are typically used for weight-only quantization but can also be applied to weight and activation quantization workflows.

### AWQ details

The AWQ recipe uses the `AWQModifier`, which adjusts model scales ahead of weight quantization:

```python
recipe = [
AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
```

AWQ requires layer mappings to identify where to apply activation-aware scaling. Mappings for common model families are built in, but you can supply your own via the `mappings` argument. For example, the Llama mapping looks like:

```python
[
AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_proj", "re:.*up_proj"]),
AWQMapping("re:.*up_proj", ["re:.*down_proj"]),
]
```

!!! note
Mappings define which layers get smoothed, while `targets` and `ignore` define which layers get quantized. A layer in the `ignore` list that is matched by a mapping will still be smoothed but not quantized.

To add support for a new model family, contribute mappings to the [mappings registry](/src/llmcompressor/modifiers/awq/mappings.py).

### AutoRound details

AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. It processes each decoder layer sequentially using block-wise output reconstruction error as the training objective.

**When to use AutoRound:**

- **INT4 for large models (≈30B+):** Performance comparable to other PTQ methods; accuracy drop is generally minimal at this scale.
- **INT4 for small-to-medium models:** Likely to deliver higher accuracy than other PTQ methods.
- **Sub-4-bit (INT2/INT3):** Shows 10–20% absolute accuracy improvements over PTQ methods, matching QAT performance at 1–2 orders of magnitude lower tuning cost.
- **New data types (MXFP4/NVFP4):** Consistently outperforms RTN in accuracy for emerging floating-point formats.

**Key parameters:**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `scheme` | Quantization scheme (e.g. `W4A16`, `W8A16`) | — |
| `iters` | Tuning iterations per block | 200 |
| `batch_size` | Batch size for calibration | 8 |
| `lr` | Learning rate; auto-set to `1.0/iters` if `None` | `None` |

**Recommended configurations:**

| Mode | Batch Size | Iters | Seq Length | Samples | Speed | Memory | Accuracy |
|------|------------|-------|------------|---------|-------|--------|----------|
| `default` | 8 | 200 | 2048 | 128 | Fast | Medium | Good |
| `best` | 8 | 1000 | 2048 | 512 | Slow | High | Best |
| `light` | 8 | 50 | 2048 | 128 | Fastest | Medium | Slight drop |
| `fast` | 4 | 200 | 512 | 128 | Fastest | Low | Good |

!!! note
AutoRound currently supports WNA16, NVFP4, and W8A8-FP8 quantization schemes. Support for additional schemes is planned; follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968).

## KV cache and attention quantization

KV cache quantization reduces memory usage for long context inference:
80 changes: 0 additions & 80 deletions examples/autoround/README.md

This file was deleted.

47 changes: 0 additions & 47 deletions examples/awq/README.md

This file was deleted.

61 changes: 0 additions & 61 deletions examples/awq/RESULTS.md

This file was deleted.

80 changes: 6 additions & 74 deletions ...autoround/quantization_w4a4_fp4/README.md → ...odels/llama3/w4a4_fp4/autoround/README.md
100755 → 100644
@@ -4,29 +4,15 @@

AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.

## Installation

To get started, install:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes end-to-end scripts for applying the AutoRound quantization algorithm.

### Llama 3.1 Example
## Llama 3.1 Example

```bash
python3 llama3.1_example.py
```

The resulting model `Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM.

#### Evaluate Accuracy
### Evaluate Accuracy

With the model created, we can now load and run it in vLLM (after installing vLLM).

@@ -47,86 +33,32 @@ lm_eval --model vllm \
--batch_size 'auto'
```

##### meta-llama/Meta-Llama-3.1-8B-Instruct
#### meta-llama/Meta-Llama-3.1-8B-Instruct
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7710|± |0.0116|
| | |strict-match | 5|exact_match|↑ |0.7043|± |0.0126|

##### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier)
#### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier)
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7248|± |0.0123|
| | |strict-match | 5|exact_match|↑ |0.6611|± |0.0130|


##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0)
#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0)
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7362|± |0.0121|
| | |strict-match | 5|exact_match|↑ |0.6702|± |0.0129|

##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7210|± |0.0124|
| | |strict-match | 5|exact_match|↑ |0.6945|± |0.0127|

> Note: quantized model accuracy may vary slightly due to nondeterminism.

### Qwen3-VL Example

```bash
python3 qwen3_vl_example.py
```

The resulting model `Qwen3-VL-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM.

#### Evaluate Accuracy

Run the following to test accuracy on GSM-8K and ChartQA:

```bash
lm_eval --model vllm-vlm \
--model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size 'auto'

lm_eval --model vllm-vlm \
--model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
--tasks chartqa \
--batch_size 'auto' \
--apply_chat_template
```

##### Qwen/Qwen3-VL-8B-Instruct (Baseline)
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8628|± |0.0095|
| | |strict-match | 5|exact_match|↑ |0.8453|± |0.0100|

| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.7908|± |0.0081|
| | |none | 0|exact_match |↑ |0.5592|± |0.0099|
| | |none | 0|relaxed_accuracy |↑ |0.7696|± |0.0084|


##### Qwen3-VL-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8415|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8408|± |0.0101|

| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.8220|± |0.0077|
| | |none | 0|exact_match |↑ |0.5748|± |0.0099|
| | |none | 0|relaxed_accuracy |↑ |0.8044|± |0.0079|

> Note: quantized model accuracy may vary slightly due to nondeterminism.

### Questions or Feature Requests?

Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round).