
Commit 640147b

Authored by brian-dellabetta, kylesayrs, and rahul-tuli
[transforms] update examples so hadacore kernel is used by default (#1883)
SUMMARY: Quick follow-up to the recently merged #1870. Updates our `examples/transform` scripts to:

- [x] default to `transform_type="hadamard"`, which is preferred so that the vllm hadacore kernel is used
- [x] default to `transform_block_size=128`, which is preferred for group-size 128 schemes like W4A16

TEST PLAN: Previously confirmed that the hadacore kernel was being invoked for `transform_type="hadamard"`.

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
1 parent 244a281 commit 640147b

File tree

- examples/quantizing_moe/README.md
- examples/transform/README.md
- examples/transform/quip_example.py
- examples/transform/spinquant_example.py

4 files changed: +90 −4 lines changed

examples/quantizing_moe/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,6 +1,6 @@
 # Quantizing Mixtral-8x7B-Instruct-v0.1 Model with FP8

-This directory contains an example script for quantizing the `Mixtral-8x7B-Instruct-v0.1` model using the static per-tensor FP8 quantization scheme.
+This directory contains example scripts for quantizing LLMs using the static per-tensor FP8 quantization scheme.

 ## Installation

@@ -32,7 +32,7 @@ python mixtral_example.py

 ### Step 1: Select a Model, Dataset, and Recipe

-In this step, you'll choose a baseline model for quantization, a dataset for calibration, and a quantization recipe.
+In this step, you'll choose a base model for quantization, a dataset for calibration, and a quantization recipe.

 - **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
 - **Datasets**: Can also be from a local directory or the Hugging Face Hub.
```

examples/transform/README.md

Lines changed: 82 additions & 0 deletions
# Applying Transforms to Improve Quantization Accuracy

This directory contains example scripts for applying transforms to models for the purpose of improving quantization accuracy. For more information on transforms, see [QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456). The two transform styles currently supported are SpinQuant/QuaRot-style (`SpinQuantModifier`) and QuIP-style (`QuIPModifier`).

See also [[vLLM Office Hours #31] vLLM and LLM Compressor Update - August 28, 2025](https://www.youtube.com/watch?v=WVenRmF4dPY&list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3&index=3).

## Installation

To get started, install the necessary dependencies by executing the following commands:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The provided example script demonstrates the process for applying QuIP-style transforms before quantization.
```bash
python3 quip_example.py
```
### Step 1: Select a Model, Dataset, and Recipe

In this step, you'll choose a base model for quantization and a transformation + quantization recipe.

- **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use the `QuIPModifier` applied before the `QuantizationModifier` with the scheme set to `W4A16`.

```python
from llmcompressor.modifiers.transform import QuIPModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
    QuIPModifier(
        rotations=["v", "u"], transform_block_size=128, transform_type="hadamard"
    ),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```
Note that `QuIPModifier` can be customized. For a full list of the available arguments, see the [docstring](/src/llmcompressor/modifiers/transform/quip/base.py) or documentation.

* `rotations` determines which of the input rotations (v) and output rotations (u) are applied.
* `transform_block_size` determines the size of the Hadamard matrices. Smaller Hadamards incur lower runtime cost.
* `transform_type` determines how the transform is constructed. `hadamard` uses the Sylvester construction, sketched below.
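For illustration, here is a minimal NumPy sketch of the Sylvester construction (illustrative only, not part of the example scripts; `sylvester_hadamard` is a hypothetical helper name):

```python
import numpy as np

def sylvester_hadamard(size: int) -> np.ndarray:
    """Build a size x size Hadamard matrix by repeated Sylvester doubling."""
    assert size > 0 and (size & (size - 1)) == 0, "size must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < size:
        # Sylvester step: H_{2n} = [[H_n, H_n], [H_n, -H_n]]
        H = np.block([[H, H], [H, -H]])
    return H

# Rows are mutually orthogonal: H @ H.T == size * I
print(sylvester_hadamard(4))
```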
### Step 2: Run Quantization Using Oneshot

The `oneshot` method applies the selected recipe to your model and dataset without requiring any fine-tuning. The model will be quantized and saved to `Llama-3.1-8B-Instruct-quip-w4a16`. We use the "datafree" pipeline, since our recipe does not require calibration data.
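The snippet below assumes `model`, `tokenizer`, and `MODEL_ID` are already defined; a minimal sketch of one way to set them up (the model ID here is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; substitute the base model you want to quantize.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```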
```python
from llmcompressor import oneshot

# Apply algorithms.
oneshot(model=model, recipe=recipe, pipeline="datafree")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-quip-w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
### Step 3: Run Optimized Model in vLLM

Models optimized with the `hadamard` transform type can leverage the hadacore kernels for accelerated inference. Use the [benchmarks/latency.py](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/latency.py) script to benchmark latency.

```bash
python3 benchmarks/benchmark_latency.py --model path/to/Llama-3.2-1B-Instruct-quip-w4a16
```
#### Dense Model Latency (sec)

| [Base](https://huggingface.co/meta-llama/Llama-3.2-1B-instruct) | Hadacore | GEMM |
| - | - | - |
| 0.4710 | 0.4948 | 1.3946 |

#### Quantized Model Latency (sec)

| Base W4A16 | Hadacore | GEMM |
| - | - | - |
| 0.4402 | 0.4489 | 1.2917 |
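The quantized model can also be loaded directly for offline inference; a minimal sketch (the path is a placeholder for the directory saved in Step 2):

```python
from vllm import LLM, SamplingParams

# Placeholder path; point this at the directory saved in Step 2.
llm = LLM(model="path/to/Llama-3.1-8B-Instruct-quip-w4a16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```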

examples/transform/quip_example.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -20,9 +20,11 @@
 # Configure the quantization algorithm to run.
 # * apply quip transforms to model in order to make quantization easier
 # * quantize the weights to 4 bit with a group size 128
+# * NOTE: if a model has activation shapes not divisible by 2^N, consider using
+#   `random-hadamard` (random hadamard kernels will be added in the future)
 recipe = [
     QuIPModifier(
-        rotations=["v", "u"], transform_block_size=128, transform_type="random-hadamard"
+        rotations=["v", "u"], transform_block_size=128, transform_type="hadamard"
     ),
     QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
 ]
```
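For models whose activation shapes are not divisible by 2^N, the NOTE above suggests `random-hadamard`; a minimal sketch of that variant, mirroring the previous default shown in the diff:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

# Variant recipe for activation shapes not divisible by 2^N,
# per the NOTE above: use random-hadamard transforms instead.
recipe = [
    QuIPModifier(
        rotations=["v", "u"],
        transform_block_size=128,
        transform_type="random-hadamard",
    ),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```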

examples/transform/spinquant_example.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -18,7 +18,9 @@
 # * quantize the weights to 4 bit with group size 128
 recipe = [
     SpinQuantModifier(
-        rotations=["R1", "R2", "R4"], transform_block_size=64, transform_type="hadamard"
+        rotations=["R1", "R2", "R4"],
+        transform_block_size=128,
+        transform_type="hadamard",
     ),
     QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
 ]
```
