| 1 | +# Applying Transforms to Improve Quantization Accuracy |
| 2 | + |
| 3 | +This directory contains example scripts that apply transforms to models to improve quantization accuracy. For more information on transforms, see [QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456). Two transform styles are currently supported: SpinQuant/QuaRot-style (`SpinQuantModifier`) and QuIP-style (`QuIPModifier`). |
| 4 | + |
| 5 | +See also [[vLLM Office Hours #31] vLLM and LLM Compressor Update - August 28, 2025](https://www.youtube.com/watch?v=WVenRmF4dPY&list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3&index=3). |
| 6 | + |
| 7 | +## Installation |
| 8 | + |
| 9 | +To get started, install the necessary dependencies by executing the following commands: |
| 10 | + |
| 11 | +```bash |
| 12 | +git clone https://github.com/vllm-project/llm-compressor.git |
| 13 | +cd llm-compressor |
| 14 | +pip install -e . |
| 15 | +``` |
| 16 | + |
| 17 | +## Quickstart |
| 18 | + |
| 19 | +The provided example script applies QuIP-style transforms before quantization. |
| 20 | + |
| 21 | +```bash |
| 22 | +python3 quip_example.py |
| 23 | +``` |
| 24 | + |
| 25 | +### Step 1: Select a Model, Dataset, and Recipe |
| 26 | + |
| 27 | +In this step, you'll choose a base model to quantize and a recipe that applies a transform followed by quantization. |
| 28 | + |
| 29 | +- **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub. |
| 30 | +- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we apply the `QuIPModifier` before the `QuantizationModifier`, with the quantization scheme set to `W4A16`. |
| 31 | + |
| 32 | +```python |
| 33 | +from llmcompressor.modifiers.transform import QuIPModifier |
| 34 | +from llmcompressor.modifiers.quantization import QuantizationModifier |
| 35 | + |
| 36 | +recipe = [ |
| 37 | +    QuIPModifier( |
| 38 | +        rotations=["v", "u"], transform_block_size=128, transform_type="hadamard" |
| 39 | +    ), |
| 40 | +    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]), |
| 41 | +] |
| 42 | +``` |
| 43 | + |
| 44 | +Note that `QuIPModifier` can be customized. For a full list of the available arguments, see the [docstring](/src/llmcompressor/modifiers/transform/quip/base.py) or documentation. |
| 45 | + |
| 46 | +* `rotations` determines which rotations are applied: the input rotation (`v`), the output rotation (`u`), or both. |
| 47 | +* `transform_block_size` determines the size of the Hadamard matrix. Smaller Hadamard matrices cost less at runtime. |
| 48 | +* `transform_type` determines how the transform is constructed. `hadamard` uses the Sylvester construction. |
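To make the Sylvester construction and the role of a rotation concrete, here is a minimal sketch in plain Python. It is not LLM Compressor's implementation: the helper names (`sylvester_hadamard`, `matmul`, `matvec`, `transpose`) and the toy weight and input are assumptions made for illustration. It builds a normalized Hadamard matrix `Q`, rotates the input dimension of a weight (`W Q`), applies the inverse rotation to the input (`Qᵀ x`), and leaves it to the reader to check that the layer output is unchanged.

```python
import math


def sylvester_hadamard(n):
    """Return the n x n Sylvester Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Doubling step: H_2k = [[H_k, H_k], [H_k, -H_k]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def matvec(m, v):
    return [sum(p * q for p, q in zip(row, v)) for row in m]


n = 4
scale = 1 / math.sqrt(n)
# Scaling by 1/sqrt(n) makes the matrix orthonormal: Q @ Q^T = I.
q = [[v * scale for v in row] for row in sylvester_hadamard(n)]

w = [[0.5, -1.0, 2.0, 8.0], [1.0, 0.0, -3.0, 0.25]]  # toy weight (assumption)
x = [1.0, 2.0, -1.0, 0.5]                            # toy input (assumption)

w_rot = matmul(w, q)             # rotated weight: W Q
x_rot = matvec(transpose(q), x)  # inverse-rotated input: Q^T x

y_ref = matvec(w, x)          # original output: W x
y_rot = matvec(w_rot, x_rot)  # (W Q)(Q^T x), which equals W x
```

Because `Q` is orthonormal, `y_rot` matches `y_ref` up to floating-point error, while the entries of `w_rot` are spread more evenly than those of `w`, which is the property that helps quantization.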
| 49 | + |
| 50 | +### Step 2: Run Quantization Using Oneshot |
| 51 | + |
| 52 | +The `oneshot` method applies the selected recipe to your model without requiring any fine-tuning. The model will be quantized and saved to `Llama-3.1-8B-Instruct-quip-w4a16`. We use the "datafree" pipeline, since our recipe does not require calibration data. The snippet assumes `model`, `tokenizer`, and `MODEL_ID` are already defined, as in `quip_example.py`. |
| 53 | + |
| 54 | +```python |
| 55 | +from llmcompressor import oneshot |
| 56 | + |
| 57 | +# Apply algorithms. |
| 58 | +oneshot(model=model, recipe=recipe, pipeline="datafree") |
| 59 | + |
| 60 | +# Save to disk compressed. |
| 61 | +SAVE_DIR = MODEL_ID.split("/")[1] + "-quip-w4a16" |
| 62 | +model.save_pretrained(SAVE_DIR, save_compressed=True) |
| 63 | +tokenizer.save_pretrained(SAVE_DIR) |
| 64 | +``` |
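The `SAVE_DIR` expression above takes the second `/`-separated component of `MODEL_ID`, which fits a Hub ID of the form `org/name`. Since models can also be loaded from a local directory, the following sketch shows one way to derive the directory name from the last path component instead. The helper name `save_dir_for` and the example IDs are hypothetical and not part of the example script.

```python
def save_dir_for(model_id, suffix="-quip-w4a16"):
    """Build a save directory name from a Hub ID or a local path (illustrative helper)."""
    # Use the last path component so "org/name" and "/path/to/name/" both work.
    name = model_id.rstrip("/").split("/")[-1]
    return name + suffix


hub_dir = save_dir_for("meta-llama/Llama-3.2-1B-Instruct")
local_dir = save_dir_for("/models/Llama-3.2-1B-Instruct/")
```

Both calls yield the same directory name, so the benchmark command in the next step does not depend on where the base model was loaded from.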
| 65 | + |
| 66 | +### Step 3: Run Optimized Model in vLLM |
| 67 | +Models optimized with the `hadamard` transform type can use the hadacore kernels for accelerated inference. Use the [benchmarks/latency.py](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/latency.py) script to benchmark latency: |
| 68 | + |
| 69 | +```bash |
| 70 | +python3 benchmarks/benchmark_latency.py --model path/to/Llama-3.2-1B-Instruct-quip-w4a16 |
| 71 | +``` |
| 72 | + |
| 73 | + |
| 74 | +#### Dense Model Latency (sec) #### |
| 75 | +| [Base](https://huggingface.co/meta-llama/Llama-3.2-1B-instruct) | Hadacore | GEMM | |
| 76 | +| - | - | - | |
| 77 | +| 0.4710 | 0.4948 | 1.3946 | |
| 78 | + |
| 79 | +#### Quantized Model Latency (sec) #### |
| 80 | +| Base W4A16 | Hadacore | GEMM | |
| 81 | +| - | - | - | |
| 82 | +| 0.4402 | 0.4489 | 1.2917 | |
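To read the two tables as relative overhead, the short calculation below divides each transformed-model latency by its baseline. The numbers are copied from the tables above; the dictionary layout and the `slowdown` helper are only illustrative.

```python
# Latencies in seconds, copied from the tables above.
latency = {
    "dense": {"base": 0.4710, "hadacore": 0.4948, "gemm": 1.3946},
    "quantized": {"base": 0.4402, "hadacore": 0.4489, "gemm": 1.2917},
}


def slowdown(row):
    """Return each non-baseline latency as a multiple of the baseline."""
    return {name: value / row["base"] for name, value in row.items() if name != "base"}


ratios = {model: slowdown(row) for model, row in latency.items()}
for model, r in ratios.items():
    print(f"{model}: hadacore {r['hadacore']:.2f}x, gemm {r['gemm']:.2f}x")
```

With these numbers, the Hadacore column is about 5% slower than the baseline for the dense model and about 2% slower for the W4A16 model, while the GEMM column is roughly 3x slower in both cases.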