
Commit 2d54f29

Merge remote-tracking branch 'origin' into kylesayrs/transform-spinquant-r4

2 parents 714d655 + e5591f4 · commit 2d54f29
13 files changed: +283 -111 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -18,11 +18,11 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
+* **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
 * **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
-* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
-* **Updated AWQ Support:** Improved support for MoEs with better handling of larger models
 * **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 
 ### Supported Formats
@@ -62,6 +62,7 @@ Applying quantization with `llmcompressor`:
 * [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
 * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
 * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
+* [Quantizing Models Non-uniformly](examples/quantization_non_uniform/README.md)
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:

docs/index.md

Lines changed: 9 additions & 6 deletions
@@ -15,18 +15,21 @@
 
 ## Recent Updates
 
+!!! info "QuIP and SpinQuant-style Transforms"
+    The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
+
+!!! info "DeepSeekV3-style Block Quantization Support"
+    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8.md).
+
+!!! info "FP4 Quantization - now with MoE and non-uniform support"
+    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
+
 !!! info "Llama4 Quantization Support"
     Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
 
 !!! info "Large Model Support with Sequential Onloading"
     As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
 
-!!! info "Preliminary FP4 Quantization Support"
-    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4.md) and [fp4 activation support](examples/quantization_w4a4_fp4.md). Support is currently preliminary and additional support will be added for MoEs.
-
-!!! info "Updated AWQ Support"
-    Improved support for MoEs with better handling of larger models
-
 !!! info "Axolotl Sparse Finetuning Integration"
     Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).

examples/awq/llama_example.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.awq import AWQModifier
+from llmcompressor.utils import dispatch_for_generation
 
 # Select model and load it.
 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
@@ -64,6 +65,7 @@ def tokenize(sample):
 # Confirm generations of the quantized model look sane.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
examples/quantization_non_uniform/README.md (new file)

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
# Non-uniform Quantization

In certain cases, it may be useful to combine quantization schemes of different precisions and/or strategies to achieve better recovery. For example, in some decoder-only models, the `down_proj` layer has shown greater sensitivity, and performance can be improved by quantizing this layer to int8 or fp8 instead of int4 or fp4. The examples in this folder illustrate several cases of non-uniform quantization.

## Mixed-Precision Quantization

We demonstrate mixed precision by quantizing models to both int8 and int4, and in a second example, to both fp4 (specifically, nvfp4) and fp8. In both cases, we use config groups to assign higher precision to the `down_proj` layer and lower precision to the remaining linear layers. For nvfp4 and fp8, we also apply two model compressors, `nvfp4-pack-quantized` and `float-quantized`. The resulting compressed model's `config.json` shows `mixed-precision` as the value for `format`, indicating that the model has been compressed using multiple formats. The specific format applied to each set of layers is specified under each config group's `format` key.
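
Below is a minimal sketch (not part of this commit) of what such a config-group recipe can look like. It assumes the `config_groups` argument of `QuantizationModifier` accepts per-group targets and weight arguments in the compressed-tensors style; the model ID, group names, and save directory are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Two config groups: int8 weights for the sensitive down_proj layers,
# int4 (group size 128) weights for the remaining linear projections.
recipe = QuantizationModifier(
    config_groups={
        "group_int8": {
            "targets": ["re:.*down_proj"],
            "weights": {
                "num_bits": 8,
                "type": "int",
                "symmetric": True,
                "strategy": "channel",
            },
        },
        "group_int4": {
            "targets": ["re:.*self_attn.*_proj", "re:.*gate_proj", "re:.*up_proj"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
        },
    },
    ignore=["lm_head"],
)

# Apply the recipe in one shot and save in compressed-tensors format.
oneshot(model=model, recipe=recipe)
model.save_pretrained("Meta-Llama-3-8B-Instruct-W8W4-mixed", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W8W4-mixed")
```

Because `down_proj` appears only in the first group's regex targets, the two groups do not overlap, so each layer receives exactly one scheme.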
## Multiple Strategies

It may also be useful to quantize a model with two different [quantization strategies](https://github.com/neuralmagic/compressed-tensors/blob/a2bfc03e9d52824ba5d6d2a50c8741dd9bccd5d3/src/compressed_tensors/quantization/quant_args.py#L93), such as group, channel, or per-tensor. [Here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_non_uniform/quantization_fp8_multiple_strategies.py) we apply fp8 quantization where all the attention weights are quantized using the per-channel strategy, and all the MLP weights are quantized using per-tensor. This is accomplished by defining multiple config groups in the recipe. The produced model is compressed using the `float-quantized` compressor and can be run directly in vLLM.
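
As a rough illustration of the two-strategy idea, here is a hedged sketch reusing the same hypothetical config-group layout as above; it is an assumption about the shape of such a recipe, not the exact contents of the linked example.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# fp8 weights everywhere, but per-channel weight scales for attention
# projections and per-tensor weight scales for the MLP projections.
fp8 = {"num_bits": 8, "type": "float", "symmetric": True}

recipe = QuantizationModifier(
    config_groups={
        "attn_fp8_channel": {
            "targets": ["re:.*self_attn.*_proj"],
            "weights": {**fp8, "strategy": "channel"},
            "input_activations": {**fp8, "strategy": "token", "dynamic": True},
        },
        "mlp_fp8_tensor": {
            "targets": ["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            "weights": {**fp8, "strategy": "tensor"},
            "input_activations": {**fp8, "strategy": "token", "dynamic": True},
        },
    },
    ignore=["lm_head"],
)
```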
New example file (Qwen2.5-VL FP8-Dynamic)

Lines changed: 37 additions & 0 deletions

@@ -0,0 +1,37 @@
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-channel scales via ptq
# * quantize the activations to fp8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:visual.*", "re:model.visual.*"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

examples/transform/quip_example.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 # * apply spinquant transforms to model in order to make quantization easier
 # * quantize the weights to 4 bit with a group size 128
 recipe = [
-    QuIPModifier(transform_type="random-hadamard"),
+    QuIPModifier(targets="Linear", transform_type="random-hadamard"),
     QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
 ]
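
For context, a minimal end-to-end sketch of how this recipe is typically driven, assumed from the load/oneshot/generate/save flow used by the other examples in this commit rather than taken from quip_example.py itself; the model ID, save directory, and the `llmcompressor.modifiers.transform` import path are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # import path assumed
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Inject Hadamard transforms, then quantize weights to 4 bits (group size 128).
recipe = [
    QuIPModifier(targets="Linear", transform_type="random-hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
oneshot(model=model, recipe=recipe)

# Sanity-check a generation, then save in compressed-tensors format.
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
print(tokenizer.decode(model.generate(input_ids, max_new_tokens=20)[0]))

SAVE_DIR = MODEL_ID.split("/")[-1] + "-quip-w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```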

setup.py

Lines changed: 58 additions & 15 deletions
@@ -110,24 +110,67 @@ def localversion_func(version: ScmVersion) -> str:
         "src", include=["llmcompressor", "llmcompressor.*"], exclude=["*.__pycache__.*"]
     ),
     install_requires=[
-        "loguru>=0.7.2",
-        "pyyaml>=5.0.0",
+        (
+            "loguru>=0.7.2,<=0.7.3"
+            if BUILD_TYPE == "release"
+            else "loguru>=0.7.2"
+        ),
+        (
+            "pyyaml>=6.0.1,<=6.0.2"
+            if BUILD_TYPE == "release"
+            else "pyyaml>=6.0.1"
+        ),
         # librosa dependency numba is currently not compatible with numpy>=2.3
         # https://numba.readthedocs.io/en/stable/user/installing.html#version-support-information
-        "numpy>=1.17.0,<2.3",
-        "requests>=2.0.0",
-        "tqdm>=4.0.0",
-        # torch 1.10 and 1.11 do not support quantized onnx export
-        "torch>=1.7.0,!=1.10,!=1.11",
-        "transformers>4.0",
-        "datasets>=3.0.0",
-        "accelerate>=0.20.3,!=1.1.0",
-        "pynvml>=11.5.3",
-        "pillow>=10.4.0",
+        (
+            "numpy>=2.0.0,<=2.3.2"
+            if BUILD_TYPE == "release"
+            else "numpy>=2.0.0"
+        ),
+        (
+            "requests>=2.32.2,<=2.32.5"
+            if BUILD_TYPE == "release"
+            else "requests>=2.32.2"
+        ),
+        (
+            "tqdm>=4.66.3,<=4.67.1"
+            if BUILD_TYPE == "release"
+            else "tqdm>=4.66.3"
+        ),
+        (
+            "torch>=2.7.0,<=2.8.0"
+            if BUILD_TYPE == "release"
+            else "torch>=2.7.0"
+        ),
+        (
+            "transformers>=4.53.0,<=4.55.2"
+            if BUILD_TYPE == "release"
+            else "transformers>=4.53.0"
+        ),
+        (
+            "datasets>=4.0.0,<=4.0.0"
+            if BUILD_TYPE == "release"
+            else "datasets>=4.0.0"
+        ),
+        (
+            "accelerate>=1.6.0,<=1.10.0"
+            if BUILD_TYPE == "release"
+            else "accelerate>=1.6.0"
+        ),
+        (
+            "pynvml>=11.5.3,<=12.0.0"
+            if BUILD_TYPE == "release"
+            else "pynvml>=11.5.3"
+        ),
+        (
+            "pillow>=10.4.0,<=10.4.0"
+            if BUILD_TYPE == "release"
+            else "pillow>=10.4.0"
+        ),
         (
-            "compressed-tensors==0.10.2"
+            "compressed-tensors==0.11.0"
             if BUILD_TYPE == "release"
-            else "compressed-tensors>=0.10.3a2"
+            else "compressed-tensors>=0.11.1a2"
         ),
     ],
     extras_require={
@@ -144,7 +187,7 @@ def localversion_func(version: ScmVersion) -> str:
             "trl>=0.10.1",
             "pandas<2.3.0",
             "torchvision",
-            "librosa",
+            "librosa==0.11.0",
             "soundfile",
             "torchcodec",
             # linting, formatting, and type checking
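
The new dependency entries pin an upper bound only for release builds. A minimal sketch of that pattern, assuming `BUILD_TYPE` is read from an environment variable earlier in setup.py (which this hunk does not show):

```python
import os

# Hypothetical stand-in for the BUILD_TYPE value defined earlier in setup.py.
BUILD_TYPE = os.environ.get("BUILD_TYPE", "dev")


def pin(name: str, lower: str, upper: str) -> str:
    """Bounded requirement for release builds, open-ended lower bound otherwise."""
    if BUILD_TYPE == "release":
        return f"{name}>={lower},<={upper}"
    return f"{name}>={lower}"


# e.g. pin("torch", "2.7.0", "2.8.0") -> "torch>=2.7.0,<=2.8.0" on release builds
print(pin("torch", "2.7.0", "2.8.0"))
```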

src/llmcompressor/modeling/llama4.py

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,8 @@
 from typing import Tuple
 
 import torch
+import transformers
+from packaging import version
 from transformers.models.llama4.configuration_llama4 import (
     Llama4Config,
     Llama4TextConfig,
@@ -27,6 +29,9 @@ def __init__(self, config: Llama4TextConfig, original: Llama4TextMoe):
     def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.tensor]:
         hidden_states = hidden_states.reshape(-1, self.hidden_dim)
         router_logits = self.router(hidden_states)
+        # support transformers 4.53 and greater
+        if isinstance(router_logits, tuple):
+            router_logits = router_logits[-1]
 
         router_top_value, router_indices = torch.topk(router_logits, self.top_k, dim=1)
 
@@ -41,7 +46,10 @@ def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.tens
         for i in range(self.num_experts):
             out += self.experts[i](hidden_states) * router_scores[i].reshape(-1, 1)
 
-        return out, router_scores
+        if version.parse(transformers.__version__) >= version.parse("4.54.0"):
+            return out, router_logits
+        else:
+            return out, router_scores
 
 
 class SequentialLlama4TextExperts(torch.nn.ModuleList):
