
Commit 21cb4f0

Merge branch 'main' into INFERENG-1867
2 parents d564e40 + b4a99d4

35 files changed: +772 −112 lines

Makefile

Lines changed: 4 additions & 1 deletion
```diff
@@ -28,10 +28,13 @@ quality:
     ruff format --check $(CHECKDIRS);
 
 # style the code according to accepted standards for the repo
+# Note: We run `ruff format` twice. Once to fix long lines before lint check
+# and again to fix any formatting issues introduced by ruff check --fix
 style:
     @echo "Running python styling";
+    ruff format $(CHECKDIRS);
     ruff check --fix $(CHECKDIRS);
-    ruff format $(CHECKDIRS);
+    ruff format --silent $(CHECKDIRS);
 
 # run tests for the repo
 test:
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -37,6 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Qwen3 Next and Qwen3 VL MoE Quantization Support**: Quantize the Qwen3 Next and Qwen3 VL MoE models and seamlessly run the models in vLLM. Examples for [NVFP4](examples/quantization_w4a4_fp4/qwen3_next_example.py) and [FP8](examples/quantization_w8a8_fp8/qwen3_next_example.py) Quantization have been added for the Qwen3-Next-80B-A3B-Instruct. For the Qwen3 VL MoE, support has been added for the data-free pathway, specifically [FP8 Quantization](examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py) (e.g., channel-wise and block-wise quantization). NOTE: these models are not supported in transformers<=4.56.2. You may need to install transformers from source.
 * **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py).
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
```
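
The "Quantization with Multiple Modifiers" bullet above describes pairing two modifiers whose `targets` partition the model. As a rough sketch only (the canonical version lives in the linked `quantization_multiple_modifiers.py` example; the argument values here are illustrative and not taken from this commit):

```python
# Hedged sketch of a mixed-precision recipe; see
# examples/quantization_non_uniform/quantization_multiple_modifiers.py for the
# maintained example. Argument values below are illustrative.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # AWQ W4A16 on the attention projections
    AWQModifier(targets=["re:.*self_attn.*"], scheme="W4A16", ignore=["lm_head"]),
    # GPTQ W8A8 on the MLP layers
    GPTQModifier(targets=["re:.*mlp.*"], scheme="W8A8", ignore=["lm_head"]),
]
# The recipe would then be passed to llmcompressor.oneshot(model=..., recipe=recipe, ...)
```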

docs/index.md

Lines changed: 23 additions & 6 deletions
```diff
@@ -13,6 +13,29 @@
 <img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/>
 </p>
 
+## New in this release
+
+Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.8.0) for details about new features. Highlights include:
+
+!!! info "Support for multiple modifiers in oneshot compression runs"
+    LLM Compressor now supports using multiple modifiers in oneshot compression runs, such as applying both AWQ and GPTQ to a single model.
+
+    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](examples/quantization_non_uniform/) for more detail and example usage.
+
+!!! info "Quantization and calibration support for Qwen3 models"
+    Quantization and calibration support for Qwen3 Next models has been added to LLM Compressor.
+
+    LLM Compressor now supports quantization for Qwen3 Next and Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways requiring data, such as W4A16 and NVFP4, are planned for a future release.
+
+    Examples for NVFP4 and FP8 quantization have been added for the Qwen3-Next-80B-A3B-Instruct model.
+
+    For the Qwen3 VL MoE model, support has been added for the data-free pathway. The data-free pathway applies FP8 quantization, for example, channel-wise and block-wise quantization.
+
+    **NOTE**: These models are not supported in transformers<=4.56.2. You may need to install transformers from source.
+
+!!! info "Transforms support for non-full-size rotation sizes"
+    You can now set a `transform_block_size` field in the Transform-based modifier classes `SpinQuantModifier` and `QuIPModifier`. You can configure transforms of variable size with this field, and you don't need to restrict Hadamards to match the size of the weight.
+
 ## Recent Updates
 
 !!! info "QuIP and SpinQuant-style Transforms"
@@ -27,12 +50,6 @@
 !!! info "Llama4 Quantization Support"
     Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
 
-!!! info "Large Model Support with Sequential Onloading"
-    As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
-
-!!! info "Axolotl Sparse Finetuning Integration"
-    Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
-
 For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).
 
 ## Key Features
```
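
Of the items added above, only the transforms note introduces a new user-facing field, `transform_block_size`. A minimal sketch of how it could be set follows; the `QuIPModifier` import path and the companion `QuantizationModifier` arguments are assumptions, not taken from this commit:

```python
# Hedged sketch: transform_block_size comes from the release notes above; the
# import path and the surrounding recipe are assumptions.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

recipe = [
    # Rotation blocks of size 64 instead of Hadamards sized to the full weight dimension
    QuIPModifier(transform_block_size=64),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```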

examples/multimodal_vision/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -37,7 +37,7 @@ recipe = [
         targets="Linear",
         scheme="W4A16",
         sequential_targets=["MistralDecoderLayer"],
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 ```
````
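
The one-line change above, and the matching edits in the example scripts below, prepend `.*` to the vision-tower and projector patterns. Assuming the `re:` ignore patterns are anchored at the start of the module name (i.e. `re.match`-style semantics), the old patterns missed modules nested under a prefix such as `model.`:

```python
import re

name = "model.vision_tower.encoder.layers.0.mlp.fc1"

# Old pattern: only matches names that start with "vision_tower",
# so the nested module above would not be ignored.
print(bool(re.match(r"vision_tower.*", name)))    # False

# New pattern: matches "vision_tower" anywhere in the module path.
print(bool(re.match(r".*vision_tower.*", name)))  # True
```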

examples/multimodal_vision/llama4_example.py

Lines changed: 7 additions & 5 deletions
```diff
@@ -52,9 +52,11 @@ def preprocess_function(example):
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        )
         for key, value in batch[0].items()
     }
 
@@ -67,8 +69,8 @@ def data_collator(batch):
         "re:.*lm_head",
         "re:.*self_attn",
         "re:.*router",
-        "re:vision_model.*",
-        "re:multi_modal_projector.*",
+        "re:.*vision_model.*",
+        "re:.*multi_modal_projector.*",
         "Llama4TextAttention",
     ],
 )
```

examples/multimodal_vision/llava_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/multimodal_vision/mistral3_example.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -31,9 +31,11 @@
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=model.dtype)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=model.dtype)
+        )
         for key, value in batch[0].items()
     }
 
@@ -43,7 +45,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/multimodal_vision/mllama_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
+        ignore=["re:.*lm_head", "re:.*multi_modal_projector.*", "re:.*vision_model.*"],
     ),
 ]
 
```

examples/multimodal_vision/pixtral_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -36,7 +36,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/quantization_w4a4_fp4/llama4_example.py

Lines changed: 7 additions & 5 deletions
```diff
@@ -52,9 +52,11 @@ def preprocess_function(example):
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        )
         for key, value in batch[0].items()
     }
 
@@ -67,8 +69,8 @@ def data_collator(batch):
         "re:.*lm_head",
         "re:.*self_attn",
         "re:.*router",
-        "re:vision_model.*",
-        "re:multi_modal_projector.*",
+        "re:.*vision_model.*",
+        "re:.*multi_modal_projector.*",
         "Llama4TextAttention",
     ],
 )
```
