
Commit 21cb4f0

Merge branch 'main' into INFERENG-1867
2 parents d564e40 + b4a99d4

35 files changed: +772 −112 lines

Makefile

Lines changed: 4 additions & 1 deletion
```diff
@@ -28,10 +28,13 @@ quality:
     ruff format --check $(CHECKDIRS);
 
 # style the code according to accepted standards for the repo
+# Note: We run `ruff format` twice. Once to fix long lines before lint check
+# and again to fix any formatting issues introduced by ruff check --fix
 style:
     @echo "Running python styling";
+    ruff format $(CHECKDIRS);
     ruff check --fix $(CHECKDIRS);
-    ruff format $(CHECKDIRS);
+    ruff format --silent $(CHECKDIRS);
 
 # run tests for the repo
 test:
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -37,6 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Qwen3 Next and Qwen3 VL MoE Quantization Support**: Quantize the Qwen3 Next and Qwen3 VL MoE models and seamlessly run the models in vLLM. Examples for [NVFP4](examples/quantization_w4a4_fp4/qwen3_next_example.py) and [FP8](examples/quantization_w8a8_fp8/qwen3_next_example.py) Quantization have been added for the Qwen3-Next-80B-A3B-Instruct. For the Qwen3 VL MoE, support has been added for the data-free pathway, specifically [FP8 Quantization](examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py) (e.g., channel-wise and block-wise quantization). NOTE: these models are not supported in transformers<=4.56.2. You may need to install transformers from source.
 * **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py).
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
```
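
The "Quantization with Multiple Modifiers" bullet above describes pairing two modifiers whose `targets` partition the model. As a rough sketch only (the canonical version lives in the linked `quantization_multiple_modifiers.py` example; the argument values here are illustrative and not taken from this commit):

```python
# Hedged sketch of a mixed-precision recipe; see
# examples/quantization_non_uniform/quantization_multiple_modifiers.py for the
# maintained example. Argument values below are illustrative.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # AWQ W4A16 on the attention projections
    AWQModifier(targets=["re:.*self_attn.*"], scheme="W4A16", ignore=["lm_head"]),
    # GPTQ W8A8 on the MLP layers
    GPTQModifier(targets=["re:.*mlp.*"], scheme="W8A8", ignore=["lm_head"]),
]
# The recipe would then be passed to llmcompressor.oneshot(model=..., recipe=recipe, ...)
```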

docs/index.md

Lines changed: 23 additions & 6 deletions
```diff
@@ -13,6 +13,29 @@
 <img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/>
 </p>
 
+## New in this release
+
+Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.8.0) for details about new features. Highlights include:
+
+!!! info "Support for multiple modifiers in oneshot compression runs"
+    LLM Compressor now supports using multiple modifiers in oneshot compression runs, such as applying both AWQ and GPTQ to a single model.
+
+    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](examples/quantization_non_uniform/) for more detail and example usage.
+
+!!! info "Quantization and calibration support for Qwen3 models"
+    Quantization and calibration support for Qwen3 Next models has been added to LLM Compressor.
+
+    LLM Compressor now supports quantization for Qwen3 Next and Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways requiring data, such as W4A16 and NVFP4, are planned for a future release.
+
+    Examples for NVFP4 and FP8 quantization have been added for the Qwen3-Next-80B-A3B-Instruct model.
+
+    For the Qwen3 VL MoE model, support has been added for the data-free pathway. The data-free pathway applies FP8 quantization, for example, channel-wise and block-wise quantization.
+
+    **NOTE**: These models are not supported in transformers<=4.56.2. You may need to install transformers from source.
+
+!!! info "Transforms support for non-full-size rotation sizes"
+    You can now set a `transform_block_size` field in the Transform-based modifier classes `SpinQuantModifier` and `QuIPModifier`. You can configure transforms of variable size with this field, and you don't need to restrict Hadamards to match the size of the weight.
+
 ## Recent Updates
 
 !!! info "QuIP and SpinQuant-style Transforms"
@@ -27,12 +50,6 @@
 !!! info "Llama4 Quantization Support"
     Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
 
-!!! info "Large Model Support with Sequential Onloading"
-    As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
-
-!!! info "Axolotl Sparse Finetuning Integration"
-    Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
-
 For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).
 
 ## Key Features
```
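
Of the items added above, only the transforms note introduces a new user-facing field, `transform_block_size`. A minimal sketch of how it could be set follows; the `QuIPModifier` import path and the companion `QuantizationModifier` arguments are assumptions, not taken from this commit:

```python
# Hedged sketch: transform_block_size comes from the release notes above; the
# import path and the surrounding recipe are assumptions.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

recipe = [
    # Rotation blocks of size 64 instead of Hadamards sized to the full weight dimension
    QuIPModifier(transform_block_size=64),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```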

examples/multimodal_vision/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -37,7 +37,7 @@ recipe = [
         targets="Linear",
         scheme="W4A16",
         sequential_targets=["MistralDecoderLayer"],
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 ```
````
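
The one-line change above, and the matching edits in the example scripts below, prepend `.*` to the vision-tower and projector patterns. Assuming the `re:` ignore patterns are anchored at the start of the module name (i.e. `re.match`-style semantics), the old patterns missed modules nested under a prefix such as `model.`:

```python
import re

name = "model.vision_tower.encoder.layers.0.mlp.fc1"

# Old pattern: only matches names that start with "vision_tower",
# so the nested module above would not be ignored.
print(bool(re.match(r"vision_tower.*", name)))    # False

# New pattern: matches "vision_tower" anywhere in the module path.
print(bool(re.match(r".*vision_tower.*", name)))  # True
```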

examples/multimodal_vision/llama4_example.py

Lines changed: 7 additions & 5 deletions
```diff
@@ -52,9 +52,11 @@ def preprocess_function(example):
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        )
         for key, value in batch[0].items()
     }
 
@@ -67,8 +69,8 @@ def data_collator(batch):
         "re:.*lm_head",
         "re:.*self_attn",
         "re:.*router",
-        "re:vision_model.*",
-        "re:multi_modal_projector.*",
+        "re:.*vision_model.*",
+        "re:.*multi_modal_projector.*",
         "Llama4TextAttention",
     ],
 )
```

examples/multimodal_vision/llava_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/multimodal_vision/mistral3_example.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -31,9 +31,11 @@
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=model.dtype)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=model.dtype)
+        )
         for key, value in batch[0].items()
     }
 
@@ -43,7 +45,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/multimodal_vision/mllama_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
+        ignore=["re:.*lm_head", "re:.*multi_modal_projector.*", "re:.*vision_model.*"],
     ),
 ]
 
```

examples/multimodal_vision/pixtral_example.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -36,7 +36,7 @@ def data_collator(batch):
     GPTQModifier(
         targets="Linear",
         scheme="W4A16",
-        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
+        ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
     ),
 ]
 
```

examples/quantization_w4a4_fp4/llama4_example.py

Lines changed: 7 additions & 5 deletions
```diff
@@ -52,9 +52,11 @@ def preprocess_function(example):
 def data_collator(batch):
     assert len(batch) == 1
     return {
-        key: torch.tensor(value)
-        if key != "pixel_values"
-        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        key: (
+            torch.tensor(value)
+            if key != "pixel_values"
+            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        )
         for key, value in batch[0].items()
     }
 
@@ -67,8 +69,8 @@ def data_collator(batch):
         "re:.*lm_head",
         "re:.*self_attn",
         "re:.*router",
-        "re:vision_model.*",
-        "re:multi_modal_projector.*",
+        "re:.*vision_model.*",
+        "re:.*multi_modal_projector.*",
         "Llama4TextAttention",
     ],
 )
```
