
Commit 2d54f29

Merge remote-tracking branch 'origin' into kylesayrs/transform-spinquant-r4

2 parents 714d655 + e5591f4 · commit 2d54f29
13 files changed: +283 -111 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -18,11 +18,11 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
+* **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
 * **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
-* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
-* **Updated AWQ Support:** Improved support for MoEs with better handling of larger models
 * **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 
 ### Supported Formats
@@ -62,6 +62,7 @@ Applying quantization with `llmcompressor`:
 * [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
 * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
 * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
+* [Quantizing Models Non-uniformly](examples/quantization_non_uniform/README.md)
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:

docs/index.md

Lines changed: 9 additions & 6 deletions
@@ -15,18 +15,21 @@
 
 ## Recent Updates
 
+!!! info "QuIP and SpinQuant-style Transforms"
+    The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
+
+!!! info "DeepSeekV3-style Block Quantization Support"
+    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8.md).
+
+!!! info "FP4 Quantization - now with MoE and non-uniform support"
+    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
+
 !!! info "Llama4 Quantization Support"
     Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
 
 !!! info "Large Model Support with Sequential Onloading"
     As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
 
-!!! info "Preliminary FP4 Quantization Support"
-    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4.md) and [fp4 activation support](examples/quantization_w4a4_fp4.md). Support is currently preliminary and additional support will be added for MoEs.
-
-!!! info "Updated AWQ Support"
-    Improved support for MoEs with better handling of larger models
-
 !!! info "Axolotl Sparse Finetuning Integration"
     Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).

examples/awq/llama_example.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.awq import AWQModifier
+from llmcompressor.utils import dispatch_for_generation
 
 # Select model and load it.
 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
@@ -64,6 +65,7 @@ def tokenize(sample):
 # Confirm generations of the quantized model look sane.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
examples/quantization_non_uniform/README.md (new file)

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
# Non-uniform Quantization

In certain cases, it may be useful to combine quantization schemes of different precisions and/or strategies to achieve better recovery. For example, in some decoder-only models, the `down_proj` layer has shown greater sensitivity, and performance can be improved by quantizing this layer to int8 or fp8 instead of int4 or fp4. The examples in this folder illustrate several cases of non-uniform quantization.

## Mixed-Precision Quantization

We demonstrate mixed precision by quantizing models to both int8 and int4, and in a second example, to both fp4 (specifically, nvfp4) and fp8. In both cases, we use config groups to assign higher precision to the `down_proj` layer and lower precision to the remaining linear layers. For nvfp4 and fp8, we also apply two model compressors, `nvfp4-pack-quantized` and `float-quantized`. The resulting compressed model's `config.json` shows `mixed-precision` as the value for `format`, indicating that the model has been compressed using multiple formats. The specific format applied to each set of layers is specified under each config group's `format` key.
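
Below is a minimal sketch (not part of this commit) of what such a config-group recipe can look like. It assumes the `config_groups` argument of `QuantizationModifier` accepts per-group targets and weight arguments in the compressed-tensors style; the model ID, group names, and save directory are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Two config groups: int8 weights for the sensitive down_proj layers,
# int4 (group size 128) weights for the remaining linear projections.
recipe = QuantizationModifier(
    config_groups={
        "group_int8": {
            "targets": ["re:.*down_proj"],
            "weights": {
                "num_bits": 8,
                "type": "int",
                "symmetric": True,
                "strategy": "channel",
            },
        },
        "group_int4": {
            "targets": ["re:.*self_attn.*_proj", "re:.*gate_proj", "re:.*up_proj"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
        },
    },
    ignore=["lm_head"],
)

# Apply the recipe in one shot and save in compressed-tensors format.
oneshot(model=model, recipe=recipe)
model.save_pretrained("Meta-Llama-3-8B-Instruct-W8W4-mixed", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W8W4-mixed")
```

Because `down_proj` appears only in the first group's regex targets, the two groups do not overlap, so each layer receives exactly one scheme.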
## Multiple Strategies

It may also be useful to quantize a model with two different [quantization strategies](https://github.com/neuralmagic/compressed-tensors/blob/a2bfc03e9d52824ba5d6d2a50c8741dd9bccd5d3/src/compressed_tensors/quantization/quant_args.py#L93), such as group, channel, or per-tensor. [Here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_non_uniform/quantization_fp8_multiple_strategies.py) we apply fp8 quantization where all the attention weights are quantized using the per-channel strategy, and all the MLP weights are quantized using per-tensor. This is accomplished by defining multiple config groups in the recipe. The produced model is compressed using the `float-quantized` compressor and can be run directly in vLLM.
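
As a rough illustration of the two-strategy idea, here is a hedged sketch reusing the same hypothetical config-group layout as above; it is an assumption about the shape of such a recipe, not the exact contents of the linked example.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# fp8 weights everywhere, but per-channel weight scales for attention
# projections and per-tensor weight scales for the MLP projections.
fp8 = {"num_bits": 8, "type": "float", "symmetric": True}

recipe = QuantizationModifier(
    config_groups={
        "attn_fp8_channel": {
            "targets": ["re:.*self_attn.*_proj"],
            "weights": {**fp8, "strategy": "channel"},
            "input_activations": {**fp8, "strategy": "token", "dynamic": True},
        },
        "mlp_fp8_tensor": {
            "targets": ["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            "weights": {**fp8, "strategy": "tensor"},
            "input_activations": {**fp8, "strategy": "token", "dynamic": True},
        },
    },
    ignore=["lm_head"],
)
```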
New example file (Qwen2.5-VL FP8-Dynamic)

Lines changed: 37 additions & 0 deletions

@@ -0,0 +1,37 @@
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-channel scales via ptq
# * quantize the activations to fp8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:visual.*", "re:model.visual.*"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

examples/transform/quip_example.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 # * apply spinquant transforms to model in order to make quantization easier
 # * quantize the weights to 4 bit with a group size 128
 recipe = [
-    QuIPModifier(transform_type="random-hadamard"),
+    QuIPModifier(targets="Linear", transform_type="random-hadamard"),
     QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
 ]
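
For context, a minimal end-to-end sketch of how this recipe is typically driven, assumed from the load/oneshot/generate/save flow used by the other examples in this commit rather than taken from quip_example.py itself; the model ID, save directory, and the `llmcompressor.modifiers.transform` import path are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # import path assumed
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Inject Hadamard transforms, then quantize weights to 4 bits (group size 128).
recipe = [
    QuIPModifier(targets="Linear", transform_type="random-hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
oneshot(model=model, recipe=recipe)

# Sanity-check a generation, then save in compressed-tensors format.
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
print(tokenizer.decode(model.generate(input_ids, max_new_tokens=20)[0]))

SAVE_DIR = MODEL_ID.split("/")[-1] + "-quip-w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```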

setup.py

Lines changed: 58 additions & 15 deletions
@@ -110,24 +110,67 @@ def localversion_func(version: ScmVersion) -> str:
         "src", include=["llmcompressor", "llmcompressor.*"], exclude=["*.__pycache__.*"]
     ),
     install_requires=[
-        "loguru>=0.7.2",
-        "pyyaml>=5.0.0",
+        (
+            "loguru>=0.7.2,<=0.7.3"
+            if BUILD_TYPE == "release"
+            else "loguru>=0.7.2"
+        ),
+        (
+            "pyyaml>=6.0.1,<=6.0.2"
+            if BUILD_TYPE == "release"
+            else "pyyaml>=6.0.1"
+        ),
         # librosa dependency numba is currently not compatible with numpy>=2.3
         # https://numba.readthedocs.io/en/stable/user/installing.html#version-support-information
-        "numpy>=1.17.0,<2.3",
-        "requests>=2.0.0",
-        "tqdm>=4.0.0",
-        # torch 1.10 and 1.11 do not support quantized onnx export
-        "torch>=1.7.0,!=1.10,!=1.11",
-        "transformers>4.0",
-        "datasets>=3.0.0",
-        "accelerate>=0.20.3,!=1.1.0",
-        "pynvml>=11.5.3",
-        "pillow>=10.4.0",
+        (
+            "numpy>=2.0.0,<=2.3.2"
+            if BUILD_TYPE == "release"
+            else "numpy>=2.0.0"
+        ),
+        (
+            "requests>=2.32.2,<=2.32.5"
+            if BUILD_TYPE == "release"
+            else "requests>=2.32.2"
+        ),
+        (
+            "tqdm>=4.66.3,<=4.67.1"
+            if BUILD_TYPE == "release"
+            else "tqdm>=4.66.3"
+        ),
+        (
+            "torch>=2.7.0,<=2.8.0"
+            if BUILD_TYPE == "release"
+            else "torch>=2.7.0"
+        ),
+        (
+            "transformers>=4.53.0,<=4.55.2"
+            if BUILD_TYPE == "release"
+            else "transformers>=4.53.0"
+        ),
+        (
+            "datasets>=4.0.0,<=4.0.0"
+            if BUILD_TYPE == "release"
+            else "datasets>=4.0.0"
+        ),
+        (
+            "accelerate>=1.6.0,<=1.10.0"
+            if BUILD_TYPE == "release"
+            else "accelerate>=1.6.0"
+        ),
+        (
+            "pynvml>=11.5.3,<=12.0.0"
+            if BUILD_TYPE == "release"
+            else "pynvml>=11.5.3"
+        ),
+        (
+            "pillow>=10.4.0,<=10.4.0"
+            if BUILD_TYPE == "release"
+            else "pillow>=10.4.0"
+        ),
         (
-            "compressed-tensors==0.10.2"
+            "compressed-tensors==0.11.0"
             if BUILD_TYPE == "release"
-            else "compressed-tensors>=0.10.3a2"
+            else "compressed-tensors>=0.11.1a2"
         ),
     ],
     extras_require={
@@ -144,7 +187,7 @@ def localversion_func(version: ScmVersion) -> str:
             "trl>=0.10.1",
             "pandas<2.3.0",
             "torchvision",
-            "librosa",
+            "librosa==0.11.0",
             "soundfile",
             "torchcodec",
             # linting, formatting, and type checking
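
The new dependency entries pin an upper bound only for release builds. A minimal sketch of that pattern, assuming `BUILD_TYPE` is read from an environment variable earlier in setup.py (which this hunk does not show):

```python
import os

# Hypothetical stand-in for the BUILD_TYPE value defined earlier in setup.py.
BUILD_TYPE = os.environ.get("BUILD_TYPE", "dev")


def pin(name: str, lower: str, upper: str) -> str:
    """Bounded requirement for release builds, open-ended lower bound otherwise."""
    if BUILD_TYPE == "release":
        return f"{name}>={lower},<={upper}"
    return f"{name}>={lower}"


# e.g. pin("torch", "2.7.0", "2.8.0") -> "torch>=2.7.0,<=2.8.0" on release builds
print(pin("torch", "2.7.0", "2.8.0"))
```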

src/llmcompressor/modeling/llama4.py

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,8 @@
 from typing import Tuple
 
 import torch
+import transformers
+from packaging import version
 from transformers.models.llama4.configuration_llama4 import (
     Llama4Config,
     Llama4TextConfig,
@@ -27,6 +29,9 @@ def __init__(self, config: Llama4TextConfig, original: Llama4TextMoe):
     def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.tensor]:
         hidden_states = hidden_states.reshape(-1, self.hidden_dim)
         router_logits = self.router(hidden_states)
+        # support transformers 4.53 and greater
+        if isinstance(router_logits, tuple):
+            router_logits = router_logits[-1]
 
         router_top_value, router_indices = torch.topk(router_logits, self.top_k, dim=1)
 
@@ -41,7 +46,10 @@ def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.tens
         for i in range(self.num_experts):
             out += self.experts[i](hidden_states) * router_scores[i].reshape(-1, 1)
 
-        return out, router_scores
+        if version.parse(transformers.__version__) >= version.parse("4.54.0"):
+            return out, router_logits
+        else:
+            return out, router_scores
 
 
 class SequentialLlama4TextExperts(torch.nn.ModuleList):
