
Commit 810d7fd (1 parent: cf3bd64)

feat: add iMatrix weighted MSE observer and IMatrixGatherer

- `imatrix_mse` observer with E[x²] importance weighting
- `IMatrixGatherer` transform using `match_named_modules` + CPU offload
- Unit tests for observer and gatherer
- E2E integration tests

RFC #2456

Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>

File tree

12 files changed: +1471 −0 lines changed

examples/quantization_w4a16/README.md

Lines changed: 99 additions & 0 deletions
@@ -138,6 +138,105 @@ We can see the resulting scores look good!
| | |strict-match | 5|exact_match||0.720|± |0.0285|
```

---

## iMatrix Importance-Weighted Quantization

`imatrix_mse` is an observer that uses per-channel activation importance (E[x²]) to weight quantization error during range selection. Channels that carry more signal get more careful range optimization.

Two components work together:

- **`IMatrixGatherer`**: triggers a calibration pass so the observer can collect importance data
- **`imatrix_mse` observer**: collects E[x²] per input channel via forward pre-hooks and uses importance weighting in the MSE grid search: `err = sum(importance * |Q(w) - w|^p)`

> See [RFC #2456](https://github.com/vllm-project/llm-compressor/discussions/2456) for the full design discussion.
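The weighted error above can be made concrete with a small sketch. This is not the library's implementation, just an illustration of how importance reweights the range grid search (symmetric 4-bit, per-tensor for brevity; `weighted_mse_scale` and its arguments are hypothetical names):

```python
import torch

def weighted_mse_scale(w: torch.Tensor, importance: torch.Tensor,
                       bits: int = 4, maxshrink: float = 0.2,
                       grid: int = 20, norm: float = 2.4) -> torch.Tensor:
    """Grid-search a shrunken absmax that minimizes importance-weighted error."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    absmax = w.abs().max()
    best_err, best_scale = float("inf"), absmax / qmax
    for i in range(grid + 1):
        scale = (1 - maxshrink * i / grid) * absmax / qmax
        # Fake-quantize with the candidate scale, then score the reconstruction
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # err = sum(importance * |Q(w) - w|^norm), the weighting described above
        err = (importance * (q - w).abs().pow(norm)).sum().item()
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale

torch.manual_seed(0)
w = torch.randn(64)
importance = torch.rand(64) + 0.1   # stands in for E[x²] per input channel
scale = weighted_mse_scale(w, importance)
```

High-importance channels dominate the error sum, so the chosen range is the one that reconstructs those channels best.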
### Usage

```bash
python3 llama3_imatrix_example.py
```

The simplest setup uses `preset_name_to_scheme` to configure W4A16 and swaps in the `imatrix_mse` observer:

```python
from compressed_tensors.quantization import preset_name_to_scheme
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform.imatrix import IMatrixGatherer

scheme = preset_name_to_scheme("W4A16", ["Linear"])
scheme.weights.observer = "imatrix_mse"

recipe = [
    IMatrixGatherer(ignore=["lm_head"]),
    QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    ),
]
```
### Composing with GPTQ

iMatrix composes with GPTQ by providing importance-weighted ranges for the Hessian-based rounding:

```python
from llmcompressor.modifiers.gptq import GPTQModifier

scheme = preset_name_to_scheme("W4A16", ["Linear"])
scheme.weights.observer = "imatrix_mse"

recipe = [
    IMatrixGatherer(ignore=["lm_head"]),
    GPTQModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    ),
]
```
### Results

W4A16, Llama-3.1-8B, WikiText-2 token-level perplexity (141 chunks × 2048 tokens):

**group_size=128:**

| Config | PPL |
|---|---|
| FP16 baseline | 6.24 |
| RTN `memoryless_minmax` | 6.96 |
| RTN `imatrix_mse` | 6.97 |
| GPTQ | 6.89 |
| GPTQ + `imatrix_mse` | 6.82 |

**group_size=32:**

| Config | PPL |
|---|---|
| RTN `memoryless_minmax` | 6.74 |
| RTN `imatrix_mse` | 6.73 |
| GPTQ | 6.70 |
| GPTQ + `imatrix_mse` | 6.66 |

GPTQ + `imatrix_mse` gives the best result at both group sizes with default observer settings; in every other configuration, iMatrix matches the baseline within 0.01 PPL.
### Observer Parameters

The observer accepts optional `observer_kwargs` for fine-tuning:

| Parameter | Default | Description |
|---|---|---|
| `norm` | 2.4 | Error exponent (`\|Q(w) - w\|^norm`) |
| `maxshrink` | 0.20 | Max fraction to shrink the range |
| `grid` | 20 | Number of grid search steps |
| `patience` | 5 | Early stopping after N steps without improvement |
| `maxgrow` | 0.0 | Max fraction to grow the range beyond observed min/max |

The defaults work well for GPTQ composition. For RTN, increasing `maxshrink` (e.g. 0.80) allows the observer to optimize ranges more aggressively:

```python
scheme.weights.observer_kwargs = {"maxshrink": 0.80}
```
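For an RTN run end to end, that tuning slots into the recipe from the Usage section like this (a sketch assembled from the snippets already shown; it was not used to produce the reported results):

```python
from compressed_tensors.quantization import preset_name_to_scheme

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform.imatrix import IMatrixGatherer

scheme = preset_name_to_scheme("W4A16", ["Linear"])
scheme.weights.observer = "imatrix_mse"
# More aggressive range shrinking for RTN; the defaults suit GPTQ composition
scheme.weights.observer_kwargs = {"maxshrink": 0.80}

recipe = [
    IMatrixGatherer(ignore=["lm_head"]),
    QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    ),
]
```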
### Questions or Feature Requests?

Please open an issue on `vllm-project/llm-compressor`
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
from compressed_tensors.offload import dispatch_model
from compressed_tensors.quantization import preset_name_to_scheme
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform.imatrix import IMatrixGatherer

# Select model and load it.
model_id = "meta-llama/Meta-Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "open_platypus"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
#   * trigger a calibration pass with IMatrixGatherer so the observer can collect E[x²]
#   * quantize the weights to 4 bit with group size 128
#   * use imatrix_mse observer to weight quantization error by channel importance
scheme = preset_name_to_scheme("W4A16", ["Linear"])
scheme.weights.observer = "imatrix_mse"

recipe = [
    IMatrixGatherer(ignore=["lm_head"]),
    QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": "train[:5%]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-imatrix"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

src/llmcompressor/modifiers/quantization/calibration.py

Lines changed: 2 additions & 0 deletions
@@ -82,6 +82,7 @@ def initialize_observer(
        observer, base_name=base_name, args=args, module=module
    )
    module.register_module(f"{base_name}_observer", observer)
+   observer.init(module)


def call_observer(

@@ -264,6 +265,7 @@ def freeze_module_quantization(module: Module):
    for name in ("input", "weight", "output", "q", "k", "v"):
        obs_name = f"{name}_observer"
        if hasattr(module, obs_name):
+           getattr(module, obs_name).detach(module)
            delattr(module, obs_name)

    module.quantization_status = QuantizationStatus.FROZEN
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# ruff: noqa

from .base import *
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
from typing import List, Union

from compressed_tensors.quantization import disable_quantization
from compressed_tensors.utils import match_named_modules
from pydantic import Field

from llmcompressor.core import Event, State
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization.quantization.mixin import QuantizationMixin

__all__ = ["IMatrixGatherer"]


class IMatrixGatherer(Modifier, QuantizationMixin):
    """
    Lifecycle trigger for iMatrix importance collection.

    Triggers a calibration pass so that ``IMatrixMSEObserver`` can collect
    E[x²] via its ``init()`` hook. Does **not** quantize weights — the
    actual quantization is done by the subsequent
    ``QuantizationModifier`` / ``GPTQModifier``.

    The observer's ``detach()`` method computes ``_imatrix_importance``
    from the accumulated statistics and leaves it on the module for the
    next quantization pass to consume.

    Example recipe::

        recipe:
          - IMatrixGatherer:
              ignore: ["lm_head"]
          - QuantizationModifier:
              config_groups:
                group_0:
                  targets: ["Linear"]
                  weights:
                    observer: imatrix_mse

    Or composed with GPTQ::

        recipe:
          - IMatrixGatherer:
              ignore: ["lm_head"]
          - GPTQModifier:
              config_groups:
                group_0:
                  targets: ["Linear"]
                  weights:
                    observer: imatrix_mse

    .. note::
        Auto-prepend (inserting the gatherer automatically when
        ``imatrix_mse`` is detected in a recipe) is planned for a
        follow-up PR.

    :param scheme: quantization preset used to build the internal config.
        Defaults to ``"W4A16"``. The actual bit-width does not matter
        because weights are never quantized by this modifier.
    :param weight_observer: observer to attach during calibration.
        Must be ``"imatrix_mse"`` (default).
    :param ignore: layer name patterns to skip (default: ``["lm_head"]``)
    :param targets: module types to instrument (default: ``["Linear"]``)
    """

    scheme: str = "W4A16"
    weight_observer: str = "imatrix_mse"
    ignore: List[str] = Field(default_factory=lambda: ["lm_head"])
    targets: Union[str, List[str]] = Field(default_factory=lambda: ["Linear"])

    # ------------------------------------------------------------------ #
    # Lifecycle
    # ------------------------------------------------------------------ #

    def on_initialize(self, state: State, **kwargs) -> bool:
        QuantizationMixin.initialize_quantization(self, state.model)
        return True

    def on_start(self, state: State, event: Event, **kwargs):
        self.started_ = True
        QuantizationMixin.start_calibration(self, state.model)
        # Disable quantized forward — we only need observer hooks for E[x²]
        state.model.apply(disable_quantization)

    def on_end(self, state: State, event: Event, **kwargs):
        self.ended_ = True
        QuantizationMixin.end_calibration(self, state.model)
        # Disable quantized forward so the model is clean for the next modifier
        state.model.apply(disable_quantization)

    def on_finalize(self, state: State, **kwargs) -> bool:
        if not self.ended_:
            self.on_end(state, None)

        # Clean up importance tensors so they don't end up in the checkpoint
        for _, module in match_named_modules(
            state.model, self.resolved_targets, self.ignore
        ):
            if hasattr(module, "_imatrix_importance"):
                del module._imatrix_importance

        return True

src/llmcompressor/observers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -14,3 +14,4 @@
from .moving_base import *
from .min_max import *
from .mse import *
+from .imatrix import *

src/llmcompressor/observers/base.py

Lines changed: 18 additions & 0 deletions
@@ -133,6 +133,24 @@ def _get_module_param(self, name: str) -> Optional[torch.nn.Parameter]:
        with align_module_device(module):
            return getattr(module, f"{self.base_name}_{name}", None)

+   def init(self, module: torch.nn.Module) -> None:
+       """
+       Called when the observer is attached to a module.
+       Subclasses can override to register hooks or initialize state.
+
+       :param module: the module this observer is being attached to
+       """
+       pass
+
+   def detach(self, module: torch.nn.Module) -> None:
+       """
+       Called before the observer is deleted from a module.
+       Subclasses can override to remove hooks and clean up module attributes.
+
+       :param module: the module this observer is being removed from
+       """
+       pass
+
    def _check_has_global_scale(self, global_scale: Optional[torch.nn.Parameter]):
        if (
            self.args.strategy == QuantizationStrategy.TENSOR_GROUP
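The new `init()` / `detach()` hooks make observer lifecycles extensible. As an illustration of the pattern (a hypothetical minimal observer, not the committed `imatrix_mse` implementation), a subclass can register a forward pre-hook in `init()` to accumulate E[x²] and publish `_imatrix_importance` in `detach()`:

```python
import torch

class ImportanceObserver:
    """Hypothetical observer using init()/detach(); not the committed imatrix_mse."""

    def __init__(self):
        self._handle = None
        self._sum_sq = None
        self._count = 0

    def init(self, module: torch.nn.Module) -> None:
        # Forward pre-hook accumulates sum(x²) per input channel
        def pre_hook(mod, inputs):
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1]).float()
            self._sum_sq = (
                x.pow(2).sum(dim=0)
                if self._sum_sq is None
                else self._sum_sq + x.pow(2).sum(dim=0)
            )
            self._count += x.shape[0]

        self._handle = module.register_forward_pre_hook(pre_hook)

    def detach(self, module: torch.nn.Module) -> None:
        # Remove the hook and leave E[x²] on the module for the next pass
        if self._handle is not None:
            self._handle.remove()
        if self._count:
            module._imatrix_importance = self._sum_sq / self._count

linear = torch.nn.Linear(8, 4)
obs = ImportanceObserver()
obs.init(linear)
linear(torch.randn(16, 8))     # calibration forward pass
obs.detach(linear)
print(linear._imatrix_importance.shape)   # torch.Size([8])
```

This mirrors the calibration wiring above: `initialize_observer` calls `init()` on attach, and `freeze_module_quantization` calls `detach()` before deleting the observer.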
