feat: add iMatrix weighted MSE observer and IMatrixGatherer by Yatimai · Pull Request #2473 · vllm-project/llm-compressor

Yatimai · 2026-03-16T18:19:56Z

SUMMARY:

Adds imatrix_mse, a new weight observer that uses per-channel activation importance (E[x²]) to weight the quantization error during range selection. Channels that carry more signal get more careful range optimization.

Two components:

IMatrixGatherer (modifiers/transform/imatrix/base.py): lifecycle trigger that inherits from QuantizationMixin. Triggers a calibration pass so the observer can collect E[x²]. Does not quantize — delegates to the subsequent QuantizationModifier / GPTQModifier.

imatrix_mse observer (observers/imatrix.py): extends the MSE grid search with importance weighting: err = sum(importance * |Q(w) - w|^p). Owns the E[x²] collection lifecycle via Observer.init(module) / Observer.detach(module). Supports CHANNEL, GROUP, TENSOR_GROUP, and BLOCK strategies via flatten_for_calibration. Falls back to uniform MSE when importance is unavailable (strict=False default).

Usage:

recipe:
  - IMatrixGatherer:
      ignore: ["lm_head"]
  - QuantizationModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            observer: imatrix_mse

Composes with GPTQ:

recipe:
  - IMatrixGatherer:
      ignore: ["lm_head"]
  - GPTQModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            observer: imatrix_mse

Results (W4A16, WikiText-2 token-level PPL, 141 chunks x 2048, open_platypus calibration 512 samples):

Llama-3.1-8B, group_size=128:

Config	PPL
FP16 baseline	6.24
RTN `memoryless_minmax`	6.96
RTN `imatrix_mse`	6.97
GPTQ	6.89
GPTQ + `imatrix_mse`	6.82
RTN `imatrix_mse` (tuned)	6.81

Llama-3.1-8B, group_size=32:

Config	PPL
RTN `memoryless_minmax`	6.74
RTN `imatrix_mse`	6.73
GPTQ	6.70
GPTQ + `imatrix_mse`	6.66
RTN `imatrix_mse` (tuned)	6.60

Llama-3.1-70B, group_size=128:

Config	PPL
FP16 baseline	2.81
RTN `memoryless_minmax`	4.47
RTN `imatrix_mse`	3.80
RTN `imatrix_mse` (tuned)	3.40

With default settings, GPTQ + imatrix_mse improves GPTQ by 0.07 PPL at gs128 and 0.04 at gs32. With tuned observer settings (norm=3.2, maxshrink=0.95, maxgrow=0.10), RTN imatrix_mse outperforms GPTQ at both group sizes (~5min gs128, ~14min gs32 vs ~35min GPTQ). On 70B, iMatrix reduces the W4 degradation from +1.66 (minmax) to +0.60 (tuned), a 2.8x improvement.

Eval method: GPTQ paper (concatenate WikiText-2 test, non-overlapping segments of 2048 tokens, exp(mean(NLLs))).

RFC #2456

TEST PLAN:

45 tests added (all CPU, no GPU required):

test_imatrix.py (26 tests): grid search with non-uniform importance, CHANNEL/GROUP/TENSOR_GROUP/BLOCK strategies, actorder g_idx reordering, strict vs non-strict fallback, validation (shape, dtype, finite, non-negative)
test_imatrix_gatherer.py (15 tests): E[x²] collection, ignore list, accumulation correctness, hook cleanup, weight immutability, on_event lifecycle
test_e2e_integration.py (4 tests): full pipeline via oneshot() on nm-testing/tinysmokellama-3.2, gatherer-only, observer fallback without gatherer, regex targets

python -m pytest tests/llmcompressor/observers/test_imatrix.py tests/llmcompressor/modifiers/transform/imatrix/ -v

gemini-code-assist · 2026-03-16T18:20:24Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quantization capabilities by introducing an importance-weighted Mean Squared Error (MSE) observer. By dynamically assessing the signal importance of each activation channel, the system can make more informed decisions during quantization range selection, leading to improved model performance with reduced precision. This new approach integrates seamlessly with existing quantization methods, offering a general boost to efficiency without sacrificing accuracy.

Highlights

New Quantization Components: Introduced two new components: IMatrixGatherer for collecting per-channel activation importance (E[x�]) and imatrix_mse observer for importance-weighted quantization error minimization.
Improved Quantization Quality: The imatrix_mse observer enhances range selection by prioritizing channels with higher signal, leading to improved perplexity (PPL) in Llama-3.1-8B models across various quantization methods like RTN, AWQ, and GPTQ.
Seamless Integration: The new iMatrix approach is additive, meaning it can be composed with existing quantization techniques (e.g., AWQ, GPTQ) to further boost efficiency and quality without degradation.
Comprehensive Testing: Added 49 new tests, including unit tests for both IMatrixGatherer and imatrix_mse observer, as well as end-to-end integration tests to ensure robust functionality and correct behavior, including fallback mechanisms.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

src/llmcompressor/modifiers/transform/imatrix/init.py
- Added base module to the imatrix package.
src/llmcompressor/modifiers/transform/imatrix/base.py
- Implemented the IMatrixGatherer class for collecting activation importance statistics.
- Defined lifecycle methods (on_initialize, on_start, on_event, on_end, on_finalize) for hook management and data accumulation.
- Included logic for resolving target modules, registering forward pre-hooks, and computing/attaching importance to modules.
src/llmcompressor/observers/init.py
- Imported the new imatrix observer module.
src/llmcompressor/observers/imatrix.py
- Implemented the IMatrixMSEObserver class, extending the MSE grid search with importance weighting.
- Added methods for preparing and validating importance weights, handling g_idx reordering, and supporting various quantization strategies.
- Included a _grid_search function for importance-weighted min/max range optimization.
tests/llmcompressor/modifiers/transform/imatrix/test_e2e_integration.py
- Added end-to-end integration tests to verify the combined functionality of IMatrixGatherer and imatrix_mse observer.
- Included tests for successful quantization, gatherer-only operation, and observer fallback without a gatherer.
tests/llmcompressor/modifiers/transform/imatrix/test_imatrix_gatherer.py
- Added unit tests for IMatrixGatherer covering importance collection, ignore list functionality, accumulation correctness, hook removal, and weight immutability.
tests/llmcompressor/observers/test_imatrix.py
- Added unit tests for IMatrixMSEObserver addressing specific bug fixes (global_scale with TENSOR_GROUP, g_idx reordering, weight-only guard) and general functionality.
- Included tests for importance validation, fallback mechanisms, and the maxgrow parameter.

Activity

No human activity (comments, reviews, progress updates) has been recorded for this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces the IMatrixGatherer and imatrix_mse observer, which together enable importance-weighted quantization. The IMatrixGatherer collects per-channel activation importance (E[x²]) using forward pre-hooks, ensuring that channels carrying more signal receive more careful range optimization during quantization. The imatrix_mse observer extends the existing MSE grid search to incorporate these importance weights, improving quantization quality. The new functionality is additive, enhancing existing quantization methods like AWQ and GPTQ without degrading quality, and offers significant performance benefits for RTN quantization. Comprehensive unit and integration tests have been added to validate the new components and their interactions.

src/llmcompressor/modifiers/transform/imatrix/base.py

github-actions · 2026-03-16T18:28:50Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

src/llmcompressor/modifiers/transform/imatrix/base.py

src/llmcompressor/observers/imatrix.py

src/llmcompressor/modifiers/transform/imatrix/base.py

src/llmcompressor/observers/imatrix.py

HDCharles

looks good, once comments are addressed i think there are a few final pieces

add an example to demonstrate the technique
either add the feature where iMatrixGatherer is prepended to the recipe if the observer is added or leave a TODO for that feature

Yatimai · 2026-03-17T16:56:41Z

All review comments addressed:

Per-token accumulation (aligned with llama.cpp)
flatten_for_calibration for importance broadcasting (handles all strategies + g_idx)
Replaced _warn with direct logger.warning() + return None
TODO for global_scale importance support
TODO for grid search refactor
Added example in examples/quantization_w4a16/
Auto-prepend TODO in gatherer docstring

HDCharles

can you update the APIs and examples across the PR code and descriptions to be a bit cleaner by using the weight_observer argument in the XModifier and making sure the defaults are reasonable for most users. Could also mutate an existing scheme e.g.

scheme = preset_name_to_scheme('W4A16', ['Linear'])
scheme.weights.observer = 'imatrix_mse'
scheme.weights.observer_kwargs = ...

...config_groups={"group_0": scheme}

probably want the simplest API for the example and can document the alternatives in a README explaining the technique and usage

src/llmcompressor/observers/imatrix.py

tests/llmcompressor/observers/test_imatrix.py

HDCharles

see remaining comments, i.e. documentation, improve API for better UX, tests, norm?

brian-dellabetta

Thanks for preparing this @Yatimai , I think things are looking on from a code standpoint. I have a few nits, and it would be good to include a README that helps users understand when and how to use this, potentially including some of the nice results you have in PR summary

src/llmcompressor/modifiers/transform/imatrix/base.py

tests/llmcompressor/modifiers/transform/imatrix/test_e2e_integration.py

tests/llmcompressor/observers/test_imatrix.py

src/llmcompressor/observers/imatrix.py

kylesayrs

From a functionality perspective, I'm a little confused how imatrix computation is actually occuring:

The IMatrixMSEObserver requires that _compute_and_attach is called in order to use the importance values. However, _compute_and_attach is only called at the sequential epoch end and the observer is never called after _compute_and_attach is called, so the importance values are never used?

From a design perspective, the implementation seems split between the IMatrixMSEObserver and IMatrixGatherer. This design is difficult to work with because it allows users to create footgun recipes where they have an imatrix observer but no imatrix modifier, and vice versa.

I recommend implementing only the IMatrixQuantizationModifier. This should help simplify the design a lot. This also prevents someone from using the IMatrixMSEObserver for activations, which doesn't really make sense.

src/llmcompressor/modifiers/transform/imatrix/base.py

src/llmcompressor/observers/imatrix.py

HDCharles · 2026-03-17T21:06:56Z

From a functionality perspective, I'm a little confused how imatrix computation is actually occuring:

The IMatrixMSEObserver requires that _compute_and_attach is called in order to use the importance values. However, _compute_and_attach is only called at the sequential epoch end and the observer is never called after _compute_and_attach is called, so the importance values are never used?

From a design perspective, the implementation seems split between the IMatrixMSEObserver and IMatrixGatherer. This design is difficult to work with because it allows users to create footgun recipes where they have an imatrix observer but no imatrix modifier, and vice versa.

I recommend implementing only the IMatrixQuantizationModifier. This should help simplify the design a lot. This also prevents someone from using the IMatrixMSEObserver for activations, which doesn't really make sense.

@kylesayrs This should compose with other modifiers so it needs to be an observer, changing that would make this pointless, it would no longer compose with GPTQ so why use this?

The issue with foot gun recipes is not hard to resolve by validating both exist or adding the gatherer if the imatrix observer is detected.

Yatimai · 2026-03-18T17:58:06Z

All review comments addressed:

Per-token accumulation
flatten_for_calibration for importance broadcasting (all strategies + g_idx)
Replaced _warn with direct logger.warning() + return None
TODO for global_scale importance support
TODO for grid search refactor
Example simplified with preset_name_to_scheme
README added in examples/quantization_w4a16/
Tests for all strategies: CHANNEL, GROUP, TENSOR_GROUP, BLOCK
Maxgrow tests removed per review
Actorder tests pruned to essential cases
E2E test with regex targets added
_imatrix_importance cleanup in on_finalize
_sums keyed by module instead of name
IMATRIX_PRECISION constant + math.prod()
Docstrings merged (file -> class)
Auto-prepend TODO in gatherer docstring

Updated benchmarks and results in the PR description.

43 tests pass. make style + make quality pass.

kylesayrs · 2026-03-18T21:11:59Z

Hey @Yatimai ,

Sorry for the confusion w.r.t. the design of this feature. I’ve synced extensively with @HDCharles and I think I have some suggestions for an approach to this feature.

The confusion and difficulty mainly stems from how this PR interacts with [WIP] [DDP] Refactor quantization lifecycle for performance. I would recommend taking the following steps:

Move all imatrix functionality to the observer. The observer should not only include the grid search implementation, but the IMatrixMSEObserver should also be responsible for adding a calibration hook to the module which is used to calculate the matrix.
1. You can add a new method Observer.init(module: Module) to the Observer base class which is called when the observer initializes here. This gives the IMatrixMSEObserver subclass the opportunity to attach a pre_forward hook to the module.
2. You can add a new method Observer.detach(module: Module) which calls hook.remove() to remove the hook you added. This method should be called when observers are deleted here.
Make sure that the observer deletes the imatrix off of the module so that it doesn’t end up in the checkpoint. This can be done during the aforementioned Observer.detach method.
Now that all the functionality is in the observer, the IMatrixGatherer does not have any functionality. However, due to how QuantizationModifier triggers weight quantization at on start, we need some sort of way to trigger a calibration pass before weight quantization. For this reason, we must keep IMatrixGatherer to trigger a forward calibration pass.

Recipe: [IMatrixGatherer, QuantizationModifier]

1. IMatrixGatherer.start
	-> IMatrixObservers are attached
2. Sequential Pipeline start
	-> module._imatrix_importance is calibrated
3. IMatrixGatherer.end
	-> Observers are removed
4. QuantizationModifier.start
	-> Observers are attached (_imatrix_importance remains on module)
	-> Weight quantization occurs (with imatrices)
5. DataFree Pipline start
	-> nothing happens
6. QuantizationModifier.end
	-> Observers are removed

IMatrixGatherer should call initialize_quantization on initialize, start_calibration on start, and end_calibration on end.

Changes After Refactor

Once [WIP] [DDP] Refactor quantization lifecycle for performance lands and weight quantization is moved to the end of the sequential epoch, we can do the following simplifications:

Move module._imatrix_importance from the module to IMatrixObserver._imatrix_importance
Update examples and docs to remove/deprecate IMatrixGatherer , leaving only IMatrixMSEObserver to implement the algorithm.
Update QuantizationConfig.requires_calibration_data to be True if using the IMatrixMSEObserver

Yatimai · 2026-03-18T21:31:11Z

Thanks @kylesayrs, this is very clear : moving the hook to the observer keeps the algorithm self-contained while the gatherer handles the lifecycle trigger. Happy to rework along these lines.

Yatimai · 2026-03-19T14:24:07Z

@kylesayrs Quick question on the gatherer: since initialize_quantization/start_calibration/end_calibration are methods of QuantizationMixin, should IMatrixGatherer inherit from QuantizationMixin and build a minimal internal config? Or would you prefer it calls the lower-level functions (initialize_observer, freeze_module_quantization) directly?

kylesayrs · 2026-03-19T14:44:36Z

@Yatimai I would suggest inheriting from QuantizationMixin

Yatimai · 2026-03-19T14:51:15Z

@kylesayrs I checked : the double apply_quantization_config is safe (clear_all_qparams resets known params first), and _imatrix_importance survives since it's not in the cleared params list. I'll proceed with IMatrixGatherer inheriting from QuantizationMixin.

Yatimai · 2026-03-19T19:46:48Z

Refactor per @kylesayrs design:

Observer.init(module) / Observer.detach(module) added to base class
IMatrixMSEObserver now owns the E[x²] hook lifecycle (init attaches, detach computes importance)
calibration.py calls init after observer creation and detach before deletion
IMatrixGatherer inherits from QuantizationMixin, delegates to initialize_quantization / start_calibration / end_calibration
_imatrix_importance persists between gatherer and quantization modifier, cleaned up in on_finalize

Benchmarks unchanged, refactor is structural only, no impact on quantization results.

45 tests pass. make style + make quality clean.

kylesayrs

I'll give a full review tomorrow, but no notes so far :)

kylesayrs

Awesome tests! Looks great to me

src/llmcompressor/modifiers/transform/imatrix/base.py

brian-dellabetta

Hi @Yatimai , thanks for all the work on this. I think the top-level API looks good, but i have some questions on the new observer api changes, if they can be resolved another way. Please see comments:

brian-dellabetta · 2026-03-18T21:05:01Z

examples/quantization_w4a16/README.md


+---
+
+## iMatrix Importance-Weighted Quantization


this documentation and the example above might be better in a separate examples/imatrix folder. it looks like this is pretty general for imatrix, and not super specific to W4A16

I will move to examples/imatrix/.

brian-dellabetta · 2026-03-24T22:09:00Z

src/llmcompressor/observers/base.py

        with align_module_device(module):
            return getattr(module, f"{self.base_name}_{name}", None)

+    def init(self, module: torch.nn.Module) -> None:


if init and detach are inverse operations, i think we should use names indicating as such

Suggested change

def init(self, module: torch.nn.Module) -> None:

def attach(self, module: torch.nn.Module) -> None:

I will rename init to attach.

brian-dellabetta · 2026-03-24T22:16:17Z

src/llmcompressor/observers/imatrix.py

+            mod._imatrix_sum.add_(token_sum)
+            mod._imatrix_count += n_tokens
+
+        module._imatrix_hook = module.register_forward_pre_hook(_hook)


i think this is the first case of attaching hooks inside the implementation of an observer? Rather than expanding the api with attach/detach on observers, can't we instead use QuantizationMixin's _initialize_observers and _initialize_hooks functions? They add observers to modules and hooks to capture input activations. Seems like that same pattern can be used here.

I'm not sure how well this Observer attach/detach logic would play with the way we use observers elsewhere in the code, for example in AWQ here (though that's only for weight observers though so maybe it's alright)

The Observer.attach/detach API was designed per @kylesayrs's suggestion to keep the E[x²] lifecycle self-contained in the observer.

Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>

HDCharles

looks good, should be ready to land after the last batch of changes

Yatimai requested review from HDCharles, dsikka and kylesayrs as code owners March 16, 2026 18:19

gemini-code-assist bot reviewed Mar 16, 2026

View reviewed changes

src/llmcompressor/modifiers/transform/imatrix/base.py Outdated Show resolved Hide resolved

Yatimai force-pushed the feat/imatrix-observer branch 2 times, most recently from 6b6cd21 to b982e23 Compare March 16, 2026 18:26