feat: add iMatrix weighted MSE observer and IMatrixGatherer #2473

Open

Yatimai wants to merge 2 commits into vllm-project:main from Yatimai:feat/imatrix-observer

Conversation


@Yatimai Yatimai commented Mar 16, 2026

SUMMARY:

Adds imatrix_mse, a new weight observer that uses per-channel activation importance (E[x²]) to weight the quantization error during range selection. Channels that carry more signal get more careful range optimization.

Two components:

IMatrixGatherer (modifiers/transform/imatrix/base.py): lifecycle trigger that inherits from QuantizationMixin. Triggers a calibration pass so the observer can collect E[x²]. Does not quantize — delegates to the subsequent QuantizationModifier / GPTQModifier.

imatrix_mse observer (observers/imatrix.py): extends the MSE grid search with importance weighting: err = sum(importance * |Q(w) - w|^p). Owns the E[x²] collection lifecycle via Observer.init(module) / Observer.detach(module). Supports CHANNEL, GROUP, TENSOR_GROUP, and BLOCK strategies via flatten_for_calibration. Falls back to uniform MSE when importance is unavailable (strict=False default).
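As a standalone sketch of the importance-weighted range search described above (all names, defaults, and the shrink schedule are illustrative assumptions, not the PR's code):

```python
# Illustrative sketch of importance-weighted MSE range selection.
# Function names and the search schedule are assumptions, not the PR's code.

def quantize_dequantize(w, lo, hi, num_bits=4):
    """Fake-quantize w onto 2**num_bits uniform levels spanning [lo, hi]."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1e-12  # guard against a degenerate range
    return [min(max(round((x - lo) / scale), 0), levels) * scale + lo for x in w]

def weighted_mse_range(w, importance=None, norm=2.4, steps=20, maxshrink=0.8):
    """Shrink the [min(w), max(w)] range and keep the candidate minimizing
    err = sum(importance * |Q(w) - w| ** norm), per the PR description."""
    importance = importance or [1.0] * len(w)  # uniform fallback (strict=False)
    lo0, hi0 = min(w), max(w)
    best_err, best = float("inf"), (lo0, hi0)
    for i in range(steps):
        shrink = 1.0 - maxshrink * i / steps
        lo, hi = lo0 * shrink, hi0 * shrink
        q = quantize_dequantize(w, lo, hi)
        err = sum(p * abs(qi - wi) ** norm for p, qi, wi in zip(importance, q, w))
        if err < best_err:
            best_err, best = err, (lo, hi)
    return best
```

Channels with large E[x²] contribute more to `err`, so the search avoids ranges that clip them; with uniform importance this degenerates to an ordinary MSE grid search.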

Usage:

```yaml
recipe:
  - IMatrixGatherer:
      ignore: ["lm_head"]
  - QuantizationModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            observer: imatrix_mse
```

Composes with GPTQ:

```yaml
recipe:
  - IMatrixGatherer:
      ignore: ["lm_head"]
  - GPTQModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            observer: imatrix_mse
```

Results (W4A16, WikiText-2 token-level PPL, 141 chunks x 2048, open_platypus calibration 512 samples):

Llama-3.1-8B, group_size=128:

| Config | PPL |
| --- | --- |
| FP16 baseline | 6.24 |
| RTN memoryless_minmax | 6.96 |
| RTN imatrix_mse | 6.97 |
| GPTQ | 6.89 |
| GPTQ + imatrix_mse | 6.82 |
| RTN imatrix_mse (tuned) | 6.81 |

Llama-3.1-8B, group_size=32:

| Config | PPL |
| --- | --- |
| RTN memoryless_minmax | 6.74 |
| RTN imatrix_mse | 6.73 |
| GPTQ | 6.70 |
| GPTQ + imatrix_mse | 6.66 |
| RTN imatrix_mse (tuned) | 6.60 |

Llama-3.1-70B, group_size=128:

| Config | PPL |
| --- | --- |
| FP16 baseline | 2.81 |
| RTN memoryless_minmax | 4.47 |
| RTN imatrix_mse | 3.80 |
| RTN imatrix_mse (tuned) | 3.40 |

With default settings, GPTQ + imatrix_mse improves GPTQ by 0.07 PPL at gs128 and 0.04 at gs32. With tuned observer settings (norm=3.2, maxshrink=0.95, maxgrow=0.10), RTN imatrix_mse outperforms GPTQ at both group sizes (~5min gs128, ~14min gs32 vs ~35min GPTQ). On 70B, iMatrix reduces the W4 degradation from +1.66 (minmax) to +0.60 (tuned), a 2.8x improvement.
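For illustration, the tuned settings above could be expressed in a recipe like the following; the `observer_kwargs` key and exact parameter names follow the discussion in this thread, so treat the schema as an assumption rather than settled API:

```yaml
recipe:
  - IMatrixGatherer:
      ignore: ["lm_head"]
  - QuantizationModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            observer: imatrix_mse
            observer_kwargs:      # key name assumed from the review discussion
              norm: 3.2
              maxshrink: 0.95
              maxgrow: 0.10
```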

Eval method: GPTQ paper (concatenate WikiText-2 test, non-overlapping segments of 2048 tokens, exp(mean(NLLs))).
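The evaluation protocol can be sketched in a few lines; the `nll_per_token` callable is a hypothetical stand-in for the model forward pass:

```python
import math

# Sketch of the GPTQ-paper perplexity protocol described above: concatenate
# the test set, split into non-overlapping fixed-length segments, average the
# per-segment mean NLLs, and exponentiate.
def wikitext_ppl(token_ids, nll_per_token, seqlen=2048):
    nlls = []
    for start in range(0, len(token_ids) - seqlen + 1, seqlen):
        segment = token_ids[start:start + seqlen]
        # mean negative log-likelihood over one 2048-token segment
        nlls.append(sum(nll_per_token(t) for t in segment) / len(segment))
    return math.exp(sum(nlls) / len(nlls))
```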

RFC #2456


TEST PLAN:

45 tests added (all CPU, no GPU required):

  • test_imatrix.py (26 tests): grid search with non-uniform importance, CHANNEL/GROUP/TENSOR_GROUP/BLOCK strategies, actorder g_idx reordering, strict vs non-strict fallback, validation (shape, dtype, finite, non-negative)
  • test_imatrix_gatherer.py (15 tests): E[x²] collection, ignore list, accumulation correctness, hook cleanup, weight immutability, on_event lifecycle
  • test_e2e_integration.py (4 tests): full pipeline via oneshot() on nm-testing/tinysmokellama-3.2, gatherer-only, observer fallback without gatherer, regex targets
```shell
python -m pytest tests/llmcompressor/observers/test_imatrix.py tests/llmcompressor/modifiers/transform/imatrix/ -v
```

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quantization capabilities by introducing an importance-weighted Mean Squared Error (MSE) observer. By dynamically assessing the signal importance of each activation channel, the system can make more informed decisions during quantization range selection, leading to improved model performance with reduced precision. This new approach integrates seamlessly with existing quantization methods, offering a general boost to efficiency without sacrificing accuracy.

Highlights

  • New Quantization Components: Introduced two new components: IMatrixGatherer for collecting per-channel activation importance (E[x²]) and imatrix_mse observer for importance-weighted quantization error minimization.
  • Improved Quantization Quality: The imatrix_mse observer enhances range selection by prioritizing channels with higher signal, leading to improved perplexity (PPL) in Llama-3.1-8B models across various quantization methods like RTN, AWQ, and GPTQ.
  • Seamless Integration: The new iMatrix approach is additive, meaning it can be composed with existing quantization techniques (e.g., AWQ, GPTQ) to further boost efficiency and quality without degradation.
  • Comprehensive Testing: Added 49 new tests, including unit tests for both IMatrixGatherer and imatrix_mse observer, as well as end-to-end integration tests to ensure robust functionality and correct behavior, including fallback mechanisms.


Changelog
  • src/llmcompressor/modifiers/transform/imatrix/__init__.py
    • Added base module to the imatrix package.
  • src/llmcompressor/modifiers/transform/imatrix/base.py
    • Implemented the IMatrixGatherer class for collecting activation importance statistics.
    • Defined lifecycle methods (on_initialize, on_start, on_event, on_end, on_finalize) for hook management and data accumulation.
    • Included logic for resolving target modules, registering forward pre-hooks, and computing/attaching importance to modules.
  • src/llmcompressor/observers/__init__.py
    • Imported the new imatrix observer module.
  • src/llmcompressor/observers/imatrix.py
    • Implemented the IMatrixMSEObserver class, extending the MSE grid search with importance weighting.
    • Added methods for preparing and validating importance weights, handling g_idx reordering, and supporting various quantization strategies.
    • Included a _grid_search function for importance-weighted min/max range optimization.
  • tests/llmcompressor/modifiers/transform/imatrix/test_e2e_integration.py
    • Added end-to-end integration tests to verify the combined functionality of IMatrixGatherer and imatrix_mse observer.
    • Included tests for successful quantization, gatherer-only operation, and observer fallback without a gatherer.
  • tests/llmcompressor/modifiers/transform/imatrix/test_imatrix_gatherer.py
    • Added unit tests for IMatrixGatherer covering importance collection, ignore list functionality, accumulation correctness, hook removal, and weight immutability.
  • tests/llmcompressor/observers/test_imatrix.py
    • Added unit tests for IMatrixMSEObserver addressing specific bug fixes (global_scale with TENSOR_GROUP, g_idx reordering, weight-only guard) and general functionality.
    • Included tests for importance validation, fallback mechanisms, and the maxgrow parameter.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the IMatrixGatherer and imatrix_mse observer, which together enable importance-weighted quantization. The IMatrixGatherer collects per-channel activation importance (E[x²]) using forward pre-hooks, ensuring that channels carrying more signal receive more careful range optimization during quantization. The imatrix_mse observer extends the existing MSE grid search to incorporate these importance weights, improving quantization quality. The new functionality is additive, enhancing existing quantization methods like AWQ and GPTQ without degrading quality, and offers significant performance benefits for RTN quantization. Comprehensive unit and integration tests have been added to validate the new components and their interactions.

@Yatimai force-pushed the feat/imatrix-observer branch 2 times, most recently from 6b6cd21 to b982e23 on March 16, 2026 at 18:26
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.


@HDCharles HDCharles left a comment


Looks good. Once comments are addressed, I think there are a few final pieces:

  1. add an example to demonstrate the technique
  2. either add the feature where iMatrixGatherer is prepended to the recipe if the observer is added or leave a TODO for that feature

@Yatimai force-pushed the feat/imatrix-observer branch from b982e23 to 9bc268b on March 17, 2026 at 16:32
@mergify bot added the `documentation` (Improvements or additions to documentation) label on Mar 17, 2026
@Yatimai force-pushed the feat/imatrix-observer branch from 9bc268b to adef032 on March 17, 2026 at 16:43

Yatimai commented Mar 17, 2026

All review comments addressed:

  • Per-token accumulation (aligned with llama.cpp)
  • flatten_for_calibration for importance broadcasting (handles all strategies + g_idx)
  • Replaced _warn with direct logger.warning() + return None
  • TODO for global_scale importance support
  • TODO for grid search refactor
  • Added example in examples/quantization_w4a16/
  • Auto-prepend TODO in gatherer docstring
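The per-token accumulation in the first bullet can be sketched as follows; this is a plain-Python illustration of the statistic (per-channel sum of x² over tokens, divided by token count), with hypothetical names, not the PR's implementation:

```python
class ImportanceAccumulator:
    """Accumulate per-channel sum of x^2 over tokens, then report E[x^2]."""

    def __init__(self, in_features):
        self.sum_sq = [0.0] * in_features
        self.count = 0

    def update(self, tokens):
        # tokens: iterable of per-token activation vectors (len == in_features)
        for token in tokens:
            for c, x in enumerate(token):
                self.sum_sq[c] += x * x
            self.count += 1

    def importance(self):
        # E[x^2] per channel; this is what the observer uses as weights
        return [s / self.count for s in self.sum_sq]
```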


@HDCharles HDCharles left a comment


Can you update the APIs and examples across the PR code and descriptions to be a bit cleaner by using the weight_observer argument in the XModifier, and make sure the defaults are reasonable for most users? You could also mutate an existing scheme, e.g.

```python
scheme = preset_name_to_scheme('W4A16', ['Linear'])
scheme.weights.observer = 'imatrix_mse'
scheme.weights.observer_kwargs = ...

...config_groups={"group_0": scheme}
```

We probably want the simplest API for the example, and can document the alternatives in a README explaining the technique and usage.


@HDCharles HDCharles left a comment


see remaining comments, i.e. documentation, improve API for better UX, tests, norm?


@brian-dellabetta brian-dellabetta left a comment


Thanks for preparing this @Yatimai, I think things are looking good from a code standpoint. I have a few nits, and it would be good to include a README that helps users understand when and how to use this, potentially including some of the nice results you have in the PR summary.


@kylesayrs kylesayrs left a comment


From a functionality perspective, I'm a little confused how imatrix computation is actually occurring:

The IMatrixMSEObserver requires that _compute_and_attach is called in order to use the importance values. However, _compute_and_attach is only called at the sequential epoch end and the observer is never called after _compute_and_attach is called, so the importance values are never used?

From a design perspective, the implementation seems split between the IMatrixMSEObserver and IMatrixGatherer. This design is difficult to work with because it allows users to create footgun recipes where they have an imatrix observer but no imatrix modifier, and vice versa.

I recommend implementing only the IMatrixQuantizationModifier. This should help simplify the design a lot. This also prevents someone from using the IMatrixMSEObserver for activations, which doesn't really make sense.


HDCharles commented Mar 17, 2026

> From a functionality perspective, I'm a little confused how imatrix computation is actually occurring:
>
> The IMatrixMSEObserver requires that _compute_and_attach is called in order to use the importance values. However, _compute_and_attach is only called at the sequential epoch end and the observer is never called after _compute_and_attach is called, so the importance values are never used?
>
> From a design perspective, the implementation seems split between the IMatrixMSEObserver and IMatrixGatherer. This design is difficult to work with because it allows users to create footgun recipes where they have an imatrix observer but no imatrix modifier, and vice versa.
>
> I recommend implementing only the IMatrixQuantizationModifier. This should help simplify the design a lot. This also prevents someone from using the IMatrixMSEObserver for activations, which doesn't really make sense.

@kylesayrs This should compose with other modifiers, so it needs to be an observer. Changing that would make this pointless: it would no longer compose with GPTQ, so why use it?

The issue with footgun recipes is not hard to resolve by validating that both exist, or by adding the gatherer when the imatrix observer is detected.

@Yatimai force-pushed the feat/imatrix-observer branch from adef032 to 20f2a0a on March 18, 2026 at 17:55

Yatimai commented Mar 18, 2026

All review comments addressed:

  • Per-token accumulation
  • flatten_for_calibration for importance broadcasting (all strategies + g_idx)
  • Replaced _warn with direct logger.warning() + return None
  • TODO for global_scale importance support
  • TODO for grid search refactor
  • Example simplified with preset_name_to_scheme
  • README added in examples/quantization_w4a16/
  • Tests for all strategies: CHANNEL, GROUP, TENSOR_GROUP, BLOCK
  • Maxgrow tests removed per review
  • Actorder tests pruned to essential cases
  • E2E test with regex targets added
  • _imatrix_importance cleanup in on_finalize
  • _sums keyed by module instead of name
  • IMATRIX_PRECISION constant + math.prod()
  • Docstrings merged (file -> class)
  • Auto-prepend TODO in gatherer docstring

Updated benchmarks and results in the PR description.

43 tests pass. make style + make quality pass.


kylesayrs commented Mar 18, 2026

Hey @Yatimai ,

Sorry for the confusion w.r.t. the design of this feature. I’ve synced extensively with @HDCharles and I think I have some suggestions for an approach to this feature.

The confusion and difficulty mainly stem from how this PR interacts with [WIP] [DDP] Refactor quantization lifecycle for performance. I would recommend taking the following steps:

  1. Move all imatrix functionality to the observer. The observer should not only include the grid search implementation, but the IMatrixMSEObserver should also be responsible for adding a calibration hook to the module which is used to calculate the matrix.
    1. You can add a new method Observer.init(module: Module) to the Observer base class, called at the point where observers are initialized. This gives the IMatrixMSEObserver subclass the opportunity to attach a pre_forward hook to the module.
    2. You can add a new method Observer.detach(module: Module) which calls hook.remove() to remove the hook you added. This method should be called at the point where observers are deleted.
  2. Make sure that the observer deletes the imatrix off of the module so that it doesn’t end up in the checkpoint. This can be done during the aforementioned Observer.detach method.
  3. Now that all the functionality is in the observer, the IMatrixGatherer does not have any functionality. However, due to how QuantizationModifier triggers weight quantization at on start, we need some sort of way to trigger a calibration pass before weight quantization. For this reason, we must keep IMatrixGatherer to trigger a forward calibration pass.
Recipe: `[IMatrixGatherer, QuantizationModifier]`

```
1. IMatrixGatherer.start
   -> IMatrixObservers are attached
2. Sequential pipeline start
   -> module._imatrix_importance is calibrated
3. IMatrixGatherer.end
   -> Observers are removed
4. QuantizationModifier.start
   -> Observers are attached (_imatrix_importance remains on module)
   -> Weight quantization occurs (with imatrices)
5. Data-free pipeline start
   -> nothing happens
6. QuantizationModifier.end
   -> Observers are removed
```

IMatrixGatherer should call initialize_quantization on initialize, start_calibration on start, and end_calibration on end.
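The lifecycle above can be mocked end to end in plain Python. The `Module` class here stands in for torch.nn.Module, and the method names follow the sketch (init attaches the hook, detach computes `_imatrix_importance` and cleans up), so treat it as a schematic of the contract rather than the real API:

```python
class Module:                      # stand-in for torch.nn.Module
    def __init__(self):
        self._pre_hooks = []

    def forward(self, x):
        # run pre-forward hooks, as torch does, then pass the input through
        for hook in self._pre_hooks:
            hook(self, x)
        return x

class IMatrixObserver:
    def init(self, module):
        """Attach a pre-forward hook that accumulates sum(x^2) per channel."""
        module._imatrix_sum, module._imatrix_count = None, 0

        def _hook(mod, x):          # x: one activation vector per call
            sums = [v * v for v in x]
            if mod._imatrix_sum is None:
                mod._imatrix_sum = sums
            else:
                mod._imatrix_sum = [a + b for a, b in zip(mod._imatrix_sum, sums)]
            mod._imatrix_count += 1

        module._pre_hooks.append(_hook)
        self._hook = _hook

    def detach(self, module):
        """Compute E[x^2], remove the hook, and delete the raw buffers so
        they never end up in the checkpoint."""
        module._imatrix_importance = [
            s / module._imatrix_count for s in module._imatrix_sum
        ]
        module._pre_hooks.remove(self._hook)  # hook.remove() in torch
        del module._imatrix_sum, module._imatrix_count
```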

Changes After Refactor

Once [WIP] [DDP] Refactor quantization lifecycle for performance lands and weight quantization is moved to the end of the sequential epoch, we can do the following simplifications:

  1. Move module._imatrix_importance from the module to IMatrixObserver._imatrix_importance
  2. Update examples and docs to remove/deprecate IMatrixGatherer , leaving only IMatrixMSEObserver to implement the algorithm.
  3. Update QuantizationConfig.requires_calibration_data to be True if using the IMatrixMSEObserver


Yatimai commented Mar 18, 2026

Thanks @kylesayrs, this is very clear: moving the hook to the observer keeps the algorithm self-contained while the gatherer handles the lifecycle trigger. Happy to rework along these lines.


Yatimai commented Mar 19, 2026

@kylesayrs Quick question on the gatherer: since initialize_quantization/start_calibration/end_calibration are methods of QuantizationMixin, should IMatrixGatherer inherit from QuantizationMixin and build a minimal internal config? Or would you prefer it call the lower-level functions (initialize_observer, freeze_module_quantization) directly?

@kylesayrs

@Yatimai I would suggest inheriting from QuantizationMixin


Yatimai commented Mar 19, 2026

@kylesayrs I checked: the double apply_quantization_config is safe (clear_all_qparams resets known params first), and _imatrix_importance survives since it's not in the cleared params list. I'll proceed with IMatrixGatherer inheriting from QuantizationMixin.

@Yatimai force-pushed the feat/imatrix-observer branch from 20f2a0a to 810d7fd on March 19, 2026 at 19:34

Yatimai commented Mar 19, 2026

Refactor per @kylesayrs design:

  • Observer.init(module) / Observer.detach(module) added to base class
  • IMatrixMSEObserver now owns the E[x²] hook lifecycle (init attaches, detach computes importance)
  • calibration.py calls init after observer creation and detach before deletion
  • IMatrixGatherer inherits from QuantizationMixin, delegates to initialize_quantization / start_calibration / end_calibration
  • _imatrix_importance persists between gatherer and quantization modifier, cleaned up in on_finalize

Benchmarks unchanged; the refactor is structural only, with no impact on quantization results.

45 tests pass. make style + make quality clean.


@kylesayrs kylesayrs left a comment


I'll give a full review tomorrow, but no notes so far :)

@Yatimai force-pushed the feat/imatrix-observer branch from 810d7fd to 6b5cabd on March 20, 2026 at 00:51

@kylesayrs kylesayrs left a comment


Awesome tests! Looks great to me


@brian-dellabetta brian-dellabetta left a comment


Hi @Yatimai, thanks for all the work on this. I think the top-level API looks good, but I have some questions about the new observer API changes and whether they can be resolved another way. Please see comments:


---

## iMatrix Importance-Weighted Quantization

This documentation and the example above might be better in a separate examples/imatrix folder. It looks like this is pretty general for imatrix, and not super specific to W4A16.


I will move to examples/imatrix/.

```python
with align_module_device(module):
    return getattr(module, f"{self.base_name}_{name}", None)

def init(self, module: torch.nn.Module) -> None:
```

If init and detach are inverse operations, I think we should use names indicating as much:

Suggested change:

```diff
-def init(self, module: torch.nn.Module) -> None:
+def attach(self, module: torch.nn.Module) -> None:
```


I will rename init to attach.

```python
    mod._imatrix_sum.add_(token_sum)
    mod._imatrix_count += n_tokens

module._imatrix_hook = module.register_forward_pre_hook(_hook)
```

I think this is the first case of attaching hooks inside the implementation of an observer? Rather than expanding the API with attach/detach on observers, can't we instead use QuantizationMixin's _initialize_observers and _initialize_hooks functions? They add observers to modules and hooks to capture input activations. It seems like that same pattern could be used here.

I'm not sure how well this Observer attach/detach logic would play with the way we use observers elsewhere in the code, for example in AWQ (though that's only for weight observers, so maybe it's alright).


The Observer.attach/detach API was designed per @kylesayrs's suggestion to keep the E[x²] lifecycle self-contained in the observer.

Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
@Yatimai force-pushed the feat/imatrix-observer branch from 63eb096 to 8429a6f on March 25, 2026 at 08:42

@HDCharles HDCharles left a comment


Looks good; should be ready to land after the last batch of changes.
