feat: add iMatrix weighted MSE observer and IMatrixGatherer #2473
Yatimai wants to merge 2 commits into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances quantization by introducing an importance-weighted Mean Squared Error (MSE) observer. By assessing the signal importance of each activation channel, the system can make more informed decisions during quantization range selection, improving model quality at reduced precision. The new approach integrates with existing quantization methods. Highlights
Activity
Code Review
This pull request introduces the IMatrixGatherer and imatrix_mse observer, which together enable importance-weighted quantization. The IMatrixGatherer collects per-channel activation importance (E[x²]) using forward pre-hooks, ensuring that channels carrying more signal receive more careful range optimization during quantization. The imatrix_mse observer extends the existing MSE grid search to incorporate these importance weights, improving quantization quality. The new functionality is additive, enhancing existing quantization methods like AWQ and GPTQ without degrading quality, and offers significant performance benefits for RTN quantization. Comprehensive unit and integration tests have been added to validate the new components and their interactions.
Force-pushed 6b6cd21 to b982e23
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
HDCharles
left a comment
looks good, once comments are addressed i think there are a few final pieces
- add an example to demonstrate the technique
- either add the feature where iMatrixGatherer is prepended to the recipe if the observer is added or leave a TODO for that feature
Force-pushed b982e23 to 9bc268b
Force-pushed 9bc268b to adef032
All review comments addressed.
HDCharles
left a comment
can you update the APIs and examples across the PR code and descriptions to be a bit cleaner by using the `weight_observer` argument in the XModifier, making sure the defaults are reasonable for most users? Could also mutate an existing scheme, e.g.

```python
scheme = preset_name_to_scheme('W4A16', ['Linear'])
scheme.weights.observer = 'imatrix_mse'
scheme.weights.observer_kwargs = ...
...config_groups={"group_0": scheme}
```

probably want the simplest API for the example and can document the alternatives in a README explaining the technique and usage
HDCharles
left a comment
see remaining comments, i.e. documentation, improve API for better UX, tests, norm?
brian-dellabetta
left a comment
Thanks for preparing this @Yatimai, I think things are looking good from a code standpoint. I have a few nits, and it would be good to include a README that helps users understand when and how to use this, potentially including some of the nice results you have in the PR summary.
From a functionality perspective, I'm a little confused about how the imatrix computation is actually occurring:
The IMatrixMSEObserver requires that _compute_and_attach is called in order to use the importance values. However, _compute_and_attach is only called at the sequential epoch end and the observer is never called after _compute_and_attach is called, so the importance values are never used?
From a design perspective, the implementation seems split between the IMatrixMSEObserver and IMatrixGatherer. This design is difficult to work with because it allows users to create footgun recipes where they have an imatrix observer but no imatrix modifier, and vice versa.
I recommend implementing only the IMatrixQuantizationModifier. This should help simplify the design a lot. This also prevents someone from using the IMatrixMSEObserver for activations, which doesn't really make sense.
@kylesayrs This should compose with other modifiers, so it needs to be an observer; changing that would make it pointless, since it would no longer compose with GPTQ. The issue with footgun recipes is not hard to resolve by validating that both exist, or by adding the gatherer when the imatrix observer is detected.
Force-pushed adef032 to 20f2a0a
All review comments addressed:
Updated benchmarks and results in the PR description. 43 tests pass.
Hey @Yatimai , Sorry for the confusion w.r.t. the design of this feature. I’ve synced extensively with @HDCharles and I think I have some suggestions for an approach to this feature. The confusion and difficulty mainly stems from how this PR interacts with [WIP] [DDP] Refactor quantization lifecycle for performance. I would recommend taking the following steps:
Changes After RefactorOnce [WIP] [DDP] Refactor quantization lifecycle for performance lands and weight quantization is moved to the end of the sequential epoch, we can do the following simplifications:
Thanks @kylesayrs, this is very clear: moving the hook to the observer keeps the algorithm self-contained while the gatherer handles the lifecycle trigger. Happy to rework along these lines.
@kylesayrs Quick question on the gatherer: since
@Yatimai I would suggest inheriting from
@kylesayrs I checked: the double
Force-pushed 20f2a0a to 810d7fd
Refactor per @kylesayrs design:
Benchmarks unchanged; the refactor is structural only, with no impact on quantization results. 45 tests pass.
kylesayrs
left a comment
I'll give a full review tomorrow, but no notes so far :)
Force-pushed 810d7fd to 6b5cabd
kylesayrs
left a comment
Awesome tests! Looks great to me
Force-pushed 6b5cabd to 63eb096
brian-dellabetta
left a comment
Hi @Yatimai, thanks for all the work on this. I think the top-level API looks good, but I have some questions on the new observer API changes and whether they can be resolved another way. Please see comments:
> ## iMatrix Importance-Weighted Quantization
this documentation and the example above might be better in a separate examples/imatrix folder. it looks like this is pretty general for imatrix, and not super specific to W4A16
I will move to examples/imatrix/.
```python
with align_module_device(module):
    return getattr(module, f"{self.base_name}_{name}", None)

def init(self, module: torch.nn.Module) -> None:
```
if init and detach are inverse operations, i think we should use names indicating as such
```diff
- def init(self, module: torch.nn.Module) -> None:
+ def attach(self, module: torch.nn.Module) -> None:
```
I will rename init to attach.
```python
mod._imatrix_sum.add_(token_sum)
mod._imatrix_count += n_tokens

module._imatrix_hook = module.register_forward_pre_hook(_hook)
```
i think this is the first case of attaching hooks inside the implementation of an observer? Rather than expanding the API with attach/detach on observers, can't we instead use QuantizationMixin's _initialize_observers and _initialize_hooks functions? They add observers to modules and hooks to capture input activations. It seems the same pattern could be used here.
I'm not sure how well this Observer attach/detach logic would play with the way we use observers elsewhere in the code, for example in AWQ here (though that's only for weight observers, so maybe it's alright)
The Observer.attach/detach API was designed per @kylesayrs's suggestion to keep the E[x²] lifecycle self-contained in the observer.
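For readers following this thread, here is a minimal, self-contained sketch of the E[x²] pre-hook mechanism under discussion. The buffer names `_imatrix_sum` and `_imatrix_count` come from the diff quoted above; the function name and all other details are hypothetical illustration, not the PR's code.

```python
# Hedged sketch of per-channel E[x^2] collection via a forward pre-hook.
# Buffer names (_imatrix_sum, _imatrix_count) mirror the diff above;
# everything else is an assumption for illustration.
import torch

def attach_imatrix_hook(module):
    in_features = module.weight.shape[1]
    module._imatrix_sum = torch.zeros(in_features)  # running sum of x^2 per input channel
    module._imatrix_count = 0                       # number of tokens seen

    def _hook(mod, args):
        # args[0] is the input activation; flatten batch/sequence dims to (tokens, in_features)
        x = args[0].detach().reshape(-1, in_features)
        mod._imatrix_sum += (x * x).sum(dim=0)
        mod._imatrix_count += x.shape[0]

    # per-channel importance E[x^2] is then _imatrix_sum / _imatrix_count
    return module.register_forward_pre_hook(_hook)
```

Removing the returned handle after calibration (as the detach step would) leaves the accumulated buffers on the module for the observer to consume.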
Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
Force-pushed 63eb096 to 8429a6f
HDCharles
left a comment
looks good, should be ready to land after the last batch of changes
SUMMARY:
Adds `imatrix_mse`, a new weight observer that uses per-channel activation importance (E[x²]) to weight the quantization error during range selection. Channels that carry more signal get more careful range optimization.

Two components:
- `IMatrixGatherer` (`modifiers/transform/imatrix/base.py`): lifecycle trigger that inherits from `QuantizationMixin`. Triggers a calibration pass so the observer can collect E[x²]. Does not quantize; it delegates to the subsequent `QuantizationModifier`/`GPTQModifier`.
- `imatrix_mse` observer (`observers/imatrix.py`): extends the MSE grid search with importance weighting: `err = sum(importance * |Q(w) - w|^p)`. Owns the E[x²] collection lifecycle via `Observer.init(module)`/`Observer.detach(module)`. Supports CHANNEL, GROUP, TENSOR_GROUP, and BLOCK strategies via `flatten_for_calibration`. Falls back to uniform MSE when importance is unavailable (`strict=False` default).

Usage:
Composes with GPTQ:
Results (W4A16, WikiText-2 token-level PPL, 141 chunks x 2048, `open_platypus` calibration, 512 samples):

- Llama-3.1-8B, group_size=128: table comparing `memoryless_minmax`, `imatrix_mse`, `imatrix_mse` (tuned) [values not recovered from the page]
- Llama-3.1-8B, group_size=32: table comparing `memoryless_minmax`, `imatrix_mse`, `imatrix_mse` (tuned) [values not recovered from the page]
- Llama-3.1-70B, group_size=128: table comparing `memoryless_minmax`, `imatrix_mse` (tuned) [values not recovered from the page]

With default settings, GPTQ + `imatrix_mse` improves GPTQ by 0.07 PPL at gs128 and 0.04 at gs32. With tuned observer settings (norm=3.2, maxshrink=0.95, maxgrow=0.10), RTN `imatrix_mse` outperforms GPTQ at both group sizes (~5 min gs128, ~14 min gs32 vs ~35 min GPTQ). On 70B, iMatrix reduces the W4 degradation from +1.66 (minmax) to +0.60 (tuned), a 2.8x improvement.

Eval method: per the GPTQ paper (concatenate the WikiText-2 test set, non-overlapping segments of 2048 tokens, `exp(mean(NLLs))`).

RFC #2456
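To make the weighted range search concrete, here is a pure-Python sketch of the error criterion `err = sum(importance * |Q(w) - w|^p)` applied to a symmetric int4 scale search. The parameter names `maxshrink` and `norm_p` echo the observer settings mentioned above, but the grid, defaults, and function name are assumptions for illustration, not the PR's implementation.

```python
# Illustrative sketch (not the PR's code) of importance-weighted MSE range
# search for symmetric int4 weight quantization. Assumes absmax > 0.
def imatrix_mse_scale(w, importance, bits=4, maxshrink=0.8, steps=80, norm_p=2.4):
    """Return a symmetric per-tensor scale minimizing sum(importance * |Q(w)-w|**p)."""
    qmax = 2 ** (bits - 1) - 1            # 7 for symmetric int4
    absmax = max(abs(x) for x in w)
    best_err, best_scale = float("inf"), absmax / qmax
    for i in range(steps):
        # shrink the candidate range from absmax down toward (1 - maxshrink) * absmax
        scale = (1.0 - maxshrink * i / steps) * absmax / qmax
        err = 0.0
        for x, imp in zip(w, importance):
            q = max(-qmax - 1, min(qmax, round(x / scale))) * scale  # fake-quantize
            err += imp * abs(q - x) ** norm_p
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```

With uniform importance this reduces to the plain MSE search, and scaling all importances by a constant leaves the chosen scale unchanged; only the relative importance between channels shifts the result.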
TEST PLAN:
45 tests added (all CPU, no GPU required):

- `test_imatrix.py` (26 tests): grid search with non-uniform importance, CHANNEL/GROUP/TENSOR_GROUP/BLOCK strategies, actorder `g_idx` reordering, strict vs non-strict fallback, validation (shape, dtype, finite, non-negative)
- `test_imatrix_gatherer.py` (15 tests): E[x²] collection, ignore list, accumulation correctness, hook cleanup, weight immutability, on_event lifecycle
- `test_e2e_integration.py` (4 tests): full pipeline via `oneshot()` on nm-testing/tinysmokellama-3.2, gatherer-only, observer fallback without gatherer, regex targets
test_imatrix.py(26 tests): grid search with non-uniform importance, CHANNEL/GROUP/TENSOR_GROUP/BLOCK strategies, actorderg_idxreordering, strict vs non-strict fallback, validation (shape, dtype, finite, non-negative)test_imatrix_gatherer.py(15 tests): E[x²] collection, ignore list, accumulation correctness, hook cleanup, weight immutability, on_event lifecycletest_e2e_integration.py(4 tests): full pipeline viaoneshot()onnm-testing/tinysmokellama-3.2, gatherer-only, observer fallback without gatherer, regex targets