Bug fix for AutoQuantize: Add quantization skipping back; Add disable_layers to the search; cleaner names for search configs #332
Conversation
Walkthrough
QuantRecipe and AutoQuantize were redesigned: quant formats can be strings/dicts/None with derived names, per-layer recipe choices replace a single global recipe, disabled-layer filtering was added, recipes are deduplicated, and constraints switched to an effective_bits model; scoring, costs, and memory/gradient handling were updated.
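For illustration, the redesigned format inputs might look like the following. This is a sketch based on the walkthrough; the preset name and the dict layout are assumptions, not taken verbatim from this PR:

```python
# Illustrative only: entries may be preset names (str), explicit config dicts, or None.
quantization_formats = [
    "NVFP4_DEFAULT_CFG",  # named preset; its name is reused for display
    {  # custom dict config; a display name is derived for it (assumed structure)
        "quant_cfg": {"*weight_quantizer": {"num_bits": 8}},
        "algorithm": "max",
    },
    None,  # explicit "no quantization" choice
]
```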
Sequence Diagram(s)
```mermaid
sequenceDiagram
actor User
participant Model as model_quant.auto_quantize
participant Searcher as AutoQuantizeSearcher
participant Recipe as QuantRecipe
User->>Model: auto_quantize(formats, disabled_layers, constraints)
Model->>Recipe: get_auto_name_for_config(cfg) for each format
Model->>Searcher: pass [(cfg,name),...], disabled_layers, constraints
activate Searcher
Searcher->>Searcher: build QuantRecipe objects (include QuantRecipe(None)), dedupe & name
Searcher->>Searcher: assign per-module QuantRecipeHparam(choices) (merge rules)
Searcher->>Searcher: apply disabled_layers → replace choices with no-quant where matched
Searcher->>Searcher: per-recipe cost/score estimation, memory & gradient management
Searcher->>Searcher: evaluate effective_bits constraint, select best recipe set
Searcher-->>Model: return best quantization plan (includes effective_bits)
deactivate Searcher
    Model-->>User: return quantized model/results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
See the search output from Shreyas: (All layers are quantized with INT8 after search)
Actionable comments posted: 0
🧹 Nitpick comments (5)
modelopt/torch/quantization/algorithms.py (5)
304-312: Make recipe ordering deterministic; avoid set-induced nondeterminism.
Using set(...) here can yield non-deterministic ordering across ranks/runs when multiple formats share the same compression. Sort is stable only w.r.t. the input sequence, but a set destroys that. This can lead to per-rank differences in formats[] ordering even if the final DP/TP sync masks some outcomes.
Refactor to stable de-duplication and deterministic sort with a tie-breaker:
```diff
 @staticmethod
 def _get_search_recipes(quantization_formats):
-    return sorted(
-        set(
-            [
-                QuantRecipe(quant_cfg=q, quant_format_idx=i)
-                for i, q in enumerate(quantization_formats)
-            ]
-            + [QuantRecipe(quant_cfg=None, quant_format_idx=None)]
-        )
-    )
+    # Build in declared order and append the "no-quant" recipe
+    recipes = [
+        QuantRecipe(quant_cfg=q, quant_format_idx=i)
+        for i, q in enumerate(quantization_formats)
+    ]
+    recipes.append(QuantRecipe(quant_cfg=None, quant_format_idx=None))
+
+    # Stable de-duplication via hash/eq while preserving first occurrence
+    uniq, seen = [], set()
+    for r in recipes:
+        if r not in seen:
+            uniq.append(r)
+            seen.add(r)
+
+    # Deterministic order: primary by compression, tie-break by string representation
+    return sorted(uniq, key=lambda r: (r.compression, str(r)))
```
341-343: Outdated TODO contradicts this PR.
Comment says "remove the no-quant recipe" but this PR intentionally adds/retains it. Remove to avoid confusion.
```diff
-        # TODO: remove the no-quant recipe
```
103-106: Clearer label for the no-quant recipe in logs.
"quantization_formats[None]" is a bit opaque. Prefer "no-quant".
```diff
-        self.str_repr = (
-            f"quantization_formats[{quant_format_idx}]:effective-bits-{self.compression * 16}"
-        )
+        label = (
+            "no-quant" if quant_format_idx is None else f"quantization_formats[{quant_format_idx}]"
+        )
+        self.str_repr = f"{label}:effective-bits-{self.compression * 16}"
```
161-164: Redundant None-recipe injection; also suffers from set() nondeterminism.
None is added here and in _get_search_recipes(). Keeping both is harmless due to de-dup, but it's redundant and this set() also scrambles order. Either rely on a single source (preferred), or make this block deterministic too.
```diff
-        choices = sorted(set(choices) | {QuantRecipe(quant_cfg=None)})
+        # Ensure "no-quant" is present while preserving declared order
+        tmp = list(choices) + [QuantRecipe(quant_cfg=None)]
+        uniq, seen = [], set()
+        for r in tmp:
+            if r not in seen:
+                uniq.append(r)
+                seen.add(r)
+        choices = sorted(uniq, key=lambda r: (r.compression, str(r)))
```
121-127: Safer equality: return NotImplemented on type mismatch.
Asserting the type inside __eq__ can raise unexpectedly during hashing/collections operations. Return NotImplemented when other is not a QuantRecipe.
```diff
-    def __eq__(self, other: object):
-        assert isinstance(other, QuantRecipe)
-        return self.config == other.config
+    def __eq__(self, other: object):
+        if not isinstance(other, QuantRecipe):
+            return NotImplemented
+        return self.config == other.config
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- modelopt/torch/quantization/algorithms.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
This "fix", if it really does fix, would be strange because we always put No quant in the choices as in:
Something else is not right. We should dig into it. Where is the link to the issue? I'll check.
I see. The actual searched configs do not have None quantization. See
and
@shengliangxu let me actually search by hparam choices instead of
OK, I read this part of the code again, and the change here seems correct. We need the enumeration of None to include no-quant in the search. I also searched the git history to see how this happened. We previously added None to the quantization_format arg in most code paths, especially in the hf_ptq.py script, which is why we did not hit the issue. But there were already code paths that did not add None, and apparently we never verified no-quant in those code paths.
586766f to d979c5d
Actionable comments posted: 0
🧹 Nitpick comments (6)
modelopt/torch/quantization/algorithms.py (5)
67-76: Clamp tuple num_bits compression to [0, 1].
For tuple num_bits you return (sum(bits)+1)/16, which can exceed 1.0 (e.g., (8,8) -> 17/16). That violates the stated 0..1 range and can inflate costs. Clamp and drop the +1.
Apply this diff:
```diff
-        if isinstance(quantizer_attr_cfg.num_bits, tuple):
-            return (sum(quantizer_attr_cfg.num_bits) + 1) / 16
+        if isinstance(quantizer_attr_cfg.num_bits, tuple):
+            # Approximate combined-bit formats; keep within [0, 1]
+            ratio = sum(quantizer_attr_cfg.num_bits) / 16
+            return max(0.0, min(1.0, ratio))
```
183-186: Default active recipe should be "no‑quant" to match docs.
You sort choices and set original to choices[0] (most aggressive). Given we "add no‑quant by default", consider making NONE the original for clearer semantics and safer defaults.
Apply this diff:
```diff
-        super().__init__(choices, original=choices[0])
+        # Prefer no‑quant as the original/default when present
+        no_quant = next((c for c in choices if c.compression >= 1.0), choices[-1])
+        super().__init__(choices, original=no_quant)
```
193-211: Potential heavy re-instantiation of quantizers per recipe.
Creating fresh TensorQuantizer instances for every (module, recipe) can be memory-heavy on large models. Consider lazy materialization on first activation or pooling instances.
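A minimal sketch of the lazy/pooled alternative suggested here, assuming a hypothetical `build_quantizers(recipe)` factory (not part of this PR):

```python
class LazyRecipeQuantizers:
    """Build the quantizer set for a recipe only when it first becomes active, then reuse it."""

    def __init__(self, build_quantizers):
        self._build = build_quantizers  # hypothetical factory: recipe -> quantizer set
        self._cache = {}

    def get(self, recipe):
        # Materialize on first use; later activations of the same recipe hit the cache.
        if recipe not in self._cache:
            self._cache[recipe] = self._build(recipe)
        return self._cache[recipe]
```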
367-368: Prefer driving via the hparam API to avoid attribute shadowing.
Directly assigning module.quant_recipe could shadow the registered Hparam. Use the Hparam to set NONE explicitly.
Apply this diff:
```diff
-        module.quant_recipe = QuantRecipe(quant_cfg=None)
+        # Switch to no‑quant via the registered hparam
+        qr_hp = module.get_hparam("quant_recipe")
+        none_recipe = next(c for c in qr_hp.choices if c.compression >= 1.0)
+        qr_hp.active = none_recipe
```
720-725: Minor: remove leading space in formatted effective bits.
Cosmetic nit: f"{x: .2f}" adds a leading space for positives.
Apply this diff:
- f"AutoQuantize effective bits from search: {effective_bits_from_search: .2f}" + f"AutoQuantize effective bits from search: {effective_bits_from_search:.2f}"modelopt/torch/quantization/model_quant.py (1)
409-422: Preprocess formats with stable display names — good; consider aligning defaults.
The tuple (cfg, name) handoff is right. Minor: default formats here use NVFP4_AWQ_LITE_CFG while the searcher's default uses NVFP4_DEFAULT_CFG; align to reduce surprises.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- modelopt/torch/quantization/algorithms.py (19 hunks)
- modelopt/torch/quantization/model_quant.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
modelopt/torch/quantization/model_quant.py (1)
modelopt/torch/quantization/algorithms.py (2)
- QuantRecipe (84-166)
- get_auto_name_for_config (119-128)
modelopt/torch/quantization/algorithms.py (6)
modelopt/torch/quantization/config.py (2)
- QuantizeConfig (1114-1128)
- QuantizerAttributeConfig (649-943)
modelopt/torch/quantization/conversion.py (1)
- set_quantizer_by_cfg (206-237)
modelopt/torch/quantization/nn/modules/quant_module.py (2)
- QuantLinearConvBase (129-169)
- QuantModule (37-96)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (3)
- SequentialQuantizer (1146-1254)
- TensorQuantizer (60-1143)
- enable (400-402)
modelopt/torch/quantization/utils.py (1)
- is_quantized_linear (246-256)
modelopt/torch/trace/symbols.py (1)
- named_modules (444-447)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (9)
modelopt/torch/quantization/algorithms.py (7)
110-112: KV output quantizer forced off — good call.
This ensures monotonic weight compression vs. accuracy across recipes and avoids skewing search.
116-129: Auto‑naming and explicit NONE support look solid.
Deriving names (including "NONE") simplifies UX and logging.
327-336: Recipe enumeration with dedup is correct and robust.
Accepting (cfg, name) tuples plus strings and deduping via QuantRecipe looks good.
468-515: disabled_layers integration LGTM.
Pattern-based gating before hparam insertion is clean and fixes the earlier mismatch where disabling happened post-search.
628-631: Constraint translation is correct.
effective_bits → weight_size_after_compression upper bound mapping is consistent.
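As a back-of-the-envelope check of that mapping (example numbers are assumed; the 16-bit baseline matches the compression metric used in this PR):

```python
total_weight_params = 8_000_000_000   # assumed example model size
effective_bits = 4.8                  # user constraint
baseline_bits = 16

max_compression = effective_bits / baseline_bits                 # 0.3 of the 16-bit size
weight_size_upper_bound = total_weight_params * effective_bits   # bits after compression
print(max_compression, weight_size_upper_bound / 8 / 1e9, "GB")  # 0.3  4.8 GB
```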
636-655: Use of hparam.choices in scoring fixes the root cause.
Iterating the per-layer choices (including NONE) resolves the "no‑quant missing from search" bug.
321-326: Incorrect — QuantLinearConvBase already inherits QuantModule via QuantInputBase.
QuantLinearConvBase(QuantInputBase) is declared in modelopt/torch/quantization/nn/modules/quant_module.py:129; QuantInputBase(QuantModule) at :102; QuantModule at :37 — the predicate "(is_quantized_linear(module) or isinstance(module, QuantLinearConvBase)) and isinstance(module, QuantModule)" will include QuantLinearConvBase instances.
Likely an incorrect or invalid review comment.
modelopt/torch/quantization/model_quant.py (2)
34-35: Importing QuantRecipe for naming — good integration.
Keeps display-name derivation centralized with the recipe logic.
438-440: Passing disabled_layers into the searcher fixes the previous post‑hoc disable.
Nice cleanup and consistent with new enumeration.
…_layers to the search; cleaner names for search configs
Signed-off-by: realAsma <[email protected]>
minor
Signed-off-by: realAsma <[email protected]>
9e43a0f to 892d293
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/quantization/algorithms.py (2)
366-391: Use Hparam.active to switch recipes during scoring (avoid assigning plain attribute).
`module.quant_recipe = ...` relies on dynamic attribute plumbing and recreates a new recipe instance. Switching via `get_hparam("quant_recipe").active` is clearer and guarantees the quantizer set is swapped using the prebuilt choices. Also reuse the existing "no‑quant" choice.
```diff
     def _estimate_auto_quantize_scores(self):
         # TODO: remove the no-quant recipe
         def auto_quantize_score_estimate_forward(module, input, *args, **kwargs):
-            module.quant_recipe = QuantRecipe(quant_cfg=None)
+            hparam = module.get_hparam("quant_recipe")
+            # Select the existing no-quant choice (compression >= 1.0).
+            no_quant = next(r for r in hparam.choices if r.compression >= 1.0)
+            hparam.active = no_quant
             output = module._forward_original(input, *args, **kwargs)
@@
             module.output_diff_dict = {}
             with torch.no_grad():
-                for recipe in module.get_hparam("quant_recipe").choices:
+                for recipe in hparam.choices:
                     if recipe.compression >= 1.0:
                         continue
-                    module.quant_recipe = recipe
+                    hparam.active = recipe
                     output_diff = module._forward_original(input, *args, **kwargs)
```
639-655: Monotonic score capping bug — cap isn't carried forward.
`prev_score` is set to the raw `score`, not the capped value, so a later item can increase again. Use the capped value to enforce non‑increasing scores across recipes.
```diff
             scores.append(min(score, prev_score))
             costs.append(cost)
-            prev_score = score
+            prev_score = min(score, prev_score)
```
🧹 Nitpick comments (7)
modelopt/torch/quantization/algorithms.py (7)
116-117: Stabilize string representation (format effective bits).
Printing raw floats can produce long/uneven strings and affect hashing/equality if formatting ever changes. Format to 2 decimals (consistent with later logs).
```diff
-        self._str_repr: str = f"{name}(effective-bits: {self.compression * 16})"
+        self._str_repr: str = f"{name}(effective-bits: {self.compression * 16:.2f})"
```
144-150: Make __eq__ robust for non-QuantRecipe comparisons.
`assert isinstance(other, QuantRecipe)` can raise unexpectedly; instead return `NotImplemented` for foreign types.
```diff
-    def __eq__(self, other: object):
-        assert isinstance(other, QuantRecipe)
-        return self._str_repr == other._str_repr
+    def __eq__(self, other: object):
+        if not isinstance(other, QuantRecipe):
+            return NotImplemented
+        return self._str_repr == other._str_repr
```
118-129: Naming: prefer "None" (title case) and validate `mtq_config.choices` presence.
Minor: return "None" instead of "NONE" for consistency with printed examples; also, `mtq_config.choices` must exist and enumerate valid format symbols.
```diff
-        if quant_cfg is None:
-            return "NONE"
+        if quant_cfg is None:
+            return "None"
```
Please confirm modelopt/torch/quantization/config.py (or its init) defines `choices` as the canonical list of format names; otherwise this loop may miss matches.
110-113: Scope of disabling output quantizer may be too broad.
`self.config.quant_cfg["*output_quantizer"] = QuantizerAttributeConfig(enable=False)` disables all output quantizers, not only KV cache. If the intent is to disable only KV cache quantization, consider scoping this with a narrower wildcard (e.g., to KV modules) or gating it behind a flag.
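A small illustration of how a narrower glob would behave; the module names and the narrower pattern are hypothetical, and the real quantizer names depend on the model:

```python
from fnmatch import fnmatch

names = [
    "model.layers.0.self_attn.k_proj.output_quantizer",
    "model.layers.0.mlp.down_proj.output_quantizer",
]
# "*output_quantizer" matches both; a narrower pattern matches only the attention one.
print([n for n in names if fnmatch(n, "*output_quantizer")])            # both
print([n for n in names if fnmatch(n, "*self_attn*output_quantizer")])  # attention only
```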
179-185: Default 'original' choice becomes the most aggressive recipe. Intentional?
Because `choices` are sorted by `compression`, `original=choices[0]` selects the most aggressive (smallest bits). If you intended the baseline to be no‑quant for reference, consider setting `original` to the no‑quant choice. Example tweak:
```diff
-        super().__init__(choices, original=choices[0])
+        # Prefer no-quant as original if present; else fallback to first.
+        no_quant = next((c for c in choices if c.compression >= 1.0), choices[0])
+        super().__init__(choices, original=no_quant)
```
401-410: Enabling grad and hooks for all parameters is heavy.
You note this TODO; scoping to the first parameter per module (or only the AQ-target modules) will reduce overhead notably on large LLMs.
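A rough sketch of the suggested narrowing, in generic PyTorch rather than the modelopt internals:

```python
import torch.nn as nn

def enable_grad_on_first_param_only(modules):
    # Track gradients for a single parameter per module so backward still reaches
    # the module's hooks without keeping grad state for every weight.
    for module in modules:
        param = next(module.parameters(recurse=False), None)
        if param is not None:
            param.requires_grad_(True)

layers = [nn.Linear(8, 8) for _ in range(3)]
for layer in layers:
    layer.requires_grad_(False)            # start fully frozen
enable_grad_on_first_param_only(layers)    # re-enables only each layer's first parameter
```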
689-699: Non-optimal LP: guard downstream usage and report effective bits meaningfully.
If status isn't Optimal, `selections` might be partial or undefined. Consider skipping best-recipe application and computing effective bits from the constraint target instead, or clearly logging that the value is undefined.
Also applies to: 720-728
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- modelopt/torch/quantization/algorithms.py (19 hunks)
- modelopt/torch/quantization/model_quant.py (3 hunks)
- tests/unit/torch/quantization/test_autoquant.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/quantization/model_quant.py
- tests/unit/torch/quantization/test_autoquant.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/algorithms.py (6)
modelopt/torch/quantization/config.py (2)
- QuantizeConfig (1114-1128)
- QuantizerAttributeConfig (649-943)
modelopt/torch/quantization/conversion.py (1)
- set_quantizer_by_cfg (206-237)
modelopt/torch/quantization/nn/modules/quant_module.py (2)
- QuantLinearConvBase (129-169)
- QuantModule (37-96)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (3)
- SequentialQuantizer (1146-1254)
- TensorQuantizer (60-1143)
- enable (400-402)
modelopt/torch/quantization/utils.py (1)
- is_quantized_linear (246-256)
modelopt/torch/utils/logging.py (1)
- print_rank_0 (92-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (1)
modelopt/torch/quantization/algorithms.py (1)
586-589: Double-check ModeloptStateManager.pop() semantics.
`ModeloptStateManager(self.model).state_dict().pop()` without a key assumes a list-like object. If this returns a `dict`, it will raise. Please confirm the API returns a poppable stack here.
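The distinction the reviewer is pointing at, in plain Python (the mode names below are made up):

```python
stack = [("quantize_mode", {}), ("autoquantize_mode", {})]
stack.pop()                       # OK for a list: removes and returns the last entry

mapping = {"quantize_mode": {}, "autoquantize_mode": {}}
# mapping.pop()                   # TypeError: pop expected at least 1 argument, got 0
mapping.pop("autoquantize_mode")  # dicts require a key
```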
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@           Coverage Diff            @@
##             main     #332    +/-   ##
==========================================
+ Coverage   73.82%   73.84%   +0.02%
==========================================
  Files         172      172
  Lines       17438    17455      +17
==========================================
+ Hits        12873    12890      +17
  Misses       4565     4565
```
☔ View full report in Codecov by Sentry.
What does this PR do?
Type of change: Bug fix, feature improvement
Overview:
In the latest main branch, we do not support skipping quantization with auto_quantize (even though the docstring says skipping quantization is supported by default).
This PR fixes that.
The above bug was previously causing Shreyas Misra's Llama 3.1 8B `auto_quantize` to fail: basically, all layers were quantized after his search because `auto_quantize` was not given the option to skip quantization (well, except `lm_head`, because we force-disable quantization of `lm_head` after the search). I am not sure when this bug was introduced.
In addition, this PR also does the following:
- Add a disabled_layers configuration to the search, so the search process accounts for the disabled layers. Previously, layers were disabled at the end of the search, undermining both the search and the user constraint.
- Cleaner names for the search formats
Previously, AutoQuantize printed the following in the output log:
This was unhelpful and confusing, and it caused us to not realize the search was broken. After this PR, the output log looks as follows:
Usage
# Add a code snippet demonstrating how to use this
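Since the snippet above was left as a placeholder, here is a hedged usage sketch based on the parameters discussed in this PR; the exact signature, format names, and return value may differ from the released API, and `model`, `calib_loader`, and the lambdas are assumed to be defined by the caller:

```python
import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    # Named presets; with this PR the searcher also considers skipping quantization per layer.
    quantization_formats=["NVFP4_DEFAULT_CFG", "INT8_DEFAULT_CFG"],
    # New in this PR: layers matching these wildcards are excluded inside the search itself.
    disabled_layers=["*lm_head*"],
    data_loader=calib_loader,
    forward_step=lambda m, batch: m(**batch),
    loss_func=lambda output, batch: output.loss,
)
```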
Testing
Results:
Model: Llama-3.1-8B-Instruct
Evaluation: MMLU from `lm_eval`, fake quantization via the `mtq.quantize` API, LM head not quantized (`disabled_layers=[*lm_head*]`)
Additional Information
@shreyas Misra
Summary by CodeRabbit
New Features
Improvements
Tests