
Conversation

realAsma
Contributor

@realAsma realAsma commented Sep 17, 2025

What does this PR do?

Type of change: Bug fix, feature improvement

Overview:

In the latest main branch, auto_quantize does not support skipping quantization (even though the docstring says skipping is supported by default).

This PR fixes that.

This bug previously caused Shreyas Misra's Llama 3.1 8B auto_quantize run to fail: essentially all layers were quantized after his search because auto_quantize was never given the option to skip quantization (except lm_head, whose quantization we force-disable after the search). I am not sure when this bug was introduced.

In addition, this PR does the following:

  1. Add a disabled_layers configuration to the search so that the search accounts for disabled layers. Previously, layers were disabled only at the end of the search, undermining both the search and the user constraint.

  2. Use cleaner names for search formats.
    Previously, AutoQuantize printed the following in the output log:

AutoQuantize best recipe for model.layers.30.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for lm_head: quantization_formats[0]:effective-bits-8.0

This output was unhelpful and confusing, and it kept us from realizing that the search was broken. After this PR, the output log looks as follows:

AutoQuantize best recipe for model.layers.30.mlp.down_proj: INT8_SMOOTHQUANT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for model.layers.31.self_attn.q_proj: INT8_SMOOTHQUANT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for model.layers.31.self_attn.o_proj: INT8_SMOOTHQUANT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for model.layers.31.mlp.gate_proj: INT8_SMOOTHQUANT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for model.layers.31.mlp.down_proj: None(effective-bits: 16.0)
AutoQuantize best recipe for lm_head: None(effective-bits: 16.0)

Usage

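The snippet below is a minimal usage sketch added for illustration, not taken from the PR. The quantization_formats list, the effective_bits constraint, and disabled_layers are the options touched by this PR; the remaining keyword names (data_loader, forward_step, loss_func) and the (model, state) return pair are assumptions about the existing auto_quantize signature and may not match it exactly.

```python
import modelopt.torch.quantization as mtq

# `model` and `calib_dataloader` are assumed to be defined by the caller.
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 9.0},  # target average bits per weight
    # Formats can be passed as named strings; the no-quant (skip) recipe is now
    # added to the search automatically, so None no longer needs to be listed.
    quantization_formats=["INT8_SMOOTHQUANT_CFG", "NVFP4_DEFAULT_CFG"],
    data_loader=calib_dataloader,
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,
    # New in this PR: layers matching these patterns are excluded from the
    # search itself rather than being disabled after the search finishes.
    disabled_layers=["*lm_head*"],
)
```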

Testing

Results:

Model: Llama-3.1-8B-Instruct
Evaluation: MMLU from lm_eval, fake quantization

| Method | Effective bits | MMLU Accuracy | Other Comments |
| --- | --- | --- | --- |
| Naive INT8 SMOOTHQUANT | 8.56 | 52.47 | Using the mtq.quantize API; LM Head is not quantized |
| AutoQuantize, no manual setting | 9.0 | 58.84 | AutoQuantize quantizes LM Head |
| AutoQuantize, LM Head disabled explicitly | 9.0 | 58.63 | LM Head quantization disabled by setting disabled_layers=["*lm_head*"] |

Additional Information

@shreyas Misra

Summary by CodeRabbit

  • New Features

    • Named/string quantization formats with display names and disabled-layers support to exclude layers from automatic quantization.
  • Improvements

    • Added a no-quant default option, per-layer recipe choices with deduplication, and search reporting using an effective-bits constraint for size/accuracy tradeoffs.
  • Tests

    • Unit tests updated to reflect naming, per-layer choices, and effective-bits validations.

@realAsma realAsma requested a review from a team as a code owner September 17, 2025 20:03

coderabbitai bot commented Sep 17, 2025

Walkthrough

QuantRecipe and AutoQuantize were redesigned: quant formats can be strings/dicts/None with derived names, per-layer recipe choices replace a single global recipe, disabled-layer filtering was added, recipes are deduplicated, and constraints switched to an effective_bits model; scoring, costs, and memory/gradient handling were updated.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Quantization algorithms & recipes<br>modelopt/torch/quantization/algorithms.py | Reworked QuantRecipe to accept str, dict, or None quant configs with derived display names; added per-layer recipe choices, recipe deduplication, disabled-layer filtering, and the effective-bits constraint model; updated scoring, costs, and memory/gradient handling. |
| Model-level auto-quantize orchestration<br>modelopt/torch/quantization/model_quant.py | Converts quantization formats into (config, name) pairs using QuantRecipe.get_auto_name_for_config (assigns CUSTOM_{i} with a warning if unnamed), passes disabled_layers into the searcher, and removes the later manual disabling. |
| Unit tests for autoquant<br>tests/unit/torch/quantization/test_autoquant.py | Tests updated to the new QuantRecipeHparam ctor (no original=), parameterization expanded (includes min_bits/search_bits), constraints now use effective_bits and assert bounds, TransformerBlock projection recipe expectations adjusted, and an obsolete test removed. |
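For illustration, the (config, name) preprocessing described for model_quant.py can be read roughly as the hypothetical sketch below; the real helper is QuantRecipe.get_auto_name_for_config and its exact behavior may differ.

```python
# Hypothetical sketch of the naming behavior described above (not the real code).
def name_formats(quantization_formats):
    named = []
    for i, fmt in enumerate(quantization_formats):
        if fmt is None:
            named.append((None, "NONE"))        # the no-quant option
        elif isinstance(fmt, str):
            named.append((fmt, fmt))            # string formats keep their own name
        else:
            named.append((fmt, f"CUSTOM_{i}"))  # unnamed dict configs fall back to CUSTOM_{i}
    return named

print(name_formats(["INT8_SMOOTHQUANT_CFG", {"quant_cfg": {"*weight_quantizer": {"num_bits": 4}}}]))
```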

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant Model as model_quant.auto_quantize
  participant Searcher as AutoQuantizeSearcher
  participant Recipe as QuantRecipe

  User->>Model: auto_quantize(formats, disabled_layers, constraints)
  Model->>Recipe: get_auto_name_for_config(cfg) for each format
  Model->>Searcher: pass [(cfg,name),...], disabled_layers, constraints
  activate Searcher
  Searcher->>Searcher: build QuantRecipe objects (include QuantRecipe(None)), dedupe & name
  Searcher->>Searcher: assign per-module QuantRecipeHparam(choices) (merge rules)
  Searcher->>Searcher: apply disabled_layers → replace choices with no-quant where matched
  Searcher->>Searcher: per-recipe cost/score estimation, memory & gradient management
  Searcher->>Searcher: evaluate effective_bits constraint, select best recipe set
  Searcher-->>Model: return best quantization plan (includes effective_bits)
  deactivate Searcher
  Model-->>User: return quantized model/results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through configs, name each one neat,
Strings, dicts, and none — a tasty treat.
Per-layer choices snug in my lair,
Disabled brambles fenced with care.
I count the bits and thump — the search is complete! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 40.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed — The PR title accurately and specifically summarizes the main changes (restoring no-quant skipping in AutoQuantize, adding disabled_layers to the search, and producing cleaner names for search configs), which matches the objectives and diffs; it is a single clear sentence tied to the changes and not misleading. It is slightly long but remains concise and informative for repository history.
  • Description Check ✅ Passed — Check skipped because CodeRabbit's high-level summary is enabled.

@realAsma
Contributor Author

See the search output from Shreyas: (All layers are quantized with INT8 after search)


Result - Optimal solution found

Objective value:                0.00727525
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00

AutoQuantize best recipe for model.layers.0.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.0.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.0.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.0.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.1.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.1.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.1.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.1.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.2.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.2.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.2.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.2.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.3.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.3.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.3.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.3.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.4.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.4.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.4.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.4.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.5.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.5.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.5.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.5.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.6.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.6.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.6.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.6.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.7.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.7.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.7.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.7.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.8.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.8.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.8.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.8.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.9.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.9.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.9.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.9.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.10.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.10.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.10.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.10.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.11.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.11.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.11.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.11.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.12.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.12.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.12.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.12.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.13.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.13.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.13.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.13.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.14.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.14.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.14.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.14.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.15.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.15.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.15.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.15.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.16.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.16.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.16.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.16.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.17.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.17.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.17.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.17.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.18.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.18.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.18.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.18.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.19.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.19.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.19.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.19.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.20.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.20.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.20.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.20.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.21.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.21.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.21.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.21.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.22.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.22.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.22.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.22.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.23.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.23.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.23.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.23.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.24.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.24.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.24.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.24.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.25.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.25.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.25.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.25.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.26.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.26.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.26.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.26.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.27.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.27.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.27.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.27.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.28.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.28.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.28.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.28.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.29.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.29.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.29.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.29.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.30.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.30.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.30.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.30.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.self_attn.q_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.self_attn.o_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.mlp.gate_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for model.layers.31.mlp.down_proj: quantization_formats[0]:effective-bits-8.0
AutoQuantize best recipe for lm_head: quantization_formats[0]:effective-bits-8.0


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
modelopt/torch/quantization/algorithms.py (5)

304-312: Make recipe ordering deterministic; avoid set-induced nondeterminism.

Using set(...) here can yield non-deterministic ordering across ranks/runs when multiple formats share the same compression. Sort is stable only w.r.t. the input sequence, but a set destroys that. This can lead to per-rank differences in formats[] ordering even if the final DP/TP sync masks some outcomes.

Refactor to stable de-duplication and deterministic sort with a tie-breaker:

     @staticmethod
     def _get_search_recipes(quantization_formats):
-        return sorted(
-            set(
-                [
-                    QuantRecipe(quant_cfg=q, quant_format_idx=i)
-                    for i, q in enumerate(quantization_formats)
-                ]
-                + [QuantRecipe(quant_cfg=None, quant_format_idx=None)]
-            )
-        )
+        # Build in declared order and append the "no-quant" recipe
+        recipes = [
+            QuantRecipe(quant_cfg=q, quant_format_idx=i)
+            for i, q in enumerate(quantization_formats)
+        ]
+        recipes.append(QuantRecipe(quant_cfg=None, quant_format_idx=None))
+
+        # Stable de-duplication via hash/eq while preserving first occurrence
+        uniq, seen = [], set()
+        for r in recipes:
+            if r not in seen:
+                uniq.append(r)
+                seen.add(r)
+
+        # Deterministic order: primary by compression, tie-break by string representation
+        return sorted(uniq, key=lambda r: (r.compression, str(r)))

341-343: Outdated TODO contradicts this PR.

Comment says “remove the no-quant recipe” but this PR intentionally adds/retains it. Remove to avoid confusion.

-        # TODO: remove the no-quant recipe

103-106: Clearer label for the no-quant recipe in logs.

“quantization_formats[None]” is a bit opaque. Prefer “no-quant”.

-        self.str_repr = (
-            f"quantization_formats[{quant_format_idx}]:effective-bits-{self.compression * 16}"
-        )
+        label = (
+            "no-quant" if quant_format_idx is None else f"quantization_formats[{quant_format_idx}]"
+        )
+        self.str_repr = f"{label}:effective-bits-{self.compression * 16}"

161-164: Redundant None-recipe injection; also suffers from set() nondeterminism.

None is added here and in _get_search_recipes(). Keeping both is harmless due to de-dup, but it’s redundant and this set() also scrambles order. Either rely on a single source (preferred), or make this block deterministic too.

-        choices = sorted(set(choices) | {QuantRecipe(quant_cfg=None)})
+        # Ensure "no-quant" is present while preserving declared order
+        tmp = list(choices) + [QuantRecipe(quant_cfg=None)]
+        uniq, seen = [], set()
+        for r in tmp:
+            if r not in seen:
+                uniq.append(r)
+                seen.add(r)
+        choices = sorted(uniq, key=lambda r: (r.compression, str(r)))

121-127: Safer equality: return NotImplemented on type mismatch.

Asserting the type inside `__eq__` can raise unexpectedly during hashing/collections operations. Return NotImplemented when other is not a QuantRecipe.

-    def __eq__(self, other: object):
-        assert isinstance(other, QuantRecipe)
-        return self.config == other.config
+    def __eq__(self, other: object):
+        if not isinstance(other, QuantRecipe):
+            return NotImplemented
+        return self.config == other.config
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 682bf6d and 586766f.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/algorithms.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait

@shengliangxu

shengliangxu commented Sep 17, 2025

This "fix", if it really does fix, would be strange because we always put No quant in the choices as in:

choices = sorted(set(choices) | {QuantRecipe(quant_cfg=None)})

Something else is not right. We should dig into it. Where is the link to the issue? I'll check.

@realAsma
Contributor Author

This "fix", if it really does fix, would be strange because we always put No quant in the choices as in:

choices = sorted(set(choices) | {QuantRecipe(quant_cfg=None)})

Something else is not right. We should dig into it. Where is the link to the issue? I'll check.

I see. The actual searched configs do not have None quantization. See

search_recipes = self._get_search_recipes(self.config["quantization_formats"])

and
def _get_search_recipes(quantization_formats):

@realAsma
Contributor Author

@shengliangxu let me actually search over the hparam choices instead of self._get_search_recipes. The problem here is that the search is performed over the output of _get_search_recipes rather than over hparam.choices. This will also let us easily disable the search for disabled_layers (instead of disabling them after the search).

@shengliangxu

@shengliangxu let me actually search over the hparam choices instead of self._get_search_recipes. The problem here is that the search is performed over the output of _get_search_recipes rather than over hparam.choices. This will also let us easily disable the search for disabled_layers (instead of disabling them after the search).

OK, I read this part of the code again, and the change here seems correct. We need the enumeration to include None so that no-quant is part of the search.

I also searched the git history to see how this happened. We previously added None to the quantization_formats arg in most code paths, especially in the hf_ptq.py script, which is why we did not hit this issue. But there were already code paths that did not add None, and apparently we never verified no-quant behavior on those paths.
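A schematic of the two call patterns being discussed, for illustration only (this is not code from hf_ptq.py):

```python
# Before this PR: skipping was only searchable when the caller listed None explicitly.
formats_with_skip = ["INT8_SMOOTHQUANT_CFG", None]

# Code paths that omitted None searched quantized recipes only, so every layer
# ended up quantized (the bug reported above).
formats_without_skip = ["INT8_SMOOTHQUANT_CFG"]

# After this PR, the searcher appends the no-quant recipe itself, so the second
# list also lets AutoQuantize leave accuracy-sensitive layers unquantized.
```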

@realAsma realAsma force-pushed the asma/bug_fix_aq_none_default branch from 586766f to d979c5d Compare September 18, 2025 18:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
modelopt/torch/quantization/algorithms.py (5)

67-76: Clamp tuple num_bits compression to [0, 1].

For tuple num_bits you return (sum(bits)+1)/16 which can exceed 1.0 (e.g., (8,8) -> 17/16). That violates the stated 0..1 range and can inflate costs. Clamp and drop the +1.

Apply this diff:

-            if isinstance(quantizer_attr_cfg.num_bits, tuple):
-                return (sum(quantizer_attr_cfg.num_bits) + 1) / 16
+            if isinstance(quantizer_attr_cfg.num_bits, tuple):
+                # Approximate combined-bit formats; keep within [0, 1]
+                ratio = sum(quantizer_attr_cfg.num_bits) / 16
+                return max(0.0, min(1.0, ratio))

183-186: Default active recipe should be “no‑quant” to match docs.

You sort choices and set original to choices[0] (most aggressive). Given we “add no‑quant by default”, consider making NONE the original for clearer semantics and safer defaults.

Apply this diff:

-        super().__init__(choices, original=choices[0])
+        # Prefer no‑quant as the original/default when present
+        no_quant = next((c for c in choices if c.compression >= 1.0), choices[-1])
+        super().__init__(choices, original=no_quant)

193-211: Potential heavy re-instantiation of quantizers per recipe.

Creating fresh TensorQuantizer instances for every (module, recipe) can be memory-heavy on large models. Consider lazy materialization on first activation or pooling instances.


367-368: Prefer driving via the hparam API to avoid attribute shadowing.

Directly assigning module.quant_recipe could shadow the registered Hparam. Use the Hparam to set NONE explicitly.

Apply this diff:

-            module.quant_recipe = QuantRecipe(quant_cfg=None)
+            # Switch to no‑quant via the registered hparam
+            qr_hp = module.get_hparam("quant_recipe")
+            none_recipe = next(c for c in qr_hp.choices if c.compression >= 1.0)
+            qr_hp.active = none_recipe

720-725: Minor: remove leading space in formatted effective bits.

Cosmetic nit: f"{x: .2f}" adds a leading space for positives.

Apply this diff:

-                f"AutoQuantize effective bits from search: {effective_bits_from_search: .2f}"
+                f"AutoQuantize effective bits from search: {effective_bits_from_search:.2f}"
modelopt/torch/quantization/model_quant.py (1)

409-422: Preprocess formats with stable display names — good; consider aligning defaults.

The tuple (cfg, name) handoff is right. Minor: default formats here use NVFP4_AWQ_LITE_CFG while the searcher’s default uses NVFP4_DEFAULT_CFG; align to reduce surprises.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d979c5d and 9e43a0f.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/algorithms.py (19 hunks)
  • modelopt/torch/quantization/model_quant.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
modelopt/torch/quantization/model_quant.py (1)
modelopt/torch/quantization/algorithms.py (2)
  • QuantRecipe (84-166)
  • get_auto_name_for_config (119-128)
modelopt/torch/quantization/algorithms.py (6)
modelopt/torch/quantization/config.py (2)
  • QuantizeConfig (1114-1128)
  • QuantizerAttributeConfig (649-943)
modelopt/torch/quantization/conversion.py (1)
  • set_quantizer_by_cfg (206-237)
modelopt/torch/quantization/nn/modules/quant_module.py (2)
  • QuantLinearConvBase (129-169)
  • QuantModule (37-96)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (3)
  • SequentialQuantizer (1146-1254)
  • TensorQuantizer (60-1143)
  • enable (400-402)
modelopt/torch/quantization/utils.py (1)
  • is_quantized_linear (246-256)
modelopt/torch/trace/symbols.py (1)
  • named_modules (444-447)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (9)
modelopt/torch/quantization/algorithms.py (7)

110-112: KV output quantizer forced off — good call.

This ensures monotonic weight compression vs. accuracy across recipes and avoids skewing search.


116-129: Auto‑naming and explicit NONE support look solid.

Deriving names (including "NONE") simplifies UX and logging.


327-336: Recipe enumeration with dedup is correct and robust.

Accepting (cfg, name) tuples plus strings and deduping via QuantRecipe looks good.


468-515: disabled_layers integration LGTM.

Pattern-based gating before hparam insertion is clean and fixes the earlier mismatch where disabling happened post-search.


628-631: Constraint translation is correct.

effective_bits → weight_size_after_compression upper bound mapping is consistent.


636-655: Use of hparam.choices in scoring fixes the root cause.

Iterating the per-layer choices (including NONE) resolves the “no‑quant missing from search” bug.


321-326: Incorrect — QuantLinearConvBase already inherits QuantModule via QuantInputBase
QuantLinearConvBase(QuantInputBase) is declared in modelopt/torch/quantization/nn/modules/quant_module.py:129; QuantInputBase(QuantModule) at :102; QuantModule at :37 — the predicate "(is_quantized_linear(module) or isinstance(module, QuantLinearConvBase)) and isinstance(module, QuantModule)" will include QuantLinearConvBase instances.

Likely an incorrect or invalid review comment.

modelopt/torch/quantization/model_quant.py (2)

34-35: Importing QuantRecipe for naming — good integration.

Keeps display-name derivation centralized with the recipe logic.


438-440: Passing disabled_layers into the searcher fixes the previous post‑hoc disable.

Nice cleanup and consistent with new enumeration.

@realAsma realAsma changed the title Bug fix for AutoQuantize: Add None quantization by default Bug fix for AutoQuantize: Add quantization skipping back; Add disable_layers to the search; cleaner names for search configs Sep 18, 2025
@realAsma realAsma force-pushed the asma/bug_fix_aq_none_default branch from 9e43a0f to 892d293 Compare September 18, 2025 19:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/quantization/algorithms.py (2)

366-391: Use Hparam.active to switch recipes during scoring (avoid assigning plain attribute).

module.quant_recipe = ... relies on dynamic attribute plumbing and recreates a new recipe instance. Switching via get_hparam("quant_recipe").active is clearer and guarantees the quantizer set is swapped using the prebuilt choices. Also reuse the existing “no‑quant” choice.

 def _estimate_auto_quantize_scores(self):
     # TODO: remove the no-quant recipe
     def auto_quantize_score_estimate_forward(module, input, *args, **kwargs):
-        module.quant_recipe = QuantRecipe(quant_cfg=None)
+        hparam = module.get_hparam("quant_recipe")
+        # Select the existing no-quant choice (compression >= 1.0).
+        no_quant = next(r for r in hparam.choices if r.compression >= 1.0)
+        hparam.active = no_quant
         output = module._forward_original(input, *args, **kwargs)
@@
         module.output_diff_dict = {}
         with torch.no_grad():
-            for recipe in module.get_hparam("quant_recipe").choices:
+            for recipe in hparam.choices:
                 if recipe.compression >= 1.0:
                     continue
-                module.quant_recipe = recipe
+                hparam.active = recipe
                 output_diff = module._forward_original(input, *args, **kwargs)

639-655: Monotonic score capping bug — cap isn’t carried forward.

prev_score is set to the raw score, not the capped value, so a later item can increase again. Use the capped value to enforce non‑increasing scores across recipes.

             scores.append(min(score, prev_score))
             costs.append(cost)
-            prev_score = score
+            prev_score = min(score, prev_score)
🧹 Nitpick comments (7)
modelopt/torch/quantization/algorithms.py (7)

116-117: Stabilize string representation (format effective bits).

Printing raw floats can produce long/uneven strings and affect hashing/equality if formatting ever changes. Format to 2 decimals (consistent with later logs).

-        self._str_repr: str = f"{name}(effective-bits: {self.compression * 16})"
+        self._str_repr: str = f"{name}(effective-bits: {self.compression * 16:.2f})"

144-150: Make `__eq__` robust for non-QuantRecipe comparisons.

assert isinstance(other, QuantRecipe) can raise unexpectedly; instead return NotImplemented for foreign types.

-    def __eq__(self, other: object):
-        assert isinstance(other, QuantRecipe)
-        return self._str_repr == other._str_repr
+    def __eq__(self, other: object):
+        if not isinstance(other, QuantRecipe):
+            return NotImplemented
+        return self._str_repr == other._str_repr

118-129: Naming: prefer “None” (title case) and validate mtq_config.choices presence.

Minor: return “None” instead of “NONE” for consistency with printed examples; also, mtq_config.choices must exist and enumerate valid format symbols.

-        if quant_cfg is None:
-            return "NONE"
+        if quant_cfg is None:
+            return "None"

Please confirm modelopt/torch/quantization/config.py (or its init) defines choices as the canonical list of format names; otherwise this loop may miss matches.


110-113: Scope of disabling output quantizer may be too broad.

self.config.quant_cfg["*output_quantizer"] = QuantizerAttributeConfig(enable=False) disables all output quantizers, not only KV cache. If the intent is to disable only KV cache quantization, consider scoping this with a narrower wildcard (e.g., to KV modules) or gate by a flag.


179-185: Default ‘original’ choice becomes the most aggressive recipe. Intentional?

Because choices are sorted by compression, original=choices[0] selects the most aggressive (smallest bits). If you intended the baseline to be no‑quant for reference, consider setting original to the no‑quant choice.

Example tweak:

-        super().__init__(choices, original=choices[0])
+        # Prefer no-quant as original if present; else fallback to first.
+        no_quant = next((c for c in choices if c.compression >= 1.0), choices[0])
+        super().__init__(choices, original=no_quant)

401-410: Enabling grad and hooks for all parameters is heavy.

You note this TODO; scoping to the first parameter per module (or only the AQ-target modules) will reduce overhead notably on large LLMs.


689-699: Non-optimal LP: guard downstream usage and report effective bits meaningfully.

If status isn’t Optimal, selections might be partial or undefined. Consider skipping best-recipe application and compute effective bits from the constraint target instead, or clearly log that the value is undefined.

Also applies to: 720-728

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e43a0f and 892d293.

📒 Files selected for processing (3)
  • modelopt/torch/quantization/algorithms.py (19 hunks)
  • modelopt/torch/quantization/model_quant.py (3 hunks)
  • tests/unit/torch/quantization/test_autoquant.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • modelopt/torch/quantization/model_quant.py
  • tests/unit/torch/quantization/test_autoquant.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/algorithms.py (6)
modelopt/torch/quantization/config.py (2)
  • QuantizeConfig (1114-1128)
  • QuantizerAttributeConfig (649-943)
modelopt/torch/quantization/conversion.py (1)
  • set_quantizer_by_cfg (206-237)
modelopt/torch/quantization/nn/modules/quant_module.py (2)
  • QuantLinearConvBase (129-169)
  • QuantModule (37-96)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (3)
  • SequentialQuantizer (1146-1254)
  • TensorQuantizer (60-1143)
  • enable (400-402)
modelopt/torch/quantization/utils.py (1)
  • is_quantized_linear (246-256)
modelopt/torch/utils/logging.py (1)
  • print_rank_0 (92-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (1)
modelopt/torch/quantization/algorithms.py (1)

586-589: Double-check ModeloptStateManager.pop() semantics.

ModeloptStateManager(self.model).state_dict().pop() without a key assumes a list-like object. If this returns a dict, it will raise. Please confirm the API returns a poppable stack here.


codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 90.90909% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.84%. Comparing base (d94fc1b) to head (892d293).
⚠️ Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/algorithms.py | 91.80% | 5 Missing ⚠️ |
| modelopt/torch/quantization/model_quant.py | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #332      +/-   ##
==========================================
+ Coverage   73.82%   73.84%   +0.02%     
==========================================
  Files         172      172              
  Lines       17438    17455      +17     
==========================================
+ Hits        12873    12890      +17     
  Misses       4565     4565              


@realAsma realAsma merged commit be95a10 into main Sep 18, 2025
22 checks passed
@realAsma realAsma deleted the asma/bug_fix_aq_none_default branch September 18, 2025 22:57