
fix: add norm calibration context for unit-offset RMSNorm (Gemma/Qwen3Next)#2500

Merged
HDCharles merged 3 commits into vllm-project:main from Yatimai:fix/awq-gemma-rmsnorm
Mar 25, 2026

Conversation

Yatimai (Contributor) commented Mar 21, 2026

SUMMARY

Some architectures (Gemma, Gemma2, Gemma3, Qwen3Next) use an offset normalization in which the forward computes output * (1 + weight) instead of output * weight. This breaks any modifier that smooths norm weights (AWQ, SmoothQuant, SpinQuant, QuIP), because dividing the stored (1 + weight) parameter by the smoothing scales produces an effective weight of 1 + weight/scales instead of the intended (1 + weight)/scales.
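
As a concrete illustration of the mismatch (the values below are made up for this sketch and are not from the PR):

```python
import torch

# Illustrative values only: show why dividing a Gemma-style stored
# parameter by smoothing scales in place gives the wrong result.
weight = torch.tensor([0.5, -0.25, 1.0])  # stored offset-norm parameter
scales = torch.tensor([2.0, 4.0, 0.5])    # per-channel smoothing scales

# Naive in-place smoothing divides the stored parameter by the scales,
# which the (1 + weight) forward then interprets as 1 + weight/scales.
naive = 1 + weight / scales

# The intended effective weight is (1 + weight) / scales.
correct = (1 + weight) / scales

print(naive)    # tensor([1.2500, 0.9375, 3.0000])
print(correct)  # tensor([0.7500, 0.1875, 4.0000])
```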

Following @brian-dellabetta's suggestion, this adds a norm_calibration_context that temporarily replaces offset-norm modules with standard-norm equivalents during calibration, following the same pattern as moe_calibration_context. On entry, offset norms are replaced with CalibrationOffsetNorm modules (weight = 1 + original). On exit, modules are restored with updated weights (weight = smoothed - 1).

Only norms operating on hidden_size are converted. Norms operating on head_dim (e.g. q_norm/k_norm in Gemma3 attention) are skipped since no modifier smooths them.
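
A rough sketch of the pattern described above (the class and function names here are illustrative stand-ins, not the PR's actual implementation, and the hidden_size filter is omitted for brevity):

```python
import contextlib

import torch

class OffsetRMSNorm(torch.nn.Module):
    """Gemma-style norm: the forward scales by (1 + weight)."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * (1 + self.weight)

class CalibNorm(torch.nn.Module):
    """Standard-convention stand-in used during calibration."""
    def __init__(self, offset_norm: OffsetRMSNorm):
        super().__init__()
        # On entry: weight = 1 + original, so smoothing this weight
        # directly is mathematically correct.
        self.weight = torch.nn.Parameter(offset_norm.weight.detach() + 1)
        self.eps = offset_norm.eps

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * self.weight

@contextlib.contextmanager
def norm_calibration_context(model: torch.nn.Module):
    # Swap offset norms for standard-convention stand-ins on entry,
    # restore the originals with updated weights on exit.
    replaced = {}
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, OffsetRMSNorm):
                replaced[(parent, name)] = child
                setattr(parent, name, CalibNorm(child))
    try:
        yield model
    finally:
        for (parent, name), original in replaced.items():
            calib = getattr(parent, name)
            with torch.no_grad():
                # On exit: weight = smoothed - 1 restores the offset
                # convention while keeping the applied smoothing.
                original.weight.copy_(calib.weight - 1)
            setattr(parent, name, original)
```

Because the stand-in's weight equals 1 + original on entry, calibration-time forward outputs match the original module exactly, and any smoothing applied to the stand-in's weight survives the exit conversion.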

TEST PLAN

Unit tests (8/8 passing):

  • Weight conversion and dtype preservation
  • Forward equivalence with original norm
  • Restore roundtrip (with and without smoothing)
  • Registry detection (positive and negative)
  • hidden_size filter: q_norm/k_norm correctly skipped

E2E validation:

| Model | Modifier | Norms converted | Output |
| --- | --- | --- | --- |
| google/gemma-2-2b-it | AWQ W4A16 | 105 | Coherent |
| google/medgemma-27b-text-it | AWQ W4A16 | 249 (373 total, 124 q/k skipped) | Coherent |
| upstream (no fix) on medgemma | AWQ W4A16 | 0 | Garbage |

The Qwen3-Next architecture was verified structurally: hidden_size=2048, head_dim=256, and Qwen3NextRMSNorm uses the same (1+weight) pattern. No smaller Qwen3-Next model exists for e2e testing (the only released model is an 80B MoE).

Fixes #2365
Fixes #2102
Related to #2202
Related to #2059

gemini-code-assist (bot) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue in the AWQ quantization process that specifically affected Gemma models. It introduces a specialized handling for Gemma's unique RMSNorm implementation, which computes output * (1 + weight). By correctly adjusting the norm weights during the smoothing phase, the PR ensures that Gemma models can be properly quantized with AWQ, preventing the generation of garbage output and maintaining model coherence.

Highlights

  • Corrected Gemma RMSNorm Handling: Implemented a specific weight transformation for Gemma-family RMSNorms during AWQ smoothing to correctly account for their (1 + weight) computation, which was previously causing incorrect scaling.
  • Unit-Offset Norm Detection: Introduced a mechanism to identify Gemma, Gemma2, and Gemma3 RMSNorms by their class names, as these models use a unit-offset convention.
  • Validation and Regression Testing: Verified the fix on google/medgemma-27b-text-it (Gemma3), demonstrating coherent output with the fix, and confirmed no regression on Qwen2.5-0.5B.
  • New Unit Tests: Added unit tests for the _uses_unit_offset_norm function to ensure accurate detection of unit-offset norms and proper differentiation from standard norms.



github-actions commented

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a well-implemented fix for handling unit-offset RMSNorm in Gemma models during AWQ smoothing. The logic to detect and apply the correct scaling transformation for Gemma-family norms is sound and cleanly encapsulated. The addition of unit tests is also a valuable contribution. I've included one suggestion to enhance the test coverage for the new functionality.

brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, thanks for taking an initial stab at this; it has been on the docket, but we haven't had a chance to open a PR yet. I've left some comments. My main concern is that this fix is specific to AWQ, whereas this custom norm forward implementation would break other modifiers as well. A general solution would be preferred.

@brian-dellabetta brian-dellabetta self-assigned this Mar 23, 2026
@Yatimai changed the title from "fix(awq): handle unit-offset RMSNorm in smoothing for Gemma models" to "fix: add norm calibration context for unit-offset RMSNorm (Gemma/Qwen3Next)" Mar 24, 2026
Yatimai (Contributor, Author) commented Mar 24, 2026

Update: reworked the fix based on @brian-dellabetta's review. Instead of patching AWQ only, it is now a norm_calibration_context that handles the (1 + weight) conversion for all modifiers (AWQ, SmoothQuant, SpinQuant, QuIP). Tested e2e on medgemma-27b-text-it; output is coherent after AWQ W4A16.

brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, thanks for this; things are looking really good. I have one nit: if we can patch that, I'll run it on my side before approving and adding reviewers.

@Yatimai force-pushed the fix/awq-gemma-rmsnorm branch from c25f428 to ebde718 on March 24, 2026 15:50
brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, sorry, a few more nits after a second pass. Will try this out now.

…3Next)

Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
@Yatimai force-pushed the fix/awq-gemma-rmsnorm branch from ebde718 to c9eb09d on March 24, 2026 19:07
@brian-dellabetta added the "ready" label (When a PR is ready for review) Mar 24, 2026
brian-dellabetta (Collaborator) left a comment

Thanks @Yatimai for the contribution! This looks great to me. We may ultimately want to merge the MoE and norm calibration contexts, since they share the same interface, but we can leave that for now. Approving, and I will ping the team for a second reviewer.

I was able to run AWQ on google/gemma-2-2b-it on this branch and get reasonable output, whereas on main it's gibberish

Yatimai (Contributor, Author) commented Mar 24, 2026

Thanks @brian-dellabetta for your help!

Jeevi10 commented Mar 25, 2026

Hey, I ran an experiment with GPTQ and AWQ for google/medgemma-27b-text-it. I used W4A16 for both GPTQ and AWQ.

google/medgemma-27b-text-it:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.856 | ± 0.0222 |

AWQ results on the fix/awq-gemma-rmsnorm branch:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.692 | ± 0.0293 |

GPTQ results on the main branch:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.836 | ± 0.0235 |

@Yatimai Please let me know if you want me to do more testing, but it's producing coherent output on fix/awq-gemma-rmsnorm.

@Yatimai Yatimai closed this Mar 25, 2026
@Yatimai Yatimai reopened this Mar 25, 2026
Yatimai (Contributor, Author) commented Mar 25, 2026

Thanks @Jeevi10 for testing, appreciate the validation!

@HDCharles HDCharles merged commit cc6a964 into vllm-project:main Mar 25, 2026
13 of 14 checks passed

Labels

ready: When a PR is ready for review