
fix: add norm calibration context for unit-offset RMSNorm (Gemma/Qwen3Next)#2500

Merged
HDCharles merged 3 commits into vllm-project:main from Yatimai:fix/awq-gemma-rmsnorm
Mar 25, 2026

Conversation

Yatimai (Contributor) commented Mar 21, 2026

SUMMARY

Some architectures (Gemma, Gemma2, Gemma3, Qwen3Next) use an offset normalization in which the forward computes output * (1 + weight) instead of output * weight. This breaks any modifier that smooths norm weights (AWQ, SmoothQuant, SpinQuant, QuIP), because dividing the stored (1 + weight) parameter by the smoothing scales produces an effective weight of 1 + weight/scales instead of the intended (1 + weight)/scales.
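
As a concrete illustration of the mismatch (the values below are made up for this sketch and are not from the PR):

```python
import torch

# Illustrative values only: show why dividing a Gemma-style stored
# parameter by smoothing scales in place gives the wrong result.
weight = torch.tensor([0.5, -0.25, 1.0])  # stored offset-norm parameter
scales = torch.tensor([2.0, 4.0, 0.5])    # per-channel smoothing scales

# Naive in-place smoothing divides the stored parameter by the scales,
# which the (1 + weight) forward then interprets as 1 + weight/scales.
naive = 1 + weight / scales

# The intended effective weight is (1 + weight) / scales.
correct = (1 + weight) / scales

print(naive)    # tensor([1.2500, 0.9375, 3.0000])
print(correct)  # tensor([0.7500, 0.1875, 4.0000])
```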

Following @brian-dellabetta's suggestion, this adds a norm_calibration_context that temporarily replaces offset-norm modules with standard-norm equivalents during calibration, following the same pattern as moe_calibration_context. On entry, offset norms are replaced with CalibrationOffsetNorm modules (weight = 1 + original). On exit, modules are restored with updated weights (weight = smoothed - 1).

Only norms operating on hidden_size are converted. Norms operating on head_dim (e.g. q_norm/k_norm in Gemma3 attention) are skipped since no modifier smooths them.
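
A rough sketch of the pattern described above (the class and function names here are illustrative stand-ins, not the PR's actual implementation, and the hidden_size filter is omitted for brevity):

```python
import contextlib

import torch

class OffsetRMSNorm(torch.nn.Module):
    """Gemma-style norm: the forward scales by (1 + weight)."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * (1 + self.weight)

class CalibNorm(torch.nn.Module):
    """Standard-convention stand-in used during calibration."""
    def __init__(self, offset_norm: OffsetRMSNorm):
        super().__init__()
        # On entry: weight = 1 + original, so smoothing this weight
        # directly is mathematically correct.
        self.weight = torch.nn.Parameter(offset_norm.weight.detach() + 1)
        self.eps = offset_norm.eps

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * self.weight

@contextlib.contextmanager
def norm_calibration_context(model: torch.nn.Module):
    # Swap offset norms for standard-convention stand-ins on entry,
    # restore the originals with updated weights on exit.
    replaced = {}
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, OffsetRMSNorm):
                replaced[(parent, name)] = child
                setattr(parent, name, CalibNorm(child))
    try:
        yield model
    finally:
        for (parent, name), original in replaced.items():
            calib = getattr(parent, name)
            with torch.no_grad():
                # On exit: weight = smoothed - 1 restores the offset
                # convention while keeping the applied smoothing.
                original.weight.copy_(calib.weight - 1)
            setattr(parent, name, original)
```

Because the stand-in's weight equals 1 + original on entry, calibration-time forward outputs match the original module exactly, and any smoothing applied to the stand-in's weight survives the exit conversion.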

TEST PLAN

Unit tests (8/8 passing):

  • Weight conversion and dtype preservation
  • Forward equivalence with original norm
  • Restore roundtrip (with and without smoothing)
  • Registry detection (positive and negative)
  • hidden_size filter: q_norm/k_norm correctly skipped

E2E validation:

| Model | Modifier | Norms converted | Output |
| --- | --- | --- | --- |
| google/gemma-2-2b-it | AWQ W4A16 | 105 | Coherent |
| google/medgemma-27b-text-it | AWQ W4A16 | 249 (373 total, 124 q/k skipped) | Coherent |
| upstream (no fix) on medgemma | AWQ W4A16 | 0 | Garbage |

The Qwen3-Next architecture was verified structurally: hidden_size=2048, head_dim=256, and Qwen3NextRMSNorm uses the same (1+weight) pattern. No smaller Qwen3-Next model exists for e2e testing (the only released model is an 80B MoE).

Fixes #2365
Fixes #2102
Related to #2202
Related to #2059

gemini-code-assist (bot) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue in the AWQ quantization process that specifically affected Gemma models. It introduces a specialized handling for Gemma's unique RMSNorm implementation, which computes output * (1 + weight). By correctly adjusting the norm weights during the smoothing phase, the PR ensures that Gemma models can be properly quantized with AWQ, preventing the generation of garbage output and maintaining model coherence.

Highlights

  • Corrected Gemma RMSNorm Handling: Implemented a specific weight transformation for Gemma-family RMSNorms during AWQ smoothing to correctly account for their (1 + weight) computation, which was previously causing incorrect scaling.
  • Unit-Offset Norm Detection: Introduced a mechanism to identify Gemma, Gemma2, and Gemma3 RMSNorms by their class names, as these models use a unit-offset convention.
  • Validation and Regression Testing: Verified the fix on google/medgemma-27b-text-it (Gemma3), demonstrating coherent output with the fix, and confirmed no regression on Qwen2.5-0.5B.
  • New Unit Tests: Added unit tests for the _uses_unit_offset_norm function to ensure accurate detection of unit-offset norms and proper differentiation from standard norms.



github-actions commented

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a well-implemented fix for handling unit-offset RMSNorm in Gemma models during AWQ smoothing. The logic to detect and apply the correct scaling transformation for Gemma-family norms is sound and cleanly encapsulated. The addition of unit tests is also a valuable contribution. I've included one suggestion to enhance the test coverage for the new functionality.

brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, thanks for taking an initial stab at this; it has been on the docket, but we haven't had a chance to open a PR yet. I've left some comments. My main concern is that this fix is specific to AWQ, whereas this custom norm forward implementation would break other modifiers as well. A general solution would be preferred.

@brian-dellabetta brian-dellabetta self-assigned this Mar 23, 2026
@Yatimai changed the title from "fix(awq): handle unit-offset RMSNorm in smoothing for Gemma models" to "fix: add norm calibration context for unit-offset RMSNorm (Gemma/Qwen3Next)" Mar 24, 2026
Yatimai (Contributor, Author) commented Mar 24, 2026

Update: reworked the fix based on @brian-dellabetta's review. Instead of patching AWQ only, it is now a norm_calibration_context that handles the (1 + weight) conversion for all modifiers (AWQ, SmoothQuant, SpinQuant, QuIP). Tested e2e on medgemma-27b-text-it; output is coherent after AWQ W4A16.

brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, thanks for this; things are looking really good. I have one nit: if we can patch that, I'll run it on my side before approving and adding reviewers.

@Yatimai force-pushed the fix/awq-gemma-rmsnorm branch from c25f428 to ebde718 on March 24, 2026 15:50
brian-dellabetta (Collaborator) left a comment

Hi @Yatimai, sorry, a few more nits after a second pass. Will try this out now.

…3Next)

Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
@Yatimai force-pushed the fix/awq-gemma-rmsnorm branch from ebde718 to c9eb09d on March 24, 2026 19:07
@brian-dellabetta added the "ready" label (When a PR is ready for review) Mar 24, 2026
brian-dellabetta (Collaborator) left a comment

Thanks @Yatimai for the contribution! This looks great to me. We may ultimately want to merge the MoE and norm calibration contexts, since they share the same interface, but we can leave that for now. Approving, and I will ping the team for a second reviewer.

I was able to run AWQ on google/gemma-2-2b-it on this branch and get reasonable output, whereas on main it's gibberish

Yatimai (Contributor, Author) commented Mar 24, 2026

Thanks @brian-dellabetta for your help!

Jeevi10 commented Mar 25, 2026

Hey, I ran an experiment with GPTQ and AWQ for google/medgemma-27b-text-it. I used W4A16 for both GPTQ and AWQ.

google/medgemma-27b-text-it:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.856 | ± 0.0222 |

AWQ results on the fix/awq-gemma-rmsnorm branch:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.692 | ± 0.0293 |

GPTQ results on the main branch:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| careqa_en | 1 | none | 5 | acc | 0.836 | ± 0.0235 |

@Yatimai Please let me know if you want me to do more testing, but it's producing coherent output on fix/awq-gemma-rmsnorm.

@Yatimai Yatimai closed this Mar 25, 2026
@Yatimai Yatimai reopened this Mar 25, 2026
Yatimai (Contributor, Author) commented Mar 25, 2026

Thanks @Jeevi10 for testing, appreciate the validation!

@HDCharles HDCharles merged commit cc6a964 into vllm-project:main Mar 25, 2026
13 of 14 checks passed

Labels

ready: When a PR is ready for review