[AWQ] Add option to consider smooth layer quantization in scale search #2323

Open
Ramshankar07 wants to merge 16 commits into vllm-project:main from Ramshankar07:awq-smooth-layer-quantization

Conversation

@Ramshankar07

SUMMARY

This PR adds an option to take smooth layer quantization into account when computing AWQ scales, e.g. for the up_proj → down_proj mapping in transformer FFN blocks.

This is the PR for #2296.

Problem

AWQ picks scale factors by minimizing quantization error only for the balance layer (e.g. down_proj) and applies an inverse rescale to the smooth layer (e.g. up_proj), which is usually assumed unquantized. When both are quantization targets, the scale chosen for the balance layer can worsen quantization of the smooth layer because the smooth layer’s quantization error is not included in the objective.
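The asymmetry described above can be seen in a toy NumPy sketch (illustrative only, not llm-compressor code): with a skewed per-channel scale over the shared intermediate dimension, the balance layer's error is the only quantity AWQ's default objective measures, while the smooth layer's stored weights W/s can pick up substantially more quantization error unnoticed.

```python
# Toy sketch (illustrative only, not llm-compressor code) of the problem:
# AWQ's objective measures error only on the balance layer (W_b * s), while
# the smooth layer's stored weights become W_s / s and can quantize worse.
import numpy as np

def q4(w):
    """Symmetric round-to-nearest "int4" quantization with one per-tensor step."""
    step = np.abs(w).max() / 7.0
    return np.clip(np.round(w / step), -8, 7) * step

def mse(a, b):
    return float(((a - b) ** 2).mean())

rng = np.random.default_rng(0)
w_smooth = rng.normal(size=(32, 32))   # stands in for up_proj.weight
w_balance = rng.normal(size=(32, 32))  # stands in for down_proj.weight

# Skewed per-channel scales over the shared intermediate dimension.
s = np.geomspace(0.25, 4.0, 32)

# Balance layer: columns scaled by s. This is the only error AWQ measures.
err_balance = mse(q4(w_balance * s), w_balance * s)
# Smooth layer: rows scaled by 1/s, widening its dynamic range.
err_smooth_scaled = mse(q4(w_smooth / s[:, None]), w_smooth / s[:, None])
err_smooth_plain = mse(q4(w_smooth), w_smooth)
# err_smooth_scaled exceeds err_smooth_plain, but the default objective
# never sees it.
```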

Solution

  • A smooth_layer_quantization flag (default: False) on AWQModifier. When it is enabled and the smooth layer is in the quantization target list:
  • Ancestor used for forward / error: the module used for running samples (and thus for measuring quantization error) is chosen via get_lowest_common_ancestor_with_avoid over both the balance layer names and the smooth layer name.
  • Scale search: _compute_best_scale is called with smooth_layer_targeted=True. The smooth layer is added to orig_layer_weights, and during the grid search it is rescaled by 1/s and quantized to Q(W/s), so its error contributes to the search objective.
  • Applying scales: after the best scale is chosen, the inverse rescale is applied to the smooth layer (and balance layers) in the existing _smooth logic. When the smooth layer is in orig_layer_weights, the rescale is applied from orig_layer_weights[smooth_layer], so the stored weights are W/s for later calibration and the quantized grid-search weights are not rescaled a second time.
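The extended scale search can be sketched as below. All names here (compute_best_scale, quantize_rtn, the loss function) are illustrative stand-ins, not the actual llm-compressor API: the real _compute_best_scale runs calibration samples through a common-ancestor module and compares activations, which this sketch reduces to weight-space MSE for brevity.

```python
# Hedged sketch of a grid search that optionally folds the smooth layer's
# quantization error Q(W/s) into the objective alongside the balance layers'
# Q(W*s). Function and variable names are illustrative.
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.clip(np.round(w / step), -qmax - 1, qmax) * step

def compute_best_scale(x_mean, balance_weights, smooth_weight=None,
                       smooth_layer_targeted=False, n_grid=20):
    best_ratio, best_loss, best_scales = -1.0, float("inf"), None
    for i in range(n_grid):
        ratio = i / n_grid
        # AWQ-style candidate: activation-magnitude based, normalized scales.
        scales = np.clip(x_mean, 1e-4, None) ** ratio
        scales = scales / np.sqrt(scales.max() * scales.min())
        loss = 0.0
        for w in balance_weights:                # balance layers: W * s
            ws = w * scales
            loss += float(((quantize_rtn(ws) - ws) ** 2).mean())
        if smooth_layer_targeted and smooth_weight is not None:
            ws = smooth_weight / scales[:, None]  # smooth layer: W / s
            loss += float(((quantize_rtn(ws) - ws) ** 2).mean())
        if loss < best_loss:
            best_ratio, best_loss, best_scales = ratio, loss, scales
    return best_scales, best_ratio, best_loss
```

Since the smooth-layer term is nonnegative and is added on top of the same per-candidate balance loss, enabling it can only shift which candidate wins, trading a little balance-layer error for a smooth layer that quantizes acceptably.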

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Ramshankar07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the AWQ (Activation-aware Weight Quantization) algorithm by introducing an option to account for the quantization of "smooth layers" alongside "balance layers." Previously, AWQ primarily optimized scales for balance layers, assuming smooth layers were unquantized. This change addresses scenarios where both types of layers are quantized, preventing potential degradation of the smooth layer's quantization quality by including its error in the scale search objective. The modification ensures that the scale factors are chosen to minimize quantization error across all targeted layers, leading to more robust and accurate quantized models.

Highlights

  • New Configuration Option: Introduced a new smooth_layer_quantization flag in AWQModifier to enable considering the smooth layer during AWQ scale computation, addressing scenarios where both smooth and balance layers are targeted for quantization.
  • Enhanced Ancestor Search: Updated the common ancestor search logic (get_lowest_common_ancestor_with_avoid) to include the smooth layer when smooth_layer_quantization is enabled and the smooth layer is a quantization target, ensuring a more relevant common parent for forward passes.
  • Improved Scale Computation: Modified the scale computation (_compute_best_scale) to incorporate the smooth layer's quantization error into the objective function during grid search, leading to more optimal scales when the smooth layer is also quantized.
  • Correct Weight Handling: Ensured correct weight handling for the smooth layer during the final smoothing application (_smooth), preventing double rescaling by using original weights when smooth_layer_quantization is active.
  • Code Refactoring: Refactored the balance layer quantization logic within the grid search into a dedicated helper method, _apply_balance_layer_quantization_in_grid_search, for improved code clarity and maintainability.
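The double-rescale hazard mentioned in the weight-handling highlight can be sketched as follows (names like apply_smoothing and orig_layer_weights used here are illustrative, not the actual llm-compressor internals): if the grid search leaves the live smooth-layer tensor already at W/s, the final smoothing pass must start from the stored original weights rather than divide the mutated tensor by s again.

```python
# Hedged sketch: avoid double rescaling by always applying the final scale
# to a pristine copy of the smooth layer's weights. Names are illustrative.
import numpy as np

def apply_smoothing(live_weight, orig_layer_weights, name, scales):
    """Return W_orig / s, falling back to the live tensor if no copy exists."""
    base = orig_layer_weights.get(name, live_weight)
    return base / scales[:, None]

rng = np.random.default_rng(2)
w = rng.normal(size=(8, 8))
s = np.geomspace(0.5, 2.0, 8)
orig = {"up_proj": w.copy()}       # stored before the grid search mutates anything

# Suppose the grid search already left the live tensor at W / s:
mutated = w / s[:, None]

correct = apply_smoothing(mutated, orig, "up_proj", s)  # W / s, as intended
buggy = mutated / s[:, None]                            # W / s**2: double rescale
```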


@Ramshankar07 Ramshankar07 marked this pull request as draft January 31, 2026 18:27
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an option to account for smooth layer quantization during the AWQ scale search, which is a valuable addition for scenarios where both smooth and balance layers are quantized. The implementation is well-structured, particularly with the extraction of logic into the _apply_balance_layer_quantization_in_grid_search helper function. I've identified a few areas for improvement: a performance optimization in _apply_smoothing, a simplification of conditional logic in _compute_best_scale, and a potential bug fix related to in-place tensor modification during the grid search. Overall, this is a solid contribution.

@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch 2 times, most recently from 5a6ec4a to e921cb8 on February 1, 2026 01:34
@mergify
Contributor

mergify bot commented Feb 2, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Collaborator

@HDCharles HDCharles left a comment


Thanks for your contribution! Overall looks good, but see comments.

Otherwise, we should run some of the eval tests and see the results; let me know if you need help with that.

@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from e921cb8 to 0056712 on February 3, 2026 18:25
@mergify mergify bot removed the quality-failed label Feb 3, 2026
@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from 0056712 to 5d2db17 on February 3, 2026 18:27
@Ramshankar07
Author

Ramshankar07 commented Feb 3, 2026

I'll run lm_eval and update by tomorrow.

@mergify
Contributor

mergify bot commented Feb 4, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from f86d512 to 0a09182 on February 4, 2026 18:08
@mergify mergify bot removed the quality-failed label Feb 4, 2026
@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from 0a09182 to 4ebb35d on February 4, 2026 18:10
@HDCharles HDCharles self-requested a review February 25, 2026 15:14
@HDCharles
Collaborator

I'll take a look and see; we're trying to get a release out right now though, so it may be a day or two. Also, can you reach out to me on vLLM Slack under the same username?

@Ramshankar07
Author

Sure.

@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from 8265757 to 9addbbe on February 28, 2026 02:43
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 28, 2026
@dsikka dsikka added ready When a PR is ready for review awq For any issue / PR related to AWQ support labels Mar 2, 2026
@mergify
Contributor

mergify bot commented Mar 2, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Ramshankar07 and others added 10 commits March 2, 2026 19:52
@Ramshankar07 Ramshankar07 force-pushed the awq-smooth-layer-quantization branch from f5b9f29 to 546b4dc on March 3, 2026 00:53
@HDCharles HDCharles assigned HDCharles and unassigned HDCharles Mar 9, 2026
HDCharles and others added 3 commits March 10, 2026 02:01
