[AWQ] Add option to consider smooth layer quantization in scale search#2323
Ramshankar07 wants to merge 16 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes: Hello @Ramshankar07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the AWQ (Activation-aware Weight Quantization) algorithm by introducing an option to account for the quantization of "smooth layers" alongside "balance layers." Previously, AWQ optimized scales only for balance layers, assuming smooth layers were unquantized. This change addresses scenarios where both types of layers are quantized, preventing degradation of the smooth layer's quantization quality by including its error in the scale-search objective. The modification ensures that the scale factors are chosen to minimize quantization error across all targeted layers, leading to more robust and accurate quantized models.
Code Review
This pull request introduces an option to account for smooth layer quantization during the AWQ scale search, which is a valuable addition for scenarios where both smooth and balance layers are quantized. The implementation is well-structured, particularly with the extraction of logic into the _apply_balance_layer_quantization_in_grid_search helper function. I've identified a few areas for improvement: a performance optimization in _apply_smoothing, a simplification of conditional logic in _compute_best_scale, and a potential bug fix related to in-place tensor modification during the grid search. Overall, this is a solid contribution.
The quality checks have failed. Please run […]
I'll do the lm_eval and update by tomorrow.
I'll take a look, but we're trying to get a release out right now, so it may be a day or two. Also, can you reach out to me on vLLM Slack under the same username?
Sure.
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Summary Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
SUMMARY
This PR adds an option to take smooth layer quantization into account when computing AWQ scales, e.g. for the up_proj → down_proj mapping in transformer FFN blocks.
This PR addresses #2296.
Problem
AWQ picks scale factors by minimizing quantization error only for the balance layer (e.g. down_proj) and applies an inverse rescale to the smooth layer (e.g. up_proj), which is usually assumed unquantized. When both are quantization targets, the scale chosen for the balance layer can worsen quantization of the smooth layer because the smooth layer’s quantization error is not included in the objective.
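For context, the standard AWQ scale search can be sketched as follows. This is a simplified toy version, not the llm-compressor implementation; `pseudo_quantize` and `best_scale_balance_only` are illustrative names, and the real code searches over activation/weight-magnitude ratios per smoothing group.

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 32) -> torch.Tensor:
    """Fake-quantize Q(W): asymmetric group quantization followed by dequantization."""
    shape = w.shape
    g = w.reshape(-1, group_size)
    g_min, g_max = g.amin(dim=1, keepdim=True), g.amax(dim=1, keepdim=True)
    scale = (g_max - g_min).clamp(min=1e-5) / (2**n_bits - 1)
    zero = (-g_min / scale).round()
    q = ((g / scale).round() + zero).clamp(0, 2**n_bits - 1)
    return ((q - zero) * scale).reshape(shape)

def best_scale_balance_only(x: torch.Tensor, w_balance: torch.Tensor,
                            n_grid: int = 20) -> torch.Tensor:
    """Grid-search s minimizing || X W^T - (X/s) Q(W*s)^T ||^2 for the balance layer only."""
    x_mean = x.abs().mean(dim=0)              # per-input-channel activation magnitude
    org_out = x @ w_balance.T                 # unquantized reference output
    best_err, best_s = float("inf"), None
    for i in range(n_grid):
        s = x_mean.pow(i / n_grid).clamp(min=1e-4)
        s = s / (s.max() * s.min()).sqrt()    # normalize the scale vector
        err = ((x / s) @ pseudo_quantize(w_balance * s).T - org_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

Note the objective only measures the balance layer's output error; the inverse scale 1/s is folded into the smooth layer's weights afterwards, with no penalty for what that does to the smooth layer's own quantization.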
Solution
- Adds a `smooth_layer_quantization` flag (default: `False`) on `AWQModifier`. When enabled and the smooth layer is in the quantization target list:
  - The parent module is resolved with `get_lowest_common_ancestor_with_avoid` over both the balance layer names and the smooth layer name.
  - `_compute_best_scale` is called with `smooth_layer_targeted=True`. The smooth layer is added to `orig_layer_weights`, and during the grid search we rescale it by 1/s and quantize to Q(W/s).
  - In the `_smooth` logic, when the smooth layer is in `orig_layer_weights`, we apply rescaling from `orig_layer_weights[smooth_layer]` so the stored weights are W/s for later calibration, avoiding double rescaling of the quantized grid-search weights.
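With the flag enabled, the grid-search objective also penalizes the quantization error of the smooth layer's rescaled weights W/s. A toy sketch of the combined objective (illustrative names and a simplified loss, not the actual llm-compressor code):

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 32) -> torch.Tensor:
    """Fake-quantize Q(W): asymmetric group quantization followed by dequantization."""
    shape = w.shape
    g = w.reshape(-1, group_size)
    g_min, g_max = g.amin(dim=1, keepdim=True), g.amax(dim=1, keepdim=True)
    scale = (g_max - g_min).clamp(min=1e-5) / (2**n_bits - 1)
    zero = (-g_min / scale).round()
    q = ((g / scale).round() + zero).clamp(0, 2**n_bits - 1)
    return ((q - zero) * scale).reshape(shape)

def best_scale_joint(x: torch.Tensor, w_balance: torch.Tensor,
                     w_smooth: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Grid-search s minimizing the balance layer's output error plus the
    smooth layer's own quantization error on its rescaled weights W/s."""
    x_mean = x.abs().mean(dim=0)
    org_out = x @ w_balance.T
    best_err, best_s = float("inf"), None
    for i in range(n_grid):
        s = x_mean.pow(i / n_grid).clamp(min=1e-4)
        s = s / (s.max() * s.min()).sqrt()
        balance_err = ((x / s) @ pseudo_quantize(w_balance * s).T - org_out).pow(2).mean()
        # the smooth layer's output channels (rows of W_smooth) are divided by s,
        # so its quantization error Q(W/s) - W/s now enters the objective
        w_s = w_smooth / s[:, None]
        smooth_err = (pseudo_quantize(w_s) - w_s).pow(2).mean()
        err = (balance_err + smooth_err).item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

A scale that sharpens the balance layer's error but blows up the smooth layer's weight range is now rejected, which is the failure mode the PR description identifies when both layers are quantization targets.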