AWQ smooth layer quantization (v2) [not for land]#2431

Draft
HDCharles wants to merge 11 commits intomainfrom
awq-smooth-layer-quantization-v2

Conversation

@HDCharles
Collaborator

@HDCharles HDCharles commented Mar 3, 2026

I added different evaluation code along the same lines as what we used previously; I didn't want to clutter the other PR, so I made a new one.

Ramshankar07 and others added 11 commits February 27, 2026 21:42
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
…n scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Mar 3, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@HDCharles HDCharles marked this pull request as draft March 3, 2026 15:48
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances AWQ quantization by integrating and refining smooth layer quantization capabilities. It introduces a dedicated option within the AWQ modifier to account for smooth layers during scale determination, supported by a unified and robust rescaling mechanism. The changes are thoroughly demonstrated through new example scripts covering a wide array of quantization schemes and MoE models, ensuring broad applicability and ease of use. Furthermore, the PR includes new tools for streamlined evaluation and consistent naming conventions, enhancing the overall testing and deployment workflow for AWQ-quantized models.

Highlights

  • Smooth Layer Quantization Option: Introduced a new smooth_layer_quantization option within the AWQModifier to enable consideration of smooth layers during the AWQ scale search process.
  • Unified Rescaling Logic: Enhanced the internal handling of smooth layers with a unified rescaling logic, improving consistency and accuracy during quantization.
  • Comprehensive Example Scripts: Added a suite of new example scripts demonstrating AWQ quantization with smooth layers across various configurations, including FP8 block, FP8 dynamic, W4A16, W4A8, and MoE variants (Qwen3 MoE, Qwen3 Coder MoE, Qwen3-Next).
  • Standardized Save Directory Naming: Fixed and standardized save directory naming conventions across all AWQ examples for better organization and clarity.
  • Evaluation and Run Scripts: Included new utility scripts for extracting evaluation results from log files and a comprehensive run.sh script to automate the execution and evaluation of AWQ examples.
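As a sketch of how the new option might surface in a recipe: the fragment below follows the usual llm-compressor recipe layout, but the exact contents of the new recipe file are an assumption — only the `smooth_layer_quantization` field is taken from this PR's description.

```yaml
# Hypothetical sketch of a W4A16 AWQ recipe enabling the new option.
# Only smooth_layer_quantization is introduced by this PR; the rest is
# the standard llm-compressor recipe structure, shown for context.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]
      smooth_layer_quantization: true
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
```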


Changelog
  • examples/awq/awq_without_smooth.py
    • Added a Modal script to run AWQ baseline without smooth layer quantization.
  • examples/awq/fp8_block_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/fp8_block_llama_example_smooth.py
    • Added an example script for FP8 block quantization with smooth layers.
  • examples/awq/fp8_dynamic_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/fp8_dynamic_llama_example_smooth.py
    • Added an example script for FP8 dynamic quantization with smooth layers.
  • examples/awq/llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/llama_example_smooth.py
    • Added an example script for W4A16 quantization with smooth layers.
  • examples/awq/lm_eval_smooth_layer_test.py
    • Added a local script to compare AWQ with and without smooth layer quantization using lm_eval.
  • examples/awq/modal_awq_runners.py
    • Added Modal application runners for AWQ baseline and smooth layer quantization, including lm_eval integration.
  • examples/awq/modal_awq_with_smooth.py
    • Added a Modal entrypoint script to execute AWQ with smooth layer quantization.
  • examples/awq/qwen3_coder_moe_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_coder_moe_example_smooth.py
    • Added an example script for Qwen3 Coder MoE with smooth layer quantization.
  • examples/awq/qwen3_moe_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_moe_example_smooth.py
    • Added an example script for Qwen3 MoE with smooth layer quantization.
  • examples/awq/qwen3_next_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_next_example_smooth.py
    • Added an example script for Qwen3-Next with smooth layer quantization.
  • examples/awq/w4a8_fp8_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/w4a8_fp8_llama_example_smooth.py
    • Added an example script for W4AFP8 quantization with smooth layers.
  • extract_eval_results.py
    • Added a utility script to parse and display evaluation results from log files.
  • run.sh
    • Added a comprehensive shell script to automate the execution and evaluation of various AWQ examples.
  • src/llmcompressor/modifiers/awq/base.py
    • Added smooth_layer_quantization as a configurable parameter to AWQModifier.
    • Modified the logic for resolving module mappings to correctly identify smooth layers when smooth_layer_quantization is enabled.
    • Refactored the weight smoothing application to handle smooth layer quantization during grid search.
    • Introduced a new helper function _rescale_and_fake_quantize_layer for consistent weight manipulation and quantization.
    • Updated _compute_best_scale to incorporate smooth layer quantization into the grid search for optimal scale determination.
  • tests/e2e/vLLM/recipes/WNA16/recipe_w4a16_awq_sym_with_smooth.yaml
    • Added a new recipe file to enable W4A16 AWQ symmetric quantization with smooth layer quantization.

@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 3, 2026
@mergify
Contributor

mergify bot commented Mar 3, 2026

The quality checks have failed. Please run make style and make quality from the root directory to address the lint failures. You will need the dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for smooth layer quantization in AWQ, including updates to the core AWQModifier logic and the addition of numerous example scripts for various models and quantization schemes. The save directory naming conventions in existing examples have also been standardized. The changes are well-structured, and the new feature is supported by comprehensive examples and testing scripts. I've identified a minor issue in one of the new test scripts where save directory names are hardcoded incorrectly, which could cause confusion.

awq_time_smooth: float | None = None

if args.without_smooth or args.both:
    save_baseline = "qwen3-0.6b-w4a16-awq-baseline"
Contributor


medium

The save_baseline directory is hardcoded to a name that doesn't reflect the MODEL_ID being used (meta-llama/Meta-Llama-3-8B-Instruct). This can be confusing and is likely a copy-paste error. It's better to derive the save directory name from the MODEL_ID for clarity and consistency.

Suggested change
save_baseline = "qwen3-0.6b-w4a16-awq-baseline"
save_baseline = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-baseline"
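To illustrate the suggested pattern, the f-string derives the directory name from the final path component of the model ID (using the meta-llama/Meta-Llama-3-8B-Instruct value the review refers to):

```python
# Derive the save directory from the model ID rather than hardcoding it,
# so the name stays correct if MODEL_ID changes.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

save_baseline = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-baseline"
print(save_baseline)  # Meta-Llama-3-8B-Instruct-w4a16-awq-baseline
```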

print(f"Baseline metrics: {m}")

if args.with_smooth or args.both:
    save_smooth = "qwen3-0.6b-w4a16-awq-with-smooth"
Contributor


medium

Similar to save_baseline, the save_smooth directory is hardcoded with a name that doesn't match the MODEL_ID. This should also be derived from MODEL_ID to avoid confusion.

Suggested change
save_smooth = "qwen3-0.6b-w4a16-awq-with-smooth"
save_smooth = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-with-smooth"

@HDCharles HDCharles changed the title AWQ smooth layer quantization (v2) AWQ smooth layer quantization (v2) [not for land] Mar 3, 2026
w_qscheme,
)
if is_smooth_layer:
    layer.weight.data = quantized.to(weight_dtype)


Hi! While reviewing the also_quantize_smooth_layers logic in _rescale_and_fake_quantize_layer, I noticed a potential math issue.

Because of this specific check:

if is_smooth_layer:
    layer.weight.data = quantized.to(weight_dtype)
else:
    layer.weight.data = (quantized / scales_view).to(weight_dtype)

The smooth layer doesn't divide by its scales_view ($1/s$). This seems to break the simulated scaling and cause a double-scaling ($1/s^2$) issue during the simulated forward pass (_run_samples):

  1. Smooth Layer: Since the weight is fixed to $Q(W_{sm}/s)$, its physical output activation is scaled down to $\approx s^{-1} \cdot X$.
  2. Balance Layer: It receives $s^{-1} \cdot X$ from the smooth layer, but it also artificially scales down its own weights by $s$ (via the else branch).
  3. Combined Effect: The final output of the block becomes $(s^{-1} \cdot X) \cdot (\frac{Q(W_{bal} \cdot s)}{s})^T = \mathbf{s^{-2}} \cdot \mathbf{X} \cdot Q(W_{bal} \cdot s)^T$.
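The mismatch described above can be reproduced with a scalar toy model that substitutes an identity function for $Q(\cdot)$, so any deviation from the unquantized baseline comes purely from the scale bookkeeping. All names here are illustrative, not the PR's actual code:

```python
# Toy scalar reproduction of the double-scaling concern.
def fake_quant(w):
    return w  # identity stand-in for Q(.); real quantization adds rounding error

x, w_smooth, w_balance, s = 2.0, 3.0, 5.0, 4.0
baseline = (x * w_smooth) * w_balance  # unquantized reference output

# As written in the branch: the smooth layer stores Q(W_sm / s), so its
# output activations already carry a physical 1/s factor...
act = x * fake_quant(w_smooth / s)
# ...but the balance layer still applies Q(W_bal * s) / s, dividing by s again.
buggy = act * (fake_quant(w_balance * s) / s)

# Consistent handling: once 1/s is baked into the smooth layer's output,
# the balance layer should use Q(W_bal * s) with no extra division.
fixed = act * fake_quant(w_balance * s)

print(baseline, buggy, fixed)  # 30.0 7.5 30.0 -- buggy is off by a factor of s
```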


Labels

documentation (Improvements or additions to documentation), quality-failed

Projects

None yet

3 participants