AWQ smooth layer quantization (v2) [not for land]#2431

Draft
HDCharles wants to merge 11 commits intomainfrom
awq-smooth-layer-quantization-v2

Conversation

@HDCharles
Collaborator

@HDCharles HDCharles commented Mar 3, 2026

I added different evaluation code along the same lines as what we used previously; I didn't want to clutter the other PR, so I made a new one.

Ramshankar07 and others added 11 commits February 27, 2026 21:42
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
Signed-off-by: Ramshankar07 <picographer0214@gmail.com>
…n scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Mar 3, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@HDCharles HDCharles marked this pull request as draft March 3, 2026 15:48
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances AWQ quantization by integrating and refining smooth layer quantization capabilities. It introduces a dedicated option within the AWQ modifier to account for smooth layers during scale determination, supported by a unified and robust rescaling mechanism. The changes are thoroughly demonstrated through new example scripts covering a wide array of quantization schemes and MoE models, ensuring broad applicability and ease of use. Furthermore, the PR includes new tools for streamlined evaluation and consistent naming conventions, enhancing the overall testing and deployment workflow for AWQ-quantized models.

Highlights

  • Smooth Layer Quantization Option: Introduced a new smooth_layer_quantization option within the AWQModifier to enable consideration of smooth layers during the AWQ scale search process.
  • Unified Rescaling Logic: Enhanced the internal handling of smooth layers with a unified rescaling logic, improving consistency and accuracy during quantization.
  • Comprehensive Example Scripts: Added a suite of new example scripts demonstrating AWQ quantization with smooth layers across various configurations, including FP8 block, FP8 dynamic, W4A16, W4A8, and MoE variants (Qwen3 MoE, Qwen3 Coder MoE, Qwen3-Next).
  • Standardized Save Directory Naming: Fixed and standardized save directory naming conventions across all AWQ examples for better organization and clarity.
  • Evaluation and Run Scripts: Included new utility scripts for extracting evaluation results from log files and a comprehensive run.sh script to automate the execution and evaluation of AWQ examples.
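As a sketch of how the new option might surface in a recipe: the fragment below follows the usual llm-compressor recipe layout, but the exact contents of the new recipe file are an assumption — only the `smooth_layer_quantization` field is taken from this PR's description.

```yaml
# Hypothetical sketch of a W4A16 AWQ recipe enabling the new option.
# Only smooth_layer_quantization is introduced by this PR; the rest is
# the standard llm-compressor recipe structure, shown for context.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]
      smooth_layer_quantization: true
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
```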


Changelog
  • examples/awq/awq_without_smooth.py
    • Added a Modal script to run AWQ baseline without smooth layer quantization.
  • examples/awq/fp8_block_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/fp8_block_llama_example_smooth.py
    • Added an example script for FP8 block quantization with smooth layers.
  • examples/awq/fp8_dynamic_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/fp8_dynamic_llama_example_smooth.py
    • Added an example script for FP8 dynamic quantization with smooth layers.
  • examples/awq/llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/llama_example_smooth.py
    • Added an example script for W4A16 quantization with smooth layers.
  • examples/awq/lm_eval_smooth_layer_test.py
    • Added a local script to compare AWQ with and without smooth layer quantization using lm_eval.
  • examples/awq/modal_awq_runners.py
    • Added Modal application runners for AWQ baseline and smooth layer quantization, including lm_eval integration.
  • examples/awq/modal_awq_with_smooth.py
    • Added a Modal entrypoint script to execute AWQ with smooth layer quantization.
  • examples/awq/qwen3_coder_moe_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_coder_moe_example_smooth.py
    • Added an example script for Qwen3 Coder MoE with smooth layer quantization.
  • examples/awq/qwen3_moe_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_moe_example_smooth.py
    • Added an example script for Qwen3 MoE with smooth layer quantization.
  • examples/awq/qwen3_next_example.py
    • Updated the save directory naming convention.
  • examples/awq/qwen3_next_example_smooth.py
    • Added an example script for Qwen3-Next with smooth layer quantization.
  • examples/awq/w4a8_fp8_llama_example.py
    • Updated the save directory naming convention.
  • examples/awq/w4a8_fp8_llama_example_smooth.py
    • Added an example script for W4AFP8 quantization with smooth layers.
  • extract_eval_results.py
    • Added a utility script to parse and display evaluation results from log files.
  • run.sh
    • Added a comprehensive shell script to automate the execution and evaluation of various AWQ examples.
  • src/llmcompressor/modifiers/awq/base.py
    • Added smooth_layer_quantization as a configurable parameter to AWQModifier.
    • Modified the logic for resolving module mappings to correctly identify smooth layers when smooth_layer_quantization is enabled.
    • Refactored the weight smoothing application to handle smooth layer quantization during grid search.
    • Introduced a new helper function _rescale_and_fake_quantize_layer for consistent weight manipulation and quantization.
    • Updated _compute_best_scale to incorporate smooth layer quantization into the grid search for optimal scale determination.
  • tests/e2e/vLLM/recipes/WNA16/recipe_w4a16_awq_sym_with_smooth.yaml
    • Added a new recipe file to enable W4A16 AWQ symmetric quantization with smooth layer quantization.

@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 3, 2026
@mergify
Contributor

mergify bot commented Mar 3, 2026

The quality checks have failed. Please run make style and make quality from the root directory to address the lint failures. You will need the dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for smooth layer quantization in AWQ, including updates to the core AWQModifier logic and the addition of numerous example scripts for various models and quantization schemes. The save directory naming conventions in existing examples have also been standardized. The changes are well-structured, and the new feature is supported by comprehensive examples and testing scripts. I've identified a minor issue in one of the new test scripts where save directory names are hardcoded incorrectly, which could cause confusion.

awq_time_smooth: float | None = None

if args.without_smooth or args.both:
    save_baseline = "qwen3-0.6b-w4a16-awq-baseline"
Contributor


medium

The save_baseline directory is hardcoded to a name that doesn't reflect the MODEL_ID being used (meta-llama/Meta-Llama-3-8B-Instruct). This can be confusing and is likely a copy-paste error. It's better to derive the save directory name from the MODEL_ID for clarity and consistency.

Suggested change
save_baseline = "qwen3-0.6b-w4a16-awq-baseline"
save_baseline = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-baseline"
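To illustrate the suggested pattern, the f-string derives the directory name from the final path component of the model ID (using the meta-llama/Meta-Llama-3-8B-Instruct value the review refers to):

```python
# Derive the save directory from the model ID rather than hardcoding it,
# so the name stays correct if MODEL_ID changes.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

save_baseline = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-baseline"
print(save_baseline)  # Meta-Llama-3-8B-Instruct-w4a16-awq-baseline
```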

print(f"Baseline metrics: {m}")

if args.with_smooth or args.both:
    save_smooth = "qwen3-0.6b-w4a16-awq-with-smooth"
Contributor


medium

Similar to save_baseline, the save_smooth directory is hardcoded with a name that doesn't match the MODEL_ID. This should also be derived from MODEL_ID to avoid confusion.

Suggested change
save_smooth = "qwen3-0.6b-w4a16-awq-with-smooth"
save_smooth = f"{MODEL_ID.split('/')[-1]}-w4a16-awq-with-smooth"

@HDCharles HDCharles changed the title AWQ smooth layer quantization (v2) AWQ smooth layer quantization (v2) [not for land] Mar 3, 2026
w_qscheme,
)
if is_smooth_layer:
    layer.weight.data = quantized.to(weight_dtype)


Hi! While reviewing the also_quantize_smooth_layers logic in _rescale_and_fake_quantize_layer, I noticed a potential math issue.

Because of this specific check:

if is_smooth_layer:
    layer.weight.data = quantized.to(weight_dtype)
else:
    layer.weight.data = (quantized / scales_view).to(weight_dtype)

The smooth layer doesn't divide by its scales_view ($1/s$). This seems to break the simulated scaling and cause a double-scaling ($1/s^2$) issue during the simulated forward pass (_run_samples):

  1. Smooth Layer: Since the weight is fixed to $Q(W_{sm}/s)$, its physical output activation is scaled down to $\approx s^{-1} \cdot X$.
  2. Balance Layer: It receives $s^{-1} \cdot X$ from the smooth layer, but it also artificially scales down its own weights by $s$ (via the else branch).
  3. Combined Effect: The final output of the block becomes $(s^{-1} \cdot X) \cdot (\frac{Q(W_{bal} \cdot s)}{s})^T = \mathbf{s^{-2}} \cdot \mathbf{X} \cdot Q(W_{bal} \cdot s)^T$.
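The mismatch described above can be reproduced with a scalar toy model that substitutes an identity function for $Q(\cdot)$, so any deviation from the unquantized baseline comes purely from the scale bookkeeping. All names here are illustrative, not the PR's actual code:

```python
# Toy scalar reproduction of the double-scaling concern.
def fake_quant(w):
    return w  # identity stand-in for Q(.); real quantization adds rounding error

x, w_smooth, w_balance, s = 2.0, 3.0, 5.0, 4.0
baseline = (x * w_smooth) * w_balance  # unquantized reference output

# As written in the branch: the smooth layer stores Q(W_sm / s), so its
# output activations already carry a physical 1/s factor...
act = x * fake_quant(w_smooth / s)
# ...but the balance layer still applies Q(W_bal * s) / s, dividing by s again.
buggy = act * (fake_quant(w_balance * s) / s)

# Consistent handling: once 1/s is baked into the smooth layer's output,
# the balance layer should use Q(W_bal * s) with no extra division.
fixed = act * fake_quant(w_balance * s)

print(baseline, buggy, fixed)  # 30.0 7.5 30.0 -- buggy is off by a factor of s
```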


Labels

documentation (Improvements or additions to documentation), quality-failed

Projects

None yet

3 participants