Add Intel AutoRound algorithm support #8

Closed
yiliu30 wants to merge 33 commits into main from up-ar
Conversation

@yiliu30 yiliu30 commented Nov 3, 2025

Address [todo: issue number]

Highlights

  • Introduced AutoRoundModifier to enable AutoRound quantization for wNa16.
  • Added an end-to-end example and unit tests.
  • Verified functionality with local accuracy tests (GSM8K with a limit of 1000; results may fluctuate due to non-determinism).

AutoRound result (reference)
vllm (pretrained=/storage/yiliu7/meta-llama/Meta-Llama-3-8B-Instruct-ar/Meta-Llama-3-8B-Instruct-w4g128/,tensor_parallel_size=1,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.739|±  |0.0139|
|     |       |strict-match    |     5|exact_match||0.740|±  |0.0139|

LLMC-AutoRound
vllm (pretrained=/storage/yiliu7/Meta-Llama-3-8B-Instruct-W4A16-G128-disbale-shuffule,tensor_parallel_size=1,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.737|±  |0.0139|
|     |       |strict-match    |     5|exact_match||0.736|±  |0.0139|

LLMC-GPTQ
vllm (pretrained=Meta-Llama-3-8B-Instruct-W4A16-G128-GPTQ,tensor_parallel_size=1,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.723|±  |0.0142|
|     |       |strict-match    |     5|exact_match||0.723|±  |0.0142|
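As a quick sanity check on the gaps between the three tables, the reported values and standard errors can be compared with a simple two-proportion z-score (pure arithmetic for context, not part of the PR; the 0.654 figure is computed from the numbers above):

```python
import math

def z_score(p1, se1, p2, se2):
    """Approximate z-score for the difference between two reported
    accuracies, given their standard errors."""
    return (p1 - p2) / math.sqrt(se1**2 + se2**2)

# LLMC-AutoRound vs LLMC-GPTQ, gsm8k strict-match (values from the tables above)
z = z_score(0.736, 0.0139, 0.723, 0.0142)
print(round(z, 2))  # → 0.65, well under 1.96, so the gap is within noise
```

The AutoRound-vs-GPTQ differences here are well inside one combined standard error, consistent with the note that results fluctuate due to non-determinism.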

Next stage (in later PRs)

  • Extend support for additional data types.
  • Add group-wise quantization recipes mapping between LLMC and AutoRound.
  • Add end-to-end tests.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
github-actions bot commented Nov 3, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@yiliu30 yiliu30 marked this pull request as ready for review November 3, 2025 11:20

yiliu30 commented Nov 3, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for AutoRound quantization by adding a new AutoRoundModifier, an example script, and corresponding tests. My review focuses on improving the portability and clarity of the new example script and addressing several issues within the AutoRoundModifier implementation. Key feedback includes removing hardcoded paths and values, generalizing model-specific logic, and completing the handling of quantization parameters to ensure correctness and reusability.


yiliu30 commented Nov 3, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the AutoRound quantization algorithm, including a new AutoRoundModifier, an example script, and tests. The core implementation is a great start, but there are several areas that need improvement for robustness and usability. Specifically, the example scripts contain hardcoded paths and development artifacts that should be cleaned up. The AutoRoundModifier implementation has hardcoded values for the quantization scheme and device, which limits its flexibility. There is also a critical TODO regarding weight updates that needs to be addressed to ensure correctness. My review includes suggestions to address these points.


yiliu30 commented Nov 5, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for AutoRound quantization by adding an AutoRoundModifier. The changes include the modifier implementation, a new dependency, an example script, and tests. Overall, the implementation is solid, but there are a few issues to address. The example scripts contain hardcoded paths and typos that should be cleaned up. More importantly, the AutoRoundModifier implementation has a potential bug in handling multiple configuration groups and appears to be missing logic for updating the weight_zero_point after quantization. I've provided specific comments and suggestions to address these points.


yiliu30 commented Nov 5, 2025

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request integrates the AutoRound quantization algorithm into the llmcompressor framework, providing a new modifier for users to apply this method. The core implementation of the AutoRoundModifier is included, leveraging the auto_round library for 4-bit weight quantization. The changes also encompass an example script demonstrating its usage with a Llama 3 model, along with corresponding tests, and updates to the system's handling of quantization-aware calibration to properly support the new modifier.

Highlights

  • New AutoRound Quantization Modifier: Introduced the AutoRoundModifier to enable the AutoRound quantization algorithm within the llmcompressor framework, allowing for 4-bit weight quantization.
  • Integration of auto_round Library: The auto_round library, specifically from a GitHub branch, has been added as a new dependency to support the AutoRound functionality.
  • Example and Testing: A new example script demonstrates how to apply AutoRound quantization to a Llama 3 8B Instruct model, and a dedicated test suite verifies the modifier's functionality with a TinyLlama model.
  • Quantization-Aware Calibration Handling: The AutoRoundModifier has been added to the list of modifiers that disable quantization-aware calibration in sequential pipelines, ensuring compatibility with the AutoRound process.
Changelog
  • examples/quantization_w4a16/auto_round_llama3_example.py
    • Added a new example script showcasing AutoRound 4-bit weight quantization for the Llama 3 8B Instruct model.
  • setup.py
    • Added the auto_round library as a dependency, pointing to a specific GitHub branch.
  • src/llmcompressor/modifiers/quantization/__init__.py
    • Updated to export the newly added autoround module.
  • src/llmcompressor/modifiers/quantization/autoround/__init__.py
    • Added an __init__.py file for the new autoround package.
  • src/llmcompressor/modifiers/quantization/autoround/base.py
    • Implemented the AutoRoundModifier class, which orchestrates the layer-by-layer application of the AutoRound quantization algorithm.
  • src/llmcompressor/pipelines/layer_sequential/pipeline.py
    • Modified to include AutoRoundModifier in the list of modifiers that disable quantization-aware calibration.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Modified to include AutoRoundModifier in the list of modifiers that disable quantization-aware calibration.
  • src/llmcompressor/utils/helpers.py
    • Defined DISABLE_QAC_MODIFIERS and added AutoRoundModifier to this list.
  • tests/llmcompressor/transformers/autoround/test_oneshot.py
    • Added new tests to validate the oneshot application of AutoRoundModifier on a TinyLlama model.
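The diff itself is not shown here; as a rough illustration only, a wNa16 recipe driving this modifier would presumably follow llm-compressor's usual recipe conventions, along these lines (every field name and value below is an assumption based on the description above, not taken from this PR):

```yaml
quant_stage:
  quant_modifiers:
    AutoRoundModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
```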
Activity
  • The pull request was initiated by yiliu30.
  • A bot requested the 'ready' label for review.
  • yiliu30 requested Gemini reviews multiple times.
  • Gemini Code Assist provided several review comments, highlighting critical issues such as hardcoded model structure access, hardcoded arguments for auto_round.AutoRound, and incomplete handling of quantization parameters.
  • High-priority feedback included issues with hardcoded device settings, the need for assertions for single config groups, and clarification on weight updates.
  • Medium-priority comments addressed confusing example scripts with hardcoded paths and overwritten variables, a fragile _is_decoding_layer function, incorrect docstrings, local auto_round imports without error handling, and the use of a specific GitHub branch for a dependency.

@yiliu30 yiliu30 changed the title from "Up ar" to "Add Intel AutoRound algorithm support" Nov 5, 2025

yiliu30 commented Nov 5, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Intel AutoRound quantization algorithm, which is a valuable addition. The implementation includes the core AutoRoundModifier, an end-to-end example, and unit tests. The code is generally well-structured, but I've identified a few areas for improvement related to correctness and robustness, particularly concerning the handling of quantization parameters and module attribute modifications.

Comment on lines +134 to +135
for name, mod in state.model.named_modules():
    mod._tmp_name = name


medium

Adding a temporary attribute _tmp_name to every module in the model can be fragile and may cause attribute name collisions. A more robust approach would be to avoid modifying the modules directly. For instance, you could use a dictionary that maps id(module) to its name, or use a closure in the hook registration to capture the module name. This would make the implementation cleaner and less intrusive.
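The dictionary and closure approaches the reviewer mentions can be sketched as follows. This is a minimal self-contained illustration of the pattern, not the PR's code; `Module` is a tiny stand-in for `torch.nn.Module`, just enough to make the sketch runnable without PyTorch:

```python
class Module:
    """Stand-in for torch.nn.Module: holds named children and can
    enumerate (qualified_name, module) pairs like named_modules()."""

    def __init__(self, children=None):
        self._children = dict(children or {})

    def named_modules(self, prefix=""):
        yield prefix, self
        for name, child in self._children.items():
            yield from child.named_modules(f"{prefix}.{name}" if prefix else name)


def build_name_map(model):
    # id(module) -> fully qualified name; the modules themselves are untouched,
    # so no _tmp_name attribute (and no risk of attribute collisions)
    return {id(mod): name for name, mod in model.named_modules()}


def make_hook(name, seen):
    # Closure variant: the hook captures its module's name directly
    def hook(module, args, output):
        seen.append(name)
        return output
    return hook


model = Module({"layers": Module({"0": Module(), "1": Module()})})
names = build_name_map(model)
print(names[id(model._children["layers"]._children["0"])])  # → layers.0
```

Either variant lets the forward hooks know which module they belong to while leaving the model's modules unmodified.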


yiliu30 commented Nov 13, 2025

Formal PR #1994 was merged; closing this one.

@yiliu30 yiliu30 closed this Nov 13, 2025