Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Force-pushed from 8906be2 to 18f8341
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from 3213a7d to fa75986
brian-dellabetta
left a comment
This LGTM! slowly understanding more
Hi, I see that you mention SpinQuant + GPTQ. Does llm-compressor support such a recipe now?
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
rahul-tuli
left a comment
LGTM pending comment; Thank you, nice work!
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
brian-dellabetta
left a comment
Just make sure your other PR goes in first, or drop the changes to kv cache tests from this one
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
brian-dellabetta
left a comment
good stuff. always nice when adding an abstraction leads to more lines removed than added 🔥
rahul-tuli
left a comment
Synced offline on scope of testing; Changes introduced by this diff are tested by #1279 by virtue of lm_eval tests! Changes look good.
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
e4debea
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
## Purpose ##
* Abstract the functionality which allows modifiers to act as quantization configs into a mixin called `QuantizationMixin`
  * This gives #1279 an interface to properly infer which pipeline to use based on the recipe (if a recipe contains modifiers that require calibration, use the "basic" or "sequential" pipelines)
  * This enables future modifiers to act as quantization modifiers (in the same way that GPTQ does now)
* Related to #1354, where previous logic would attempt to add a `QuantizedKVCache` for dynamic kv_quant

## Changes ##
* Implement `QuantizationMixin`, which implements five public methods
  * Lifecycle methods
    * `initialize_quantization` is used to apply a config and attach observers to a model
      * Quantization is disabled so that modules aren't quantized before they're calibrated
    * `start_calibration` is used to initialize calibration hooks and status
      * Quantization is enabled, since we currently quantize as we calibrate, although this decision is somewhat arbitrary
    * `end_calibration` is used to remove calibration hooks and apply the frozen status
      * Quantization remains enabled, since we want future forward passes to simulate quantization
  * Recipe-related methods
    * `has_config` returns true if a config was specified; used for checking against duplicate configs in the recipe
    * `resolve_quantization_config` returns the quantization config specified by the modifier fields
* `QuantizationModifier` inherits from `QuantizationMixin`
* `GPTQModifier` inherits from `QuantizationMixin`
  * Unlike QMod, GPTQ disables quantization during calibration. As noted before, this is a somewhat arbitrary choice, but one which matches the current implementation
* Calibration utils
  * Replace `set_unset_kv_cache` with `initialize_quantized_kv_cache` and `freeze_module_quantization`
    * Treat the `QuantizedKVCache` as analogous to another observer
  * Pull setting the calibration status out of `update_weight_zp_scale`
    * This better matches the lifecycle detailed in the `QuantizationMixin` description
  * Implement `reset_quantization_status`, which is used to remove any existing quantization configs before the current config is applied by `initialize_quantization`

## Removed Support ##
* Remove support for recipes with multiple quantization modifiers active at the same time (a check for this will be added by #1279)
* Remove `num_calibration_steps`, `quantize`, `disable_quantization_observer_epoch`, and `min_tokens_per_module`
  * `num_calibration_steps` is already controlled by https://github.com/vllm-project/llm-compressor/blob/42b62f5283d0234b26623fe1f1bf02a77c6e4019/src/llmcompressor/datasets/utils.py#L106
  * `quantize` was implemented as a workaround for GPTQ's modifier builder. Similar functionality may be required to support SpinQuant + GPTQ, but such functionality should exist at a higher level
  * `disable_quantization_observer_epoch` seems to implement functionality where a model's observers are removed but quantization remains active. This functionality is maintained by setting an "end" epoch for qmod
  * `min_tokens_per_module` requires that the modifier have references to the calibration dataset, which is disallowed by #1279. This information is already printed in GPTQ's logs. If research still wants this tool specifically for `QuantizationModifier`, it can be reimplemented to avoid references to the calibration dataset

## Testing ##
* Updated tests to reflect the new mixin
* Ran a set of GPTQ and QuantizationModifier examples to completion
* CI tests pass

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
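The five-method lifecycle above can be sketched as a small standalone class. This is a simplified illustration, not the actual llm-compressor implementation: only the method names and their ordering (initialize with quantization disabled, calibrate with it enabled, freeze with it still enabled) come from the description; the model/observer stand-ins are hypothetical.

```python
# Hypothetical sketch of the QuantizationMixin lifecycle described above.
# The model is represented as a plain dict of module names; the real
# implementation operates on torch modules and attaches real observers.

class QuantizationMixin:
    """Lets a modifier act as a quantization config provider."""

    def __init__(self, config=None):
        self._config = config
        self._hooks = []
        self.quantization_enabled = False
        self.frozen = False

    # -- recipe-related methods --
    def has_config(self) -> bool:
        # True if a config was specified; used to detect duplicate
        # configs in a recipe
        return self._config is not None

    def resolve_quantization_config(self):
        # Return the config built from the modifier's fields
        return self._config

    # -- lifecycle methods --
    def initialize_quantization(self, model: dict):
        # Apply the config and attach observers; quantization stays
        # disabled so modules aren't quantized before calibration
        for name in model:
            model[name] = {"observer": "attached"}
        self.quantization_enabled = False

    def start_calibration(self, model: dict):
        # Attach calibration hooks; quantization is enabled because we
        # currently quantize as we calibrate
        self._hooks = [f"hook:{name}" for name in model]
        self.quantization_enabled = True

    def end_calibration(self, model: dict):
        # Remove hooks and apply the frozen status; quantization stays
        # enabled so future forward passes simulate quantization
        self._hooks.clear()
        self.frozen = True
```

A modifier such as `QuantizationModifier` or `GPTQModifier` would inherit from this mixin and call the three lifecycle methods in order around its calibration loop.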
## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
  * This enables faster compression of larger models
  * This enables more memory-efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##
* #1351
* #1298

## Callback Changes ##
* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* between pipelines composed using the independent pipeline and to *remove hooks* between independent pipelines

## Lifecycle Changes ##
* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` will be treated like `batch_start`: an opportunity for modifiers to start
* In the future, `calibration_epoch_end` and `finalize` will be treated like `batch_end`: opportunities for modifiers to end
* Right now, these opportunities are implemented manually on each oneshot modifier, rather than as a lifecycle rule

## Data Pipeline Changes ##
* Implement a data pipeline registry
  * The inferred pipeline is selected using modifiers and can be overridden by the user
* Implement the independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each module's initialize function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on `sequential_epoch_end` and `calibration_epoch_end`
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by the sequential pipeline)

## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
  * There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
  * https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
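The callback ordering described above can be sketched with plain functions. This is an illustrative sketch, not the project's actual pipeline code: only the three callback names come from the description; the pipeline signatures and event-list mechanism are hypothetical stand-ins.

```python
# Hypothetical sketch of the callback ordering: an "independent"
# pipeline runs one sub-pipeline per modifier, and each sub-pipeline
# fires calibration_epoch_start, then sequential_epoch_end per layer,
# then calibration_epoch_end.

def sequential_pipeline(modifier, layers, events):
    # Modifiers attach their hooks at the start of calibration
    events.append(("calibration_epoch_start", modifier))
    for layer in layers:
        # ... calibration forward passes for this layer would run here ...
        # Fired after one sequential layer finishes an epoch; this is
        # what triggers compression instead of a callback_modifier
        events.append(("sequential_epoch_end", modifier))
    # Fired at the end of the epoch: trigger compression and remove hooks
    events.append(("calibration_epoch_end", modifier))


def independent_pipeline(modifiers, layers, events):
    # Treat each modifier as its own stage with its own pipeline,
    # replicating the prior behavior where modifiers ran in isolation
    for modifier in modifiers:
        sequential_pipeline(modifier, layers, events)
```

Under this sketch, running two modifiers over two layers yields each modifier's start/end events bracketing two `sequential_epoch_end` events, which is the order a compression step like `_apply_smoothing` would observe.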
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>