Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Force-pushed from 8906be2 to 18f8341
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from 3213a7d to fa75986
brian-dellabetta
left a comment
This LGTM! slowly understanding more
Hi, I see that you mention SpinQuant + GPTQ. Does llm-compressor support such a recipe now?
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
rahul-tuli
left a comment
LGTM pending comment; Thank you, nice work!
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
brian-dellabetta
left a comment
Just make sure your other PR goes in first, or drop the changes to kv cache tests from this one
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
brian-dellabetta
left a comment
good stuff. always nice when adding an abstraction leads to more lines removed than added 🔥
rahul-tuli
left a comment
Synced offline on scope of testing; Changes introduced by this diff are tested by #1279 by virtue of lm_eval tests! Changes look good.
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
e4debea
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
## Purpose ##
* Abstract the functionality which allows modifiers to act as quantization configs into a mixin called `QuantizationMixin`
  * This gives #1279 an interface to properly infer which pipeline to use based on the recipe (if a recipe contains modifiers that require calibration, use the "basic" or "sequential" pipelines)
  * This enables future modifiers to act as quantization modifiers (in the same way that GPTQ does now)
* Related to #1354, where previous logic would attempt to add a `QuantizedKVCache` for dynamic kv_quant

## Changes ##
* Implement `QuantizationMixin`, which implements five public methods
  * Lifecycle methods
    * `initialize_quantization` is used to apply a config and attach observers to a model
      * Quantization is disabled so that modules aren't quantized before they're calibrated
    * `start_calibration` is used to initialize calibration hooks and status
      * Quantization is enabled, since we currently quantize as we calibrate, although this decision is somewhat arbitrary
    * `end_calibration` is used to remove calibration hooks and apply the frozen status
      * Quantization remains enabled, since we want future forward passes to simulate quantization
  * Recipe-related methods
    * `has_config` returns true if a config was specified; used for checking against duplicate configs in the recipe
    * `resolve_quantization_config` returns the quantization config specified by the modifier fields
* `QuantizationModifier` inherits from `QuantizationMixin`
* `GPTQModifier` inherits from `QuantizationMixin`
  * Unlike QMod, GPTQ disables quantization during calibration. As noted before, this is a somewhat arbitrary choice, but one which matches the current implementation
* Calibration utils
  * Replace `set_unset_kv_cache` with `initialize_quantized_kv_cache` and `freeze_module_quantization`
    * Treat the `QuantizedKVCache` as analogous to another observer
  * Pull setting the calibration status out of `update_weight_zp_scale`
    * This better matches the lifecycle detailed in the `QuantizationMixin` description
  * Implement `reset_quantization_status`, which is used to remove any existing quantization configs before the current config is applied by `initialize_quantization`

## Removed Support ##
* Remove support for recipes with multiple quantization modifiers active at the same time (a check for this will be added by #1279)
* Remove `num_calibration_steps`, `quantize`, `disable_quantization_observer_epoch`, and `min_tokens_per_module`
  * `num_calibration_steps` is already controlled by https://github.com/vllm-project/llm-compressor/blob/42b62f5283d0234b26623fe1f1bf02a77c6e4019/src/llmcompressor/datasets/utils.py#L106
  * `quantize` was implemented as a workaround for GPTQ's modifier builder. Similar functionality may be required to support SpinQuant + GPTQ, but such functionality should exist at a higher level
  * `disable_quantization_observer_epoch` seems to implement functionality where a model's observers are removed but quantization remains active. This functionality is maintained by setting an "end" epoch for qmod
  * `min_tokens_per_module` requires that the modifier have references to the calibration dataset, which is disallowed by #1279. This information is already printed in GPTQ's logs. If research still wants this tool specifically for `QuantizationModifier`, it can be reimplemented to avoid references to the calibration dataset

## Testing ##
* Updated tests to reflect the new mixin
* Ran a set of GPTQ and QuantizationModifier examples to completion
* CI tests pass

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
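The five-method lifecycle above can be sketched as a small standalone class. This is a simplified illustration, not the actual llm-compressor implementation: only the method names and their ordering (initialize with quantization disabled, calibrate with it enabled, freeze with it still enabled) come from the description; the model/observer stand-ins are hypothetical.

```python
# Hypothetical sketch of the QuantizationMixin lifecycle described above.
# The model is represented as a plain dict of module names; the real
# implementation operates on torch modules and attaches real observers.

class QuantizationMixin:
    """Lets a modifier act as a quantization config provider."""

    def __init__(self, config=None):
        self._config = config
        self._hooks = []
        self.quantization_enabled = False
        self.frozen = False

    # -- recipe-related methods --
    def has_config(self) -> bool:
        # True if a config was specified; used to detect duplicate
        # configs in a recipe
        return self._config is not None

    def resolve_quantization_config(self):
        # Return the config built from the modifier's fields
        return self._config

    # -- lifecycle methods --
    def initialize_quantization(self, model: dict):
        # Apply the config and attach observers; quantization stays
        # disabled so modules aren't quantized before calibration
        for name in model:
            model[name] = {"observer": "attached"}
        self.quantization_enabled = False

    def start_calibration(self, model: dict):
        # Attach calibration hooks; quantization is enabled because we
        # currently quantize as we calibrate
        self._hooks = [f"hook:{name}" for name in model]
        self.quantization_enabled = True

    def end_calibration(self, model: dict):
        # Remove hooks and apply the frozen status; quantization stays
        # enabled so future forward passes simulate quantization
        self._hooks.clear()
        self.frozen = True
```

A modifier such as `QuantizationModifier` or `GPTQModifier` would inherit from this mixin and call the three lifecycle methods in order around its calibration loop.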
## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
  * This enables faster compression of larger models
  * This enables more memory-efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##
* #1351
* #1298

## Callback Changes ##
* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* between pipelines composed using the independent pipeline and to *remove hooks* between independent pipelines

## Lifecycle Changes ##
* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` will be treated like `batch_start`: an opportunity for modifiers to start
* In the future, `calibration_epoch_end` and `finalize` will be treated like `batch_end`: opportunities for modifiers to end
* Right now, these opportunities are implemented manually on each oneshot modifier, rather than as a lifecycle rule

## Data Pipeline Changes ##
* Implement a data pipeline registry
  * The inferred pipeline is selected using modifiers and can be overridden by the user
* Implement the independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each module's initialize function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on `sequential_epoch_end` and `calibration_epoch_end`
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by the sequential pipeline)

## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
  * There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
  * https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
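The callback ordering described above can be sketched with plain functions. This is an illustrative sketch, not the project's actual pipeline code: only the three callback names come from the description; the pipeline signatures and event-list mechanism are hypothetical stand-ins.

```python
# Hypothetical sketch of the callback ordering: an "independent"
# pipeline runs one sub-pipeline per modifier, and each sub-pipeline
# fires calibration_epoch_start, then sequential_epoch_end per layer,
# then calibration_epoch_end.

def sequential_pipeline(modifier, layers, events):
    # Modifiers attach their hooks at the start of calibration
    events.append(("calibration_epoch_start", modifier))
    for layer in layers:
        # ... calibration forward passes for this layer would run here ...
        # Fired after one sequential layer finishes an epoch; this is
        # what triggers compression instead of a callback_modifier
        events.append(("sequential_epoch_end", modifier))
    # Fired at the end of the epoch: trigger compression and remove hooks
    events.append(("calibration_epoch_end", modifier))


def independent_pipeline(modifiers, layers, events):
    # Treat each modifier as its own stage with its own pipeline,
    # replicating the prior behavior where modifiers ran in isolation
    for modifier in modifiers:
        sequential_pipeline(modifier, layers, events)
```

Under this sketch, running two modifiers over two layers yields each modifier's start/end events bracketing two `sequential_epoch_end` events, which is the order a compression step like `_apply_smoothing` would observe.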
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>