## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
  * This enables faster compression of larger models
  * This enables more memory-efficient compression of larger models (not limited to just GPTQ/SparseGPT)
## Prerequisites ##
* #1351
* #1298
## Callback Changes ##
* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated for one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch; it is used to *trigger compression* between pipelines composed via the independent pipeline and to *remove hooks* between independent pipelines
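The event contract above can be sketched with a toy modifier. This is an illustrative sketch only; the class and method bodies here are hypothetical, not the actual llm-compressor API — only the three callback names come from this PR:

```python
# Hypothetical sketch of the callback contract described above.
# Only the three event names are from the PR; everything else is illustrative.

class QuantizationModifierSketch:
    """Toy modifier reacting to the three calibration events."""

    def __init__(self):
        self.hooks = []
        self.compressed_layers = 0

    def calibration_epoch_start(self, model):
        # Attach observer hooks at the start of every calibration pipeline.
        for layer in model:
            self.hooks.append(f"hook:{layer}")

    def sequential_epoch_end(self):
        # One sequential layer finished an epoch: compress it now,
        # replacing the old `callback_modifier` mechanism.
        self.compressed_layers += 1

    def calibration_epoch_end(self):
        # End of a calibration epoch: remove hooks so the next
        # independent pipeline starts from a clean state.
        self.hooks.clear()


model = ["layer0", "layer1"]
mod = QuantizationModifierSketch()
mod.calibration_epoch_start(model)
for _ in model:
    mod.sequential_epoch_end()
mod.calibration_epoch_end()
```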
## Lifecycle Changes ##
* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` will be treated like `batch_start`, where it is an opportunity for modifiers to start
* In the future, `calibration_epoch_end` will be treated like `batch_end`, where it is an opportunity for modifiers to end
* In the future, `finalize` will be treated like `batch_end`, where it is an opportunity for modifiers to end
* For now, these opportunities are implemented manually on each oneshot modifier, rather than as a lifecycle rule
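The intended lifecycle rule can be illustrated with a minimal sketch, assuming hypothetical `on_start`/`on_end` hooks on a oneshot modifier (names other than `on_end` are assumptions, not the real API):

```python
# Illustrative sketch of the future lifecycle rule: treat
# calibration_epoch_start/end like batch_start/end, giving modifiers
# a uniform opportunity to start and to end (removing hooks).

class OneshotModifierSketch:
    def __init__(self):
        self.started = False
        self.hooks = ["pre_hook", "post_hook"]

    def on_start(self):
        self.started = True

    def on_end(self):
        # Remove hooks when calibration finishes.
        self.hooks.clear()


def run_calibration_epoch(modifiers):
    for m in modifiers:
        m.on_start()   # opportunity at calibration_epoch_start
    # ... calibration forward passes would run here ...
    for m in modifiers:
        m.on_end()     # opportunity at calibration_epoch_end / finalize


mods = [OneshotModifierSketch()]
run_calibration_epoch(mods)
```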
## Data Pipeline Changes ##
* Implement data pipeline registry
  * The inferred pipeline is selected based on the active modifiers and can be overridden by the user
* Implement independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current llm-compressor behavior
  * Originally, these compression events were triggered by reaching the end of each module's `initialize` function; now a separate event is required
* Implement `session.get_modifiers`
  * Data pipeline inference (and other sequential pipeline inference) requires the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on `sequential_epoch_end` and `calibration_epoch_end`
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by the sequential pipeline)
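The pipeline pieces above can be sketched together: a registry keyed by name, inference from active modifiers with a user override, an independent pipeline that assigns one sub-pipeline per modifier, and a guard that makes smoothing safe to call repeatedly. All names and the inference heuristic here are hypothetical illustrations, not the actual llm-compressor implementation:

```python
# Hedged sketch of the data-pipeline design described above.
# Registry, inference rule, and class names are assumptions for illustration.

PIPELINE_REGISTRY = {}

def register_pipeline(name):
    def deco(fn):
        PIPELINE_REGISTRY[name] = fn
        return fn
    return deco

@register_pipeline("basic")
def basic_pipeline(modifier, data):
    return f"basic({modifier})"

@register_pipeline("sequential")
def sequential_pipeline(modifier, data):
    return f"sequential({modifier})"

def infer_pipeline(modifiers, user_override=None):
    # The user override wins; otherwise choose based on active modifiers.
    if user_override:
        return PIPELINE_REGISTRY[user_override]
    if any(m == "GPTQModifier" for m in modifiers):
        return PIPELINE_REGISTRY["sequential"]
    return PIPELINE_REGISTRY["basic"]

def independent_pipeline(modifiers, data):
    # Treat each modifier as its own stage with its own inferred pipeline,
    # replicating the per-modifier behavior described above.
    return [infer_pipeline([m])(m, data) for m in modifiers]

class SmoothQuantSketch:
    def __init__(self):
        self.smoothed = set()

    def _apply_smoothing(self, layers):
        # Guard: skip layers already smoothed, so this function can be
        # called once per sequential layer within the same session.
        for layer in layers:
            if layer not in self.smoothed:
                self.smoothed.add(layer)


results = independent_pipeline(["GPTQModifier", "SmoothQuantModifier"], None)

sq = SmoothQuantSketch()
sq._apply_smoothing(["layer0"])
sq._apply_smoothing(["layer0", "layer1"])  # safe to call again
```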
## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
  * There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
  * https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>