[Multi-modifier] Support scoped application of quantization config/status (#1772)
SUMMARY:
Prerequisites:
* neuralmagic/compressed-tensors#432
This allows for multi-modifier support by scoping the application of
quantization config/status to only the modules in the model that match
the given targets/ignore configuration, rather than all modules.
Initialization of observers is moved to `on_start` (instead of
`on_initialize`) to match their removal in `on_end` (and not `on_finalize`). This
prevents collisions during the multi-modifier lifecycle.
- [x] Update AWQ
- [x] Update QuantizationModifier
- [x] Update QuantizationMixin
- [x] Update GPTQ
- [x] No other quantization modifiers exist
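For illustration, scoped application means a recipe can carry two quantization modifiers whose targets/ignore partition the model, with each modifier only attaching quantization config and observers to its own matched modules. The sketch below assumes the `AWQModifier`/`GPTQModifier` `targets`/`scheme`/`ignore` arguments and illustrative regex targets; it is not the PR's example script.

```python
# Minimal sketch: two modifiers scoped to disjoint parts of the model.
# Regex targets and the scheme-to-layer mapping are illustrative assumptions.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # W8A8 (int8 weights + activations) applied only to self-attention projections
    GPTQModifier(
        targets=[r"re:.*self_attn\.(q|k|v|o)_proj$"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
    # W4A16 (int4 weights, 16-bit activations) applied only to MLP projections
    AWQModifier(
        targets=[r"re:.*mlp\.(gate|up|down)_proj$"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
]
```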
TEST PLAN:
- Tests were added to
neuralmagic/compressed-tensors#432 to confirm
correct application of multiple modifiers.
- Added an example in this PR showing how AWQ and GPTQ can be applied
heterogeneously to a model, along with a small README. Logs show
alternating AWQ and GPTQ messages for the `"sequential"` pipeline and correct
behavior for the `"independent"` pipeline. The [model
checkpoint](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-selfattn-w8a8-mlp-w4a16-sequential/tree/main)
for the sequential pipeline shows correct application of W8A8 to
self_attn layers and W4A16 to mlp layers; config.json and the safetensors
weights all look as expected.
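For reference, running a recipe like the one sketched above reduces to a single `oneshot` call. The snippet below is a sketch with placeholder dataset and calibration settings, not the exact contents of `quantization_multiple_modifiers.py`.

```python
# Sketch: run the heterogeneous recipe in one calibrated pass.
# Dataset and calibration settings are placeholders, not the PR's exact values.
from llmcompressor import oneshot

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    pipeline="sequential",  # or "independent" to give each modifier its own calibrated run
)
```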
---------
Signed-off-by: Brian Dellabetta <[email protected]>
examples/quantization_non_uniform/README.md (9 additions & 0 deletions)
@@ -9,3 +9,12 @@ We demonstrate mixed precision by quantizing models to both int8 and int4, and i
 ## Multiple Strategies
 
 It may also be interesting to quantize a model with two different [quantization strategies](https://github.com/neuralmagic/compressed-tensors/blob/a2bfc03e9d52824ba5d6d2a50c8741dd9bccd5d3/src/compressed_tensors/quantization/quant_args.py#L93) such as group, channel, or per-tensor. [Here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_non_uniform/quantization_fp8_multiple_strategies.py) we apply fp8 quantization where all the attention weights are quantized using the per-channel strategy, and all the mlp weights are quantized using per-tensor. This is accomplished through defining multiple config groups in the recipe. The produced model is compressed using the `float-quantized` compressor and can be directly run in vllm.
+
+## Quantization with Multiple Quantization Modifiers
+
+This section outlines how multiple quantization modifiers can be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This heterogeneous application of multiple modifiers comes in 2 flavors:
+
+1. Run every modifier in a single, sequential pipeline, performing a single calibrated run. See `./quantization_multiple_modifiers.py` for an example.
+2. Run each modifier in its own, independent pipeline, performing a calibrated run for each modifier. To run each modifier independently, run `./quantization_multiple_modifiers.py` with `oneshot(..., pipeline="independent")` instead of `pipeline="sequential"`.
+
+This is an advanced usage of `llm-compressor` and an active area of research. Best practices will be provided in a future release, after further research and sensitivity analysis.
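As a companion to the "Multiple Strategies" section in the README diff above, a recipe with two config groups might look roughly like the sketch below. It assumes the `QuantizationModifier(config_groups=...)` interface with `QuantizationScheme`/`QuantizationArgs` from compressed-tensors; group names and regex targets are illustrative, and `quantization_fp8_multiple_strategies.py` remains the maintained example.

```python
# Sketch: one QuantizationModifier with two fp8 config groups, varying the
# weight strategy (per-channel for attention, per-tensor for MLP).
# Group names and targets are illustrative assumptions.
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import QuantizationModifier

# Dynamic per-token fp8 activation quantization, shared by both groups
fp8_dynamic_act = QuantizationArgs(
    num_bits=8, type="float", strategy="token", dynamic=True, symmetric=True
)

recipe = QuantizationModifier(
    config_groups={
        "group_attn": QuantizationScheme(
            targets=[r"re:.*self_attn\.(q|k|v|o)_proj$"],
            weights=QuantizationArgs(num_bits=8, type="float", strategy="channel", symmetric=True),
            input_activations=fp8_dynamic_act,
        ),
        "group_mlp": QuantizationScheme(
            targets=[r"re:.*mlp\.(gate|up|down)_proj$"],
            weights=QuantizationArgs(num_bits=8, type="float", strategy="tensor", symmetric=True),
            input_activations=fp8_dynamic_act,
        ),
    },
    ignore=["lm_head"],
)
```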