# Modifiers Overview

A `Modifier` in `llm-compressor` is an algorithm that can be applied to a model to change
its state in some way. Some modifiers can be applied during one-shot, while others
are relevant only during training. Below is a summary of the key modifiers available.

## Pruning Modifiers

Modifiers that introduce sparsity into a model.

### [SparseGPT](./obcq/base.py)
One-shot algorithm that uses calibration data to introduce unstructured or structured
sparsity into weights. The implementation is based on [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774). A small amount of calibration data is used
to calculate a Hessian for each layer's input activations; this Hessian is then used to
solve a regression problem that minimizes the error introduced by the target sparsity. This algorithm
incurs a significant memory overhead from storing the Hessians.
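
For reference, a minimal sketch of applying this modifier in a one-shot run. The import paths, parameter names, and the example model/dataset are assumptions based on common `llm-compressor` usage and may differ between versions:

```python
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

# Prune to 50% sparsity with a 2:4 structured mask; the calibration
# samples are what build the per-layer Hessians described above.
recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")

oneshot(
    model="facebook/opt-125m",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
)
```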

### [WANDA](./pruning/wanda/base.py)
One-shot algorithm that uses calibration data to introduce unstructured or structured sparsity. The implementation is
based on [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/pdf/2306.11695).
Calibration data is used to calculate the magnitude of input activations for each layer, and weights
are pruned based on this magnitude combined with their distance from 0. This requires less
memory and computation than SparseGPT, but reduces accuracy in many cases.
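
The scoring rule itself is compact. Below is an illustrative PyTorch sketch of the metric as described in the paper (an assumption for clarity, not the modifier's actual code): each weight is scored by its magnitude times the norm of its input channel's calibration activations, and the lowest-scoring weights are pruned.

```python
import torch

def wanda_scores(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    # weight: (out_features, in_features)
    # act_norm: per-input-channel activation norms from calibration data,
    # shape (in_features,). Low scores are candidates for pruning.
    return weight.abs() * act_norm.unsqueeze(0)
```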

### [Magnitude Pruning](./pruning/magnitude/base.py)
Naive one-shot pruning algorithm that does not require any calibration data. Weights are
pruned based solely on their distance from 0 up to the target sparsity.
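
The core idea is simple enough to show directly. Here is an illustrative PyTorch sketch (not the modifier's actual implementation) that zeros the smallest-magnitude fraction of a weight tensor:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero the `sparsity` fraction of entries closest to 0.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)
```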

## Quantization Modifiers

Modifiers that quantize the weights or activations of a model.

### [Basic Quantization](./quantization/quantization/base.py)
One-shot algorithm that quantizes weights, input activations, and/or output activations by
calculating a range from weights or calibration data. All data is quantized to the closest
bin using a scale and (optional) zero point. This basic quantization algorithm is
suitable for FP8 quantization. A variety of quantization schemes are supported via the
[compressed-tensors](https://github.com/neuralmagic/compressed-tensors) library.
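
To make the "closest bin" step concrete, here is an illustrative sketch of affine quantization with a scale and optional zero point (not the library's actual code; real schemes come from `compressed-tensors`):

```python
import torch

def quantize(x: torch.Tensor, scale: float, zero_point: int = 0,
             qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    # Map each value to its nearest bin, then clamp to the representable range.
    return torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)

def dequantize(q: torch.Tensor, scale: float, zero_point: int = 0) -> torch.Tensor:
    # Recover an approximation of the original value.
    return (q - zero_point) * scale
```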

### [GPTQ](./quantization/gptq/base.py)
One-shot algorithm that uses calibration data to select the ideal bin for weight quantization.
This algorithm is applied on top of the basic quantization algorithm, and affects weights only.
The implementation is based on [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/pdf/2210.17323). The algorithm is very similar to SparseGPT: a small amount of calibration data is used
to calculate a Hessian for each layer's input activations; this Hessian is then used to
solve a regression problem that minimizes the error introduced by a given quantization configuration. Like SparseGPT, this algorithm
incurs a significant memory overhead from storing the Hessians.
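
A minimal usage sketch (the import path, parameter names, and scheme string are assumptions and may differ between versions):

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# 4-bit weight-only quantization of all Linear layers, skipping the LM head;
# calibration data drives the per-weight bin selection described above.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
```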
## "Helper" Modifiers
49+
50+
These modifiers do not introduce sparsity or quantization themselves, but are used
51+
in conjunction with one of the above modifiers to improve their accuracy.
52+
53+

### [SmoothQuant](./smoothquant/base.py)
This modifier is intended to be used prior to a `QuantizationModifier` or `GPTQModifier`. Its purpose is
to make input activations easier to quantize by smoothing away outliers in the inputs and applying the inverse
smoothing operation to the following weights. This makes the weights slightly harder to quantize, but the inputs much
easier to quantize. The implementation is based on [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/pdf/2211.10438) and requires calibration data.
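
A sketch of chaining the two modifiers in a single recipe (the smoothing strength and scheme values are illustrative assumptions):

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant first migrates activation outliers into the weights,
# then GPTQ quantizes the smoothed weights.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```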

### [Logarithmic Equalization](./logarithmic_equalization/base.py)
Very similar to `SmoothQuantModifier`, but applies smoothing on an inverse log scale
rather than the linear smoothing done by SmoothQuant. The implementation is based on
[FPTQ: Fine-grained Post-Training Quantization for Large Language Models](https://arxiv.org/pdf/2308.15987).
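
One hedged reading of the "inverse log scale", following the FPTQ paper (this is an assumption about the exact rule, not a quote of the modifier's code):

```python
import torch

def log_equalization_scales(activation_max: torch.Tensor) -> torch.Tensor:
    # Per-channel smoothing scales: large activation outliers are divided
    # down logarithmically rather than linearly.
    return activation_max / torch.log2(2 + activation_max)
```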

### [Constant Pruning](./pruning/constant/base.py)
One-shot pruning algorithms often introduce accuracy degradation that can be recovered with finetuning. This
modifier ensures that the sparsity mask of the model is maintained during finetuning, allowing a sparse
model to recover accuracy while maintaining its sparsity structure. It is intended to be used after a pruning modifier
such as `SparseGPT` or `WANDA` has already been applied.
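
A minimal sketch of pinning the mask during finetuning (the import path is assumed, and the regex targets are hypothetical; they should match the layers that were pruned):

```python
from llmcompressor.modifiers.pruning import ConstantPruningModifier

# Reapply the existing zero mask throughout training so the sparsity
# pattern produced by SparseGPT/WANDA is never lost.
recipe = ConstantPruningModifier(
    targets=["re:.*q_proj.weight", "re:.*k_proj.weight", "re:.*v_proj.weight"],
    start=0,
)
```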

### [Distillation](./distillation/output/base.py)
To better recover the accuracy of sparse models during finetuning, we can also use a teacher model of the same architecture
to influence the loss. This modifier is intended to be used in conjunction with the `ConstantPruning` modifier on a
pruned model, with the dense version of the model serving as the teacher. Both output distillation loss and
layer-by-layer distillation loss are supported. The layer-by-layer implementation follows the Square Head distillation
algorithm presented in [Sparse Fine-tuning for Inference Acceleration of Large Language Models](https://arxiv.org/pdf/2310.06927).
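
A hedged sketch of combining the two finetuning modifiers (class names, parameters, and targets are assumptions based on common `llm-compressor` recipes):

```python
from llmcompressor.modifiers.distillation import OutputDistillationModifier
from llmcompressor.modifiers.pruning import ConstantPruningModifier

# Keep the sparsity mask fixed while a dense teacher's layer outputs
# shape the loss via Square Head distillation.
recipe = [
    ConstantPruningModifier(targets=["re:.*proj.weight"], start=0),
    OutputDistillationModifier(
        targets=["re:model.layers.\\d+$"], comparison="square_head"
    ),
]
```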
