# Modifiers Overview

A `Modifier` in `llm-compressor` is an algorithm that can be applied to a model to change
its state in some way. Some modifiers can be applied during one-shot, while others
are relevant only during training. Below is a summary of the key modifiers available.

## Pruning Modifiers

Modifiers that introduce sparsity into a model.

### [SparseGPT](./obcq/base.py)
One-shot algorithm that uses calibration data to introduce unstructured or structured
sparsity into weights. The implementation is based on [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774). A small amount of calibration data is used
to calculate a Hessian for each layer's input activations; this Hessian is then used to
solve a regression problem that minimizes the error introduced by a target sparsity. The algorithm
incurs significant memory overhead from storing the Hessians.

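A minimal sketch of how this modifier might be applied in one shot (the model, dataset, and
argument names here are illustrative and may differ between `llm-compressor` versions):

```python
# Hedged sketch: prune to 50% sparsity with a 2:4 mask using SparseGPT.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

recipe = SparseGPTModifier(
    sparsity=0.5,          # fraction of weights to remove
    mask_structure="2:4",  # "0:0" selects unstructured sparsity
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",  # calibration data used to build the Hessians
    recipe=recipe,
    num_calibration_samples=512,
)
```
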
### [WANDA](./pruning/wanda/base.py)
One-shot algorithm that uses calibration data to introduce unstructured or structured sparsity. The implementation is
based on [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/pdf/2306.11695).
Calibration data is used to calculate the magnitude of input activations for each layer, and weights
are pruned based on this magnitude combined with their distance from 0. This requires less
memory and computation than SparseGPT, but results in lower accuracy in many cases.

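The scoring rule can be illustrated with a small standalone sketch (for intuition only, not
the library implementation): each weight is ranked by its magnitude multiplied by the norm of
the corresponding input channel, and the lowest-scoring weights in each row are zeroed.

```python
# Illustrative sketch of the WANDA scoring rule.
import torch

def wanda_prune(weight: torch.Tensor, calib_inputs: torch.Tensor, sparsity: float) -> torch.Tensor:
    """weight: [out_features, in_features]; calib_inputs: [num_samples, in_features]."""
    act_norm = calib_inputs.norm(p=2, dim=0)    # per-input-channel L2 norm
    scores = weight.abs() * act_norm            # elementwise importance
    k = int(weight.shape[1] * sparsity)         # weights to prune in each output row
    prune_idx = scores.topk(k, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```
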
### [Magnitude Pruning](./pruning/magnitude/base.py)
Naive one-shot pruning algorithm that does not require any calibration data. Weights are
pruned based solely on their distance from 0 up to the target sparsity.

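For intuition, this amounts to zeroing the weights with the smallest absolute values in a
tensor until the target sparsity is reached; a minimal sketch:

```python
# Minimal sketch of magnitude pruning; no calibration data is involved.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = max(int(sparsity * weight.numel()), 1)              # number of weights to remove
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return weight * (weight.abs() > threshold)
```
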
## Quantization Modifiers

Modifiers that quantize weights or activations of a model.

### [Basic Quantization](./quantization/quantization/base.py)
One-shot algorithm that quantizes weights, input activations, and/or output activations by
calculating a range from weights or calibration data. All data is quantized to the closest
bin using a scale and (optional) zero point. This basic quantization algorithm is
suitable for FP8 quantization. A variety of quantization schemes are supported via the
[compressed-tensors](https://github.com/neuralmagic/compressed-tensors) library.

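As a sketch in the spirit of the project's examples (treat the scheme and argument names as
version-dependent), an FP8 scheme with dynamic activation quantization needs no calibration
dataset and can be applied directly:

```python
# Sketch: FP8 weight and dynamic activation quantization of all Linear layers,
# skipping the lm_head. Exact scheme and argument names may vary by version.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", recipe=recipe)
```
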
### [GPTQ](./quantization/gptq/base.py)
One-shot algorithm that uses calibration data to select the ideal bin for weight quantization.
This algorithm is applied on top of the basic quantization algorithm, and affects weights only.
The implementation is based on [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/pdf/2210.17323). The algorithm is very similar to SparseGPT: a small amount of calibration data is used
to calculate a Hessian for each layer's input activations; this Hessian is then used to
solve a regression problem that minimizes the error introduced by a given quantization configuration. The algorithm
incurs significant memory overhead from storing the Hessians.

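A hedged sketch of 4-bit weight-only (W4A16) quantization with GPTQ, which, unlike the FP8
example above, requires a calibration dataset to build the Hessians (names are illustrative
and may differ between versions):

```python
# Sketch: 4-bit weight-only quantization via GPTQ.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
)
```
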
## "Helper" Modifiers

These modifiers do not introduce sparsity or quantization themselves, but are used
in conjunction with one of the above modifiers to improve the accuracy of the compressed model.

### [SmoothQuant](./smoothquant/base.py)
This modifier is intended to be used prior to a `QuantizationModifier` or `GPTQModifier`. Its purpose is
to make input activations easier to quantize by smoothing away outliers in the inputs and applying the inverse
smoothing operation to the following weights. This makes the weights slightly harder to quantize, but the inputs much
easier. The implementation is based on [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/pdf/2211.10438) and requires calibration data.

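Because it only rebalances scales, SmoothQuant is typically chained in front of a quantization
modifier within the same recipe; a hedged sketch (parameter values are illustrative):

```python
# Sketch: smooth activation outliers, then quantize weights and activations to INT8.
# smoothing_strength controls how much difficulty is shifted from activations to weights.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
)
```
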
### [Logarithmic Equalization](./logarithmic_equalization/base.py)
Very similar to `SmoothQuantModifier`, but applies smoothing on an inverse log scale
rather than the linear smoothing done by SmoothQuant. The implementation is based on
[FPTQ: Fine-grained Post-Training Quantization for Large Language Models](https://arxiv.org/pdf/2308.15987).

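Conceptually, the only change relative to SmoothQuant is how the per-channel smoothing scales
are derived from the observed activation maxima; a rough sketch of the log-scale rule (an
assumption based on the FPTQ formulation, so verify against the paper and source):

```python
# Conceptual sketch (assumed form): channels with larger activation maxima are
# smoothed more aggressively than a linear rule would allow, via a log2 denominator.
import torch

def log_equalization_scales(activation_max: torch.Tensor) -> torch.Tensor:
    # activation_max: per-channel maximum of |X| gathered from calibration data
    return activation_max / torch.log2(2 + activation_max)
```
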
### [Constant Pruning](./pruning/constant/base.py)
One-shot pruning algorithms often introduce accuracy degradation that can be recovered with finetuning. This
modifier ensures that the sparsity mask of the model is maintained during finetuning, allowing a sparse
model to recover accuracy while preserving its sparsity structure. It is intended to be used after a pruning modifier
such as `SparseGPT` or `WANDA` has already been applied.

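A hedged sketch of how this might look in a finetuning run (the training entrypoint and the
regex target patterns below are assumptions, not verified API):

```python
# Sketch: finetune a previously pruned model while reapplying its sparsity masks
# every step so that pruned weights stay at zero. Names below are placeholders.
from llmcompressor.modifiers.pruning import ConstantPruningModifier
from llmcompressor.transformers import train

recipe = ConstantPruningModifier(
    targets=["re:.*self_attn.*weight", "re:.*mlp.*weight"],  # example regex targets
)

train(
    model="path/to/pruned-model",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="./finetuned-sparse-model",
)
```
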
### [Distillation](./distillation/output/base.py)
To better recover the accuracy of sparse models during finetuning, we can also use a teacher model of the same architecture
to influence the loss. This modifier is intended to be used in conjunction with the `ConstantPruning` modifier on a
pruned model, with the dense version of the model serving as the teacher. Both output distillation loss and
layer-by-layer distillation loss are supported. The layer-by-layer implementation follows the SquareHead distillation
algorithm presented in [Sparse Fine-tuning for Inference Acceleration of Large Language Models](https://arxiv.org/pdf/2310.06927).

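For the output-distillation case, the extra loss term is conceptually a temperature-softened
KL divergence between the student's and teacher's logits, added to the regular task loss; a
minimal standalone sketch (not the library's implementation):

```python
# Conceptual sketch of an output-distillation loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```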