- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
  - Useful for speedups in high-QPS regimes or offline serving on vLLM.
  - Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM.
  - Useful for speedups in high-QPS regimes or offline serving on vLLM.
  - Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
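Both schemes are applied through a recipe of `Modifiers`. As a rough sketch only — the stage layout, field names, and scheme strings below (`W8A8`, `FP8_DYNAMIC`) are assumptions that may differ across llmcompressor versions — a YAML recipe selecting between them might look like:

```yaml
# Hypothetical recipe sketch; verify field names against your llmcompressor version.
# W8A8-INT8: GPTQ weight quantization (requires calibration data);
# activations are quantized dynamically per token at inference time on vLLM.
int8_stage:
  quant_modifiers:
    GPTQModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "W8A8"
# W8A8-FP8: data-free channel-wise FP8 weights with dynamic per-token activations.
fp8_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "FP8_DYNAMIC"
```

In practice only one stage would be used; the `ignore: ["lm_head"]` entry reflects the common convention of leaving the output head unquantized.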
#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
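To make pruning concrete, here is a minimal, self-contained sketch of 2:4 semi-structured magnitude pruning (one pattern used by SparseGPT-style pruners). The function name is illustrative, and real pruners also adjust the surviving weights to compensate for the induced error, which this toy omits:

```python
def prune_2_4(weights):
    """Zero out the two smallest-magnitude values in each group of four.

    A minimal sketch of 2:4 semi-structured sparsity: exactly two of every
    four consecutive weights are kept, so half the parameters become zero.
    """
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group of four
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

print(prune_2_4([0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]))
# → [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
```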
`src/llmcompressor/entrypoints/README.md`
# Compression and Fine-tuning Entrypoint
## Oneshot
An ideal compression technique reduces memory footprint while maintaining accuracy. One-shot in LLM-Compressor supports faster inference on vLLM by applying post-training quantization (PTQ) or sparsification.
## Code
Example scripts for all the above formats are located in the [examples](../../../examples/) folder. The [W8A8-FP8](../../../examples/quantization_w8a8_fp8/llama3_example.py) example is shown below:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

...  # (excerpt truncated; see the linked example for the full script)

oneshot(
    ...,
    output_dir="./oneshot_model",  # Automatically save the safetensors, config, and recipe; weights are saved in a compressed format
)
```
### Lifecycle
The oneshot calibration lifecycle consists of three steps:
1. **Preprocessing**:
   - Patches the model to include additional functionality for saving with quantization configurations.
2. **Oneshot Calibration**:
   - Compresses the model based on the recipe (instructions for optimizing the model). The recipe defines the `Modifiers` (e.g., `GPTQModifier`, `SparseGPTModifier`) to apply, which contain the logic for how to quantize or sparsify a model.
3. **Postprocessing**:
   - Saves the model, tokenizer/processor, and configuration to the specified `output_dir`.
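The three steps above can be pictured with a deliberately simplified, hypothetical sketch — none of these names reflect the real llmcompressor internals, and the "compression" here is just rounding:

```python
class RoundingModifier:
    """Toy stand-in for a recipe Modifier (e.g., GPTQModifier): it
    "compresses" weights by rounding them to one decimal place."""
    def apply(self, model):
        model["weights"] = [round(w, 1) for w in model["weights"]]

def oneshot_sketch(model, recipe):
    # 1. Preprocessing: patch the model so it can be saved with quantization configs
    model["patched_for_saving"] = True
    # 2. Oneshot calibration: apply each Modifier defined by the recipe
    for modifier in recipe:
        modifier.apply(model)
    # 3. Postprocessing: hand the model back for saving to output_dir
    return model

model = oneshot_sketch({"weights": [0.123, -0.456]}, [RoundingModifier()])
print(model["weights"])
```

The point of the sketch is only the ordering: the model is patched before any Modifier runs, and saving happens after all Modifiers have been applied.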