25 | 25 |
26 | 26 | ### When to Use Which Optimization |
27 | 27 |
28 | | -#### PTQ |
29 | | -Post-training quantization (PTQ) reduces the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:
30 | | - |
31 | | -##### [W4A16](./examples/quantization_w4a16/README.md) |
32 | | -- Uses GPTQ to compress weights to 4 bits. Requires a calibration dataset.
33 | | -- Useful for speedups in low-QPS regimes, thanks to the greater weight compression.
34 | | -- Recommended for any GPU type.
35 | | -##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md) |
36 | | -- Uses GPTQ with channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
37 | | -- Useful for speedups in high-QPS regimes or offline serving on vLLM.
38 | | -- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
39 | | -##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md) |
40 | | -- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM.
41 | | -- Useful for speedups in high-QPS regimes or offline serving on vLLM.
42 | | -- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
43 | | - |
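To make the PTQ options above concrete, here is a minimal sketch of the data-free W8A8-FP8 path, loosely based on the linked `examples/quantization_w8a8_fp8` README. The model ID is illustrative, and the exact import paths and scheme names may vary between `llmcompressor` releases, so treat this as an outline rather than the canonical example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases may expose `from llmcompressor import oneshot`

# Illustrative model choice; any HF causal LM supported by llm-compressor should work similarly.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights (channel-wise) plus dynamic per-token FP8 activations; no calibration data needed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot application of the recipe, then save a checkpoint that vLLM can serve.
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

The W4A16 and W8A8-INT8 paths follow the same shape but swap in a GPTQ-based modifier and pass a calibration dataset to `oneshot`, as described in their example READMEs.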
44 | | -#### Sparsification |
45 | | -Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include: |
46 | | - |
47 | | -##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md) |
48 | | -- Combines (1) semi-structured sparsity (SparseGPT), where two out of every four contiguous weights in a tensor are set to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
49 | | -- Useful for better inference performance than W8A8-FP8, with almost no drop in evaluation accuracy ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to preserve the original weight distribution.
50 | | -- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
| 28 | +Please refer to [docs/schemes.md](./docs/schemes.md) for detailed information about available optimization schemes and their use cases. |
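For the 2:4-sparsity plus FP8 scheme above, the recipe chains a SparseGPT pruning step with the same FP8 quantization modifier. The sketch below is again only an outline: the model ID, dataset alias, and calibration sample counts are placeholders, and the linked `examples/sparse_2of4_quantization_fp8` example may stage or parameterize these modifiers differently:

```python
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Prune two of every four contiguous weights with SparseGPT, then quantize to FP8
# (channel-wise weights, dynamic per-token activations).
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

# Unlike the FP8-only path, the SparseGPT step requires calibration data.
# Model, dataset, and sample counts below are placeholders.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-2of4-FP8-Dynamic",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

As with the previous sketch, the saved checkpoint is intended to be served directly by vLLM.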
51 | 29 |
52 | 30 |
53 | 31 | ## Installation |