
Commit 81271b5

Update: Readme for fp8 support (#1304)
This PR updates the README to clarify that FP8 is only supported on GPUs with CUDA Compute Capability ≥ 9.0, such as NVIDIA's Hopper and Blackwell architectures. GPUs based on Ada Lovelace (Compute Capability 8.9) do not support FP8.

Signed-off-by: Rahul Tuli <[email protected]>
1 parent d2263cd commit 81271b5

File tree

2 files changed (+28, -23 lines)


README.md

Lines changed: 1 addition & 23 deletions
@@ -25,29 +25,7 @@
 
 ### When to Use Which Optimization
 
-#### PTQ
-PTQ is performed to reduce the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:
-
-##### [W4A16](./examples/quantization_w4a16/README.md)
-- Uses GPTQ to compress weights to 4 bits. Requires calibration dataset.
-- Useful speed ups in low QPS regimes with more weight compression.
-- Recommended for any GPUs types.
-##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
-- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Requires calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
-- Useful for speed ups in high QPS regimes or offline serving on vLLM.
-- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
-##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
-- Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits. Does not require calibration dataset. Activation quantization is carried out during inference on vLLM.
-- Useful for speed ups in high QPS regimes or offline serving on vLLM.
-- Recommended for NVIDIA GPUs with compute capability >8.9 (Hopper and Ada Lovelace).
-
-#### Sparsification
-Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
-
-##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
-- Uses (1) semi-structured sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero. (2) Uses channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
-- Useful for better inference than W8A8-fp8, with almost no drop in its evaluation score [blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/). Note: Small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution.
-- Recommended for compute capability >8.9 (Hopper and Ada Lovelace).
+Please refer to [docs/schemes.md](./docs/schemes.md) for detailed information about available optimization schemes and their use cases.
 
 
 ## Installation

docs/schemes.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Optimization Schemes
+
+## PTQ
+PTQ is performed to reduce the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:
+
+### [W4A16](../examples/quantization_w4a16/README.md)
+- Uses GPTQ to compress weights to 4 bits. Requires a calibration dataset.
+- Useful for speed ups in low-QPS regimes, with more weight compression.
+- Recommended for any GPU type.
+
+### [W8A8-INT8](../examples/quantization_w8a8_int8/README.md)
+- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
+- Useful for speed ups in high-QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
+
+### [W8A8-FP8](../examples/quantization_w8a8_fp8/README.md)
+- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM.
+- Useful for speed ups in high-QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell).
+
+## Sparsification
+Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
+
+### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](../examples/sparse_2of4_quantization_fp8/README.md)
+- Uses (1) semi-structured sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero, and (2) channel-wise quantization to compress weights to 8 bits plus dynamic per-token quantization to compress activations to 8 bits.
+- Useful for better inference performance than W8A8-FP8, with almost no drop in evaluation score ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution.
+- Recommended for compute capability >=9.0 (Hopper and Blackwell).
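
For context on how the W4A16 scheme above is applied in practice: the linked example README drives it through llm-compressor's `oneshot` entry point with a `GPTQModifier`. The following is a minimal sketch of that flow, not part of this commit; the model ID, the `open_platypus` calibration alias, and the exact import paths are illustrative and may differ between llm-compressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases also expose llmcompressor.oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ compresses the Linear weights to 4 bits; a calibration dataset is required.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # calibration data for the GPTQ weight pass
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = "TinyLlama-1.1B-Chat-v1.0-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The W8A8-INT8 scheme follows the same pattern with an INT8 GPTQ scheme; per its description above, it still needs calibration data for the weight pass, while activation quantization happens at inference time in vLLM.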

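The W8A8-FP8 scheme that this commit's README change is about needs no calibration data, so the flow reduces to a data-free `oneshot` call with a `QuantizationModifier`. Again a hedged sketch, assuming the `FP8_DYNAMIC` preset used in the linked example (channel-wise FP8 weights, dynamic per-token FP8 activations applied at inference by vLLM); the model ID is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: channel-wise FP8 weights plus dynamic per-token FP8 activations.
# No dataset argument is passed because this scheme needs no calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Per the commit message, the resulting FP8 checkpoint targets GPUs with CUDA Compute Capability >= 9.0 (Hopper, Blackwell); Ada Lovelace (8.9) and older parts do not support FP8.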
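
The 2:4-sparsity-with-FP8 scheme combines the two steps its bullets describe: a SparseGPT pruning pass (two of every four contiguous weights zeroed) followed by the same FP8 weight/activation quantization as W8A8-FP8. A minimal sketch, assuming the `SparseGPTModifier` + `QuantizationModifier` pairing used in the linked example; modifier arguments, dataset, and import paths are illustrative and may vary by version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = [
    # Step 1: 2:4 semi-structured sparsity -- 50% of weights pruned in a 2-of-4 pattern.
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    # Step 2: FP8 weights + dynamic per-token FP8 activations, as in W8A8-FP8.
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",      # SparseGPT needs calibration data to choose which weights to prune
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-2of4-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

As with W8A8-FP8, this layout is recommended above only for compute capability >= 9.0 GPUs; the linked blog reports results on Hopper-class hardware.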