guides/quantization/overview.py (13 additions, 14 deletions)

@@ -21,11 +21,10 @@
 * Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
 * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).
 
-> **Terminology**
->
-> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
-> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
-> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
+**Terminology**
+
+* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
+* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
+* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
 
 
 ## Quantization Modes
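
The terminology in the changed block maps directly onto a few lines of array code. The following is a minimal NumPy sketch of symmetric AbsMax quantization, for illustration only: the function names and the int8 clipping range are assumptions of this sketch, not Keras's internal implementation.

```python
import numpy as np

def absmax_scale(x, axis=None):
    """AbsMax calibration: set the scale from the largest magnitude observed.

    axis=None gives a single per-tensor scale; axis=0 gives one scale per
    output channel (per-channel).
    """
    max_abs = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    return max_abs / 127.0  # map onto the symmetric int8 range [-127, 127]

def quantize(x, scale):
    # Symmetric scheme: only a scale is stored; the zero-point is implicitly 0.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 16).astype(np.float32)
w_per_tensor = quantize(w, absmax_scale(w))           # one scale for the tensor
w_per_channel = quantize(w, absmax_scale(w, axis=0))  # one scale per column
```

Per-channel scales track each output channel's own range, which is why they usually round-trip through `dequantize` with less error than a single per-tensor scale.
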
@@ -56,11 +55,11 @@
 * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
 * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.
 
-> **Implementation notes**
->
-> * For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-> * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-> * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+**Implementation notes**
+
+* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
 ## Quantizing Keras Models
 
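
To make the packing note concrete, here is a rough sketch of how two signed 4-bit values (range [-8, 7]) can share one byte, with sign-extension on unpacking. It illustrates the idea only; the actual layout, scale storage, and kernel paths Keras uses may differ.

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of signed 4-bit values (range [-8, 7]) into single bytes."""
    q = q.reshape(-1, 2).astype(np.uint8)  # reinterpret bits, e.g. -2 -> 0xFE
    return ((q[:, 0] & 0x0F) | ((q[:, 1] & 0x0F) << 4)).astype(np.uint8)

def unpack_int4(packed):
    low = ((packed & 0x0F) << 4).astype(np.int8) >> 4  # sign-extend the low nibble
    high = packed.astype(np.int8) >> 4                  # arithmetic shift keeps the sign
    return np.stack([low, high], axis=1).reshape(-1)

q = np.array([-8, 7, 3, -2], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)
# At matmul time, values like these are unpacked and rescaled by the stored
# per-channel scale (e.g. `kernel_scale`) before running through 8-bit kernels.
```
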
@@ -134,8 +133,8 @@
 
 Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.
 
-> **Practical guidance**
->
-> * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
-> * Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
+**Practical guidance**
+
+* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
+* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
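
As a usage sketch of the one-call workflow described above (the preset name, prompt, and timing below are illustrative assumptions, not part of this guide):

```python
import time
import keras_hub

# Any KerasHub preset works the same way; "gemma_2b_en" is only an example choice.
model = keras_hub.models.CausalLM.from_preset("gemma_2b_en")

# A single call quantizes the model in place; "int8" could be swapped for another
# supported mode (e.g. "int4", or GPTQ with a domain-matched calibration set).
model.quantize("int8")

# Measure throughput/latency as well as VRAM: memory savings are immediate,
# while speedups depend on the low-precision kernels available on your device.
start = time.perf_counter()
model.generate("Quantization reduces memory usage by", max_length=64)
print(f"latency: {time.perf_counter() - start:.2f} s")
```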