guides/ipynb/quantization_overview.ipynb (4 additions, 4 deletions)

@@ -31,7 +31,7 @@
 "* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.\n",
 "* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).\n",
 "\n",
-"**Terminology**\n",
+"### Terminology\n",
 "* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.\n",
 "* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.\n",
 "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).\n",
@@ -67,9 +67,9 @@
 "\n",
 "### Implementation notes\n",
 "\n",
-"* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
-"* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
-"* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n",
+"* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.\n",
+"* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
+"* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
guides/md/quantization_overview.md (10 additions, 10 deletions)

@@ -24,11 +24,11 @@ At a high level, Keras supports:
 * Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
 * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).
 
-> **Terminology**
->
-> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
-> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
-> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
+### Terminology
+
+* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
+* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
+* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
 
 ---
 
@@ -38,7 +38,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
-* **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below).
+* **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are quantized using dynamic quantization (see note below).
 * **Why use it:** Good accuracy for many architectures; broad hardware support.
 * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. Throughput gains depend on kernel availability and memory bandwidth.
 
@@ -48,7 +48,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int).
 * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks.
 
-* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8).
+* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8) and use dynamic quantization.
 
 * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
 * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
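
As an aside on the `float8` entry in the hunk above, here is a speculative round-trip sketch using the `ml_dtypes` package; this is an assumption for illustration, since the actual Keras FP8 path manages scales and uses hardware kernels that plain NumPy casting does not reproduce.

```python
# Hypothetical illustration of float8 (e4m3) semantics with ml_dtypes;
# NOT the Keras FP8 implementation.
import numpy as np
import ml_dtypes

x = np.random.randn(8).astype(np.float32)
scale = np.abs(x).max() / 448.0            # 448 = largest finite float8_e4m3fn
x_fp8 = (x / scale).astype(ml_dtypes.float8_e4m3fn)  # rounds to 3 mantissa bits
x_hat = x_fp8.astype(np.float32) * scale   # dequantize; keeps float semantics
print(np.abs(x - x_hat).max())             # small, format-dependent error
```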
@@ -62,9 +62,9 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
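
The 4-bit packing note is concrete enough to illustrate directly. Below is a sketch of packing two signed nibbles per int8 byte under the [-8, 7] range stated above; `pack_int4` and `unpack_int4` are illustrative names, not the Keras API, and real kernels also carry the per-channel scales alongside the packed bytes.

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of int4 values (range [-8, 7]) into int8 bytes."""
    q = q.astype(np.int8)
    lo = q[0::2] & 0x0F                  # low nibble: two's-complement bits, masked
    hi = (q[1::2] & 0x0F) << 4           # high nibble (wraparound gives right bits)
    return (lo | hi).astype(np.int8)

def unpack_int4(packed):
    """Recover signed int4 values; arithmetic shifts restore the sign."""
    lo = (packed.astype(np.int8) << 4) >> 4
    hi = packed.astype(np.int8) >> 4
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-8, 7, -1, 3], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(vals)), vals)
```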
guides/quantization_overview.py (3 additions, 2 deletions)

@@ -58,8 +58,9 @@
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
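
To make the dynamic activation quantization note concrete, here is a hedged NumPy sketch of the runtime flow it describes: weights are quantized once with per-channel scales, while the activation scale is recomputed from each batch with AbsMax. `dynamic_int8_matmul` is an illustrative helper, not a Keras function.

```python
import numpy as np

def dynamic_int8_matmul(x, w_q, w_scale):
    """int8 x int8 matmul with a per-batch, per-tensor activation scale."""
    x_scale = np.max(np.abs(x)) / 127.0              # AbsMax, computed at runtime
    x_q = np.clip(np.round(x / x_scale), -128, 127).astype(np.int8)
    # Accumulate in int32, then rescale back to float with both scales.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * x_scale * w_scale  # w_scale is per-channel

# Weights are quantized once, offline, with per-output-channel scales.
w = np.random.randn(16, 4).astype(np.float32)
w_scale = np.abs(w).max(axis=0) / 127.0
w_q = np.clip(np.round(w / w_scale), -128, 127).astype(np.int8)

x = np.random.randn(2, 16).astype(np.float32)        # scale adapts per batch
print(np.abs(dynamic_int8_matmul(x, w_q, w_scale) - x @ w).max())
```

The printed value is the absolute error against the float32 matmul; it stays small because both scales track the actual data ranges, with no fixed calibration set to maintain.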