
Commit e5572b9: improve formatting and add missing note
Parent: 791043f

3 files changed (+17 -16 lines)

guides/ipynb/quantization_overview.ipynb

Lines changed: 4 additions & 4 deletions
@@ -31,7 +31,7 @@
 "* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.\n",
 "* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).\n",
 "\n",
-"**Terminology**\n",
+"### Terminology\n",
 "* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.\n",
 "* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.\n",
 "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).\n",
@@ -67,9 +67,9 @@
 "\n",
 "### Implementation notes\n",
 "\n",
-"* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
-"* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
-"* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n",
+"* **Dynamic activation quantization**: In the `int4` and `int8` PTQ paths, activation scales are computed on the fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.\n",
+"* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
+"* **Calibration strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
 "\n",
 "## Quantizing Keras Models\n",
 "\n",

guides/md/quantization_overview.md

Lines changed: 10 additions & 10 deletions
@@ -24,11 +24,11 @@ At a high level, Keras supports:
 * Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
 * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).
 
-> **Terminology**
->
-> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
-> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
-> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
+### Terminology
+
+* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
+* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
+* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
 
 ---
 
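The *per-channel vs per-tensor* bullet is easy to verify numerically. The following illustrative NumPy experiment (synthetic weights, not a Keras internal) shows per-channel scales cutting quantization error when channel magnitudes differ widely:

```python
import numpy as np

rng = np.random.default_rng(0)
# Weight matrix whose output channels have very different magnitudes.
w = rng.standard_normal((64, 8)) * np.array([0.01, 0.1, 1, 2, 4, 8, 16, 32])

def quant_error(w, scale):
    q = np.clip(np.round(w / scale), -127, 127)
    return np.mean(np.abs(q * scale - w))

per_tensor = np.max(np.abs(w)) / 127.0                          # one scale for the whole tensor
per_channel = np.max(np.abs(w), axis=0, keepdims=True) / 127.0  # one scale per output channel

print("per-tensor  MAE:", quant_error(w, per_tensor))
print("per-channel MAE:", quant_error(w, per_channel))
```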
@@ -38,7 +38,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 * **`int8` (8-bit integer)**: **joint weight + activation** PTQ.
 
-  * **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below).
+  * **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are quantized using dynamic quantization (see note below).
   * **Why use it:** Good accuracy for many architectures; broad hardware support.
   * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. Throughput gains depend on kernel availability and memory bandwidth.
 
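As a quick sanity check on the "~4x smaller than FP32" claim, here is back-of-the-envelope arithmetic for a hypothetical 7B-parameter model:

```python
params = 7_000_000_000                # hypothetical model size
fp32_gb = params * 4 / 1e9            # float32: 4 bytes per parameter
int8_gb = params * 1 / 1e9            # int8: 1 byte per parameter
print(f"fp32: {fp32_gb:.0f} GB, int8: {int8_gb:.0f} GB -> {fp32_gb / int8_gb:.0f}x smaller")
```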
@@ -48,7 +48,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
   * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int).
   * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks.
 
-* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8).
+* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8) and use dynamic quantization.
 
   * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
   * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
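The nibble-packing description above maps to a short sketch. This NumPy version is illustrative only (Keras's actual packed layout may differ); it packs two signed 4-bit values into each byte and recovers them:

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of signed 4-bit values (range [-8, 7]) into single bytes."""
    nib = q.astype(np.uint8) & 0x0F          # two's-complement nibbles
    return nib[0::2] | (nib[1::2] << 4)      # low nibble, then high nibble

def unpack_int4(packed):
    """Recover the signed 4-bit pairs; nibbles 8..15 represent -8..-1."""
    lo = (packed & 0x0F).astype(np.int16)
    hi = (packed >> 4).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)       # sign-extend low nibble
    hi = np.where(hi > 7, hi - 16, hi)       # sign-extend high nibble
    return np.stack([lo, hi], axis=-1).reshape(-1).astype(np.int8)

vals = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(vals)), vals)
```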
@@ -62,9 +62,9 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+* **Dynamic activation quantization**: In the `int4` and `int8` PTQ paths, activation scales are computed on the fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 
 ---
 
guides/quantization_overview.py

Lines changed: 3 additions & 2 deletions
@@ -58,8 +58,9 @@
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* **Dynamic activation quantization**: In the `int4` and `int8` PTQ paths, activation scales are computed on the fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
 ## Quantizing Keras Models
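The "Quantizing Keras Models" section these hunks lead into covers the one-call PTQ entry point. Usage typically looks like the sketch below (a toy model for illustration; check the Keras docs for the exact modes your version supports):

```python
import keras

# A small throwaway model, for illustration only.
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(10),
])
model.build(input_shape=(None, 128))   # weights must exist before quantizing

# One-call post-training quantization; per the guide above,
# modes include "int8", "int4", and "float8".
model.quantize("int8")
```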
