guides/ipynb/quantization_overview.ipynb (4 additions, 4 deletions)

@@ -31,7 +31,7 @@
 "* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.\n",
 "* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).\n",
 "\n",
-"**Terminology**\n",
+"### Terminology\n",
 "* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.\n",
 "* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.\n",
 "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).\n",
@@ -67,9 +67,9 @@
 "\n",
 "### Implementation notes\n",
 "\n",
-"* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
-"* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
-"* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n",
+"* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.\n",
+"* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
+"* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
guides/md/quantization_overview.md (10 additions, 10 deletions)

@@ -24,11 +24,11 @@ At a high level, Keras supports:
 * Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
 * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).
 
-> **Terminology**
->
-> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
-> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
-> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
+### Terminology
+
+* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
+* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
+* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
 
 ---
 
@@ -38,7 +38,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
-* **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below).
+* **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are quantized using dynamic quantization (see note below).
 * **Why use it:** Good accuracy for many architectures; broad hardware support.
 * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. Throughput gains depend on kernel availability and memory bandwidth.
 
@@ -48,7 +48,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int).
 * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks.
 
-* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8).
+* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8) and use dynamic quantization.
 
 * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
 * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
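
As an aside on the `float8` entry in the hunk above, here is a speculative round-trip sketch using the `ml_dtypes` package; this is an assumption for illustration, since the actual Keras FP8 path manages scales and uses hardware kernels that plain NumPy casting does not reproduce.

```python
# Hypothetical illustration of float8 (e4m3) semantics with ml_dtypes;
# NOT the Keras FP8 implementation.
import numpy as np
import ml_dtypes

x = np.random.randn(8).astype(np.float32)
scale = np.abs(x).max() / 448.0            # 448 = largest finite float8_e4m3fn
x_fp8 = (x / scale).astype(ml_dtypes.float8_e4m3fn)  # rounds to 3 mantissa bits
x_hat = x_fp8.astype(np.float32) * scale   # dequantize; keeps float semantics
print(np.abs(x - x_hat).max())             # small, format-dependent error
```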
@@ -62,9 +62,9 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
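
The 4-bit packing note is concrete enough to illustrate directly. Below is a sketch of packing two signed nibbles per int8 byte under the [-8, 7] range stated above; `pack_int4` and `unpack_int4` are illustrative names, not the Keras API, and real kernels also carry the per-channel scales alongside the packed bytes.

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of int4 values (range [-8, 7]) into int8 bytes."""
    q = q.astype(np.int8)
    lo = q[0::2] & 0x0F                  # low nibble: two's-complement bits, masked
    hi = (q[1::2] & 0x0F) << 4           # high nibble (wraparound gives right bits)
    return (lo | hi).astype(np.int8)

def unpack_int4(packed):
    """Recover signed int4 values; arithmetic shifts restore the sign."""
    lo = (packed.astype(np.int8) << 4) >> 4
    hi = packed.astype(np.int8) >> 4
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-8, 7, -1, 3], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(vals)), vals)
```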
guides/quantization_overview.py (3 additions, 2 deletions)

@@ -58,8 +58,9 @@
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.
+* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
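
To make the dynamic activation quantization note concrete, here is a hedged NumPy sketch of the runtime flow it describes: weights are quantized once with per-channel scales, while the activation scale is recomputed from each batch with AbsMax. `dynamic_int8_matmul` is an illustrative helper, not a Keras function.

```python
import numpy as np

def dynamic_int8_matmul(x, w_q, w_scale):
    """int8 x int8 matmul with a per-batch, per-tensor activation scale."""
    x_scale = np.max(np.abs(x)) / 127.0              # AbsMax, computed at runtime
    x_q = np.clip(np.round(x / x_scale), -128, 127).astype(np.int8)
    # Accumulate in int32, then rescale back to float with both scales.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * x_scale * w_scale  # w_scale is per-channel

# Weights are quantized once, offline, with per-output-channel scales.
w = np.random.randn(16, 4).astype(np.float32)
w_scale = np.abs(w).max(axis=0) / 127.0
w_q = np.clip(np.round(w / w_scale), -128, 127).astype(np.int8)

x = np.random.randn(2, 16).astype(np.float32)        # scale adapts per batch
print(np.abs(dynamic_int8_matmul(x, w_q, w_scale) - x @ w).max())
```

The printed value is the absolute error against the float32 matmul; it stays small because both scales track the actual data ranges, with no fixed calibration set to maintain.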