Commit 2d61c85

improves consistency
1 parent 90eb08a commit 2d61c85

3 files changed: 7 additions, 5 deletions

guides/ipynb/quantization/overview.ipynb

Lines changed: 2 additions & 2 deletions
@@ -57,7 +57,7 @@
 "\n",
 " * **How it works:** Two signed 4-bit \"nibbles\" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.\n",
 " * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.\n",
-" * **What to expect:** ~ smaller than FP32 (~ vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n",
+" * **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n",
 "\n",
 "* **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error.\n",
 "\n",
@@ -67,7 +67,7 @@
 "\n",
 "### Implementation notes\n",
 "\n",
-"* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
+"* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
 "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
 "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n",
 "\n",

guides/md/quantization/overview.md

Lines changed: 4 additions & 2 deletions
@@ -52,7 +52,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
 * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
-* **What to expect:** ~ smaller than FP32 (~ vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
+* **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
 
 * **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error.
 
@@ -62,7 +62,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* For `int4`, Keras packs signed 4-bit values (range = [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
 * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
@@ -142,6 +142,8 @@ Any composite layers that are built from the above (for example, `MultiHeadAtten
 
 Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.
 
+---
+
 ## Practical guidance
 
 * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
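
To make the `model.quantize(...)` workflow above concrete, a hedged sketch follows; the preset name is illustrative, and exact arguments can differ across Keras/KerasHub versions.

```python
import keras_hub

# Load a pretrained preset (name is illustrative).
model = keras_hub.models.CausalLM.from_preset("gemma_2b_en")

# One call quantizes in place; "int4" and "float8" are the other modes,
# and GPTQ additionally needs a calibration dataset/config.
model.quantize("int8")

# Save or serve as usual, with no training-code changes.
model.save("model_int8.keras")
```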

guides/quantization/overview.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
 * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
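
A minimal NumPy sketch of the AbsMax calibration described above (not the Keras internals): the scale is chosen so that the largest observed magnitude lands on the edge of the int8 range.

```python
import numpy as np

def absmax_scale(x, qmax=127):
    """AbsMax calibration: the largest |x| maps to the integer range edge."""
    return max(np.abs(x).max(), 1e-8) / qmax

x = np.random.randn(32, 64).astype(np.float32)   # observed activations
scale = absmax_scale(x)
x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
x_approx = x_q.astype(np.float32) * scale        # dequantized approximation
print(np.max(np.abs(x - x_approx)) <= scale / 2 + 1e-6)  # True: error <= half a step
```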
