guides/ipynb/quantization/overview.ipynb (2 additions & 2 deletions)
@@ -57,7 +57,7 @@
 "\n",
 " * **How it works:** Two signed 4-bit \"nibbles\" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.\n",
 " * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.\n",
-" * **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n",
+" * **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n",
@@ ... @@
-"* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
+"* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
 "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
 "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n",
guides/md/quantization/overview.md (4 additions & 2 deletions)
@@ -52,7 +52,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
 * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
-* **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
+* **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
@@ -62,7 +62,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
 * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
@@ -142,6 +142,8 @@ Any composite layers that are built from the above (for example, `MultiHeadAtten
 
 Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.
 
+---
+
 ## Practical guidance
 
 * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
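Since the paragraph above carries the whole single-call workflow in prose, a brief sketch may help. The preset name below is illustrative; any other KerasHub preset should behave the same way.

```python
import keras_hub

# Load a pretrained causal LM preset ("gpt2_base_en" is one example).
model = keras_hub.models.CausalLM.from_preset("gpt2_base_en")

# One call swaps supported layers (Dense, EinsumDense, Embedding, ...)
# for their quantized variants in place; training code is untouched.
model.quantize("int8")  # "int4" works the same; "gptq" additionally needs calibration data

# The quantized variant saves and serves like any other Keras model.
model.save("gpt2_base_en_int8.keras")
```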
guides/quantization/overview.py (1 addition & 1 deletion)
@@ -58,7 +58,7 @@
 
 ### Implementation notes
 
-* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
 * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
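As in the other two files, the AbsMax rule is stated only in words here, so a short numeric illustration of the default calibration follows. It is a sketch of the math, not the library's internals; `absmax_scale` is a made-up helper name.

```python
import numpy as np

def absmax_scale(x, num_bits=8):
    """AbsMax calibration: the largest observed |x| maps to the top of
    the symmetric signed-integer range."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for int8, 7 for int4
    return np.abs(x).max() / qmax

activations = np.random.randn(32, 512).astype(np.float32)
scale = absmax_scale(activations)                     # per-tensor scale
q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
x_hat = q.astype(np.float32) * scale                  # on-the-fly dequantization
print(float(np.abs(activations - x_hat).max()) <= scale / 2 + 1e-6)  # True
```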