Commit 1176f07

fix py script
1 parent 746f4d1 commit 1176f07

File tree

1 file changed: +13, -14 lines changed


guides/quantization/overview.py

Lines changed: 13 additions & 14 deletions
@@ -21,11 +21,10 @@
 * Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
 * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).

-> **Terminology**
->
-> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
-> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
-> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
+**Terminology**
+* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
+* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
+* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).


 ## Quantization Modes
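To make the terminology in this hunk concrete, here is a minimal NumPy sketch of symmetric int8 quantization with AbsMax calibration, contrasting per-tensor and per-channel scales. It is an illustrative toy, not Keras's internal implementation:

```python
import numpy as np

def absmax_scale(x, axis=None):
    """AbsMax calibration: pick the scale so the largest |x| maps to 127."""
    return np.max(np.abs(x), axis=axis, keepdims=axis is not None) / 127.0

def quantize_int8(x, scale):
    # Symmetric scheme: only a scale, the zero-point is implicitly 0.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)  # (input dim, output channels)

# Per-tensor: a single scale for the whole matrix.
s_t = absmax_scale(w)
err_t = np.abs(w - dequantize(quantize_int8(w, s_t), s_t)).mean()

# Per-channel: one scale per output channel (reduce over the input axis).
s_c = absmax_scale(w, axis=0)  # shape (1, 128), broadcasts against w
err_c = np.abs(w - dequantize(quantize_int8(w, s_c), s_c)).mean()

print(f"per-tensor error {err_t:.5f} vs per-channel error {err_c:.5f}")
```

Per-channel scales are never larger than the per-tensor scale, so the quantization step within each channel is finer, which is why the per-channel reconstruction error is typically lower.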
@@ -56,11 +55,11 @@
 * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
 * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.

-> **Implementation notes**
->
-> * For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-> * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-> * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+**Implementation notes**
+
+* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.

 ## Quantizing Keras Models

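The int4 packing this hunk describes can be sketched in a few lines. The nibble order and layout below are assumptions for illustration (Keras's actual storage format, including names like `kernel_scale`, lives in its quantized layers); the point is that two signed 4-bit values share one byte and are sign-extended back to int8 before the 8-bit matmul:

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] pairwise into uint8 (low nibble first)."""
    assert q.size % 2 == 0
    nibbles = (q.astype(np.int8) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed):
    """Unpack to int8 with sign extension, ready for 8-bit (unpacked) kernels."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)  # sign-extend 4-bit two's complement
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

q = np.array([-8, 7, 3, -1], dtype=np.int8)
packed = pack_int4(q)  # 2 bytes instead of 4
assert np.array_equal(unpack_int4(packed), q)
```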
@@ -134,8 +133,8 @@

 Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.

-> **Practical guidance**
->
-> * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
-> * Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
+**Practical guidance**
+
+* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
+* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
 """
