
Commit 90eb08a

improves consistency
1 parent a5c6141 commit 90eb08a

File tree

1 file changed (+5, -4 lines)


guides/quantization/overview.py

Lines changed: 5 additions & 4 deletions
@@ -21,7 +21,8 @@
* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).

- **Terminology**
+ ### Terminology
+
* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.
* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
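To make the scale, per-channel, and calibration terms above concrete, here is a minimal NumPy sketch of symmetric AbsMax quantization. It is illustrative only: the function names are invented and it does not mirror Keras internals.

```python
import numpy as np

def absmax_quantize(x, bits=8, axis=None):
    """Symmetric quantization: real values -> signed integers using only a scale.

    axis=None -> one scale for the whole tensor (per-tensor).
    axis=0    -> one scale per output channel of a (features_in, features_out) kernel.
    """
    qmax = 2 ** (bits - 1) - 1                                   # 127 for int8
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax   # AbsMax calibration
    scale = np.where(scale == 0, 1.0, scale)                     # guard against all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kernel = np.random.randn(64, 16).astype(np.float32)
q_t, s_t = absmax_quantize(kernel)            # per-tensor: scale shape (1, 1)
q_c, s_c = absmax_quantize(kernel, axis=0)    # per-channel: scale shape (1, 16)
print(np.abs(kernel - dequantize(q_t, s_t)).max(),
      np.abs(kernel - dequantize(q_c, s_c)).max())
```

The per-channel reconstruction error is usually the smaller of the two, since a single outlier no longer inflates the scale used for every channel.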
@@ -55,7 +56,7 @@
* **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
* **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.

- **Implementation notes**
+ ### Implementation notes

* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
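The packing mentioned above can be pictured with a short NumPy sketch: two signed 4-bit values share one byte, and unpacking sign-extends each nibble. This is a conceptual illustration only, not the exact storage layout or kernel path Keras uses.

```python
import numpy as np

def pack_int4(q):
    """Store two signed 4-bit values (each in [-8, 7]) in one int8 byte."""
    pairs = q.reshape(-1, 2).astype(np.uint8) & 0x0F        # keep only the low nibble
    return (pairs[:, 0] | (pairs[:, 1] << 4)).view(np.int8)

def unpack_int4(packed):
    """Recover the signed 4-bit pairs by sign-extending each nibble."""
    nibbles = np.stack([packed & 0x0F, (packed >> 4) & 0x0F], axis=-1).reshape(-1)
    return np.where(nibbles >= 8, nibbles - 16, nibbles).astype(np.int8)  # 8..15 -> -8..-1

q = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)  # half the storage, same values
```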
@@ -114,7 +115,7 @@
layer.quantize("int4")  # Or "int8", "float8", etc.

"""

- **When to use layer-wise quantization**
+ ### When to use layer-wise quantization

* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers.
* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally.
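A rough sketch of what mixing precisions layer by layer could look like, assuming Keras 3 with built layers; the toy model and layer names are invented for illustration.

```python
import keras

# Toy stand-in for a larger network; the layer names are purely illustrative.
model = keras.Sequential([
    keras.layers.Dense(2048, activation="relu", name="ffn_up"),
    keras.layers.Dense(512, name="ffn_down"),
    keras.layers.Dense(32, name="logits"),
])
model.build((None, 512))  # layers must be built before they can be quantized

# Quantize the large projections; keep the numerically sensitive logits in float.
for layer in model.layers:
    if layer.name in ("ffn_up", "ffn_down"):
        layer.quantize("int8")  # or "int4" / "float8", as in the snippet above
```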
@@ -133,7 +134,7 @@
Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.

- **Practical guidance**
+ ## Practical guidance

* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
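For the model-level path described above, a minimal sketch might look like the following; it assumes `keras_hub` is installed, and the preset and output filename are placeholders you would swap for your own.

```python
import keras_hub

# Example preset; any KerasHub CausalLM preset follows the same pattern.
model = keras_hub.models.CausalLM.from_preset("gpt2_base_en")

model.quantize("int8")                 # one call: post-training quantization in place
model.save("gpt2_base_en_int8.keras")  # save the quantized variant for serving
```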
