* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.
* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).

### Terminology

* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale (see the sketch after this list).
* *Per-channel vs. per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.
* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).
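
To make these terms concrete, here is a minimal NumPy sketch of symmetric abs-max quantization to int8 with one scale per output channel. The array names are illustrative, not Keras internals.

```python
import numpy as np

# Illustrative weight matrix: 8 inputs, 4 output channels.
w = np.random.randn(8, 4).astype("float32")

# Abs-max calibration, per output channel: one scale per column.
scale = np.abs(w).max(axis=0) / 127.0  # symmetric int8 uses the range [-127, 127]

# Quantize: divide by the scale, round to nearest, clip to the int8 range.
q = np.clip(np.round(w / scale), -127, 127).astype("int8")

# Dequantize: multiply back by the scale to approximate the original values.
w_hat = q.astype("float32") * scale

print("max abs error:", np.abs(w - w_hat).max())
```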
* **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
* **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.

### Implementation notes

* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels (see the packing sketch after this list).
* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (the range is set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
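
As a rough illustration of the idea (the exact layout Keras uses internally may differ), this NumPy sketch packs two two's-complement 4-bit values into each byte and unpacks them again:

```python
import numpy as np

# Signed 4-bit values in [-8, 7]; two fit in each byte (low/high nibble).
vals = np.array([-8, 7, 3, -1], dtype=np.int8)

nibbles = (vals & 0x0F).astype(np.uint8)       # two's-complement nibbles
packed = nibbles[0::2] | (nibbles[1::2] << 4)  # one uint8 per pair of values

# Unpack: extract both nibbles, then sign-extend 4-bit two's complement.
lo = packed & 0x0F
hi = (packed >> 4) & 0x0F
unpacked = np.stack([lo, hi], axis=1).reshape(-1).astype(np.int16)
unpacked = np.where(unpacked >= 8, unpacked - 16, unpacked).astype(np.int8)

print(unpacked)  # [-8  7  3 -1]
```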

layer.quantize("int4")  # Or "int8", "float8", etc.

"""
### When to use layer-wise quantization

* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers.
* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally; the sketch below shows the pattern.
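
Here is a minimal sketch of selective, layer-wise quantization using the `layer.quantize(...)` API shown above. The toy model, layer names, and sizes are illustrative stand-ins for a real network.

```python
import keras

# A toy stand-in for a real network; the layer names are illustrative.
model = keras.Sequential([
    keras.layers.Dense(512, activation="relu", name="ffn_up"),
    keras.layers.Dense(512, activation="relu", name="ffn_down"),
    keras.layers.Dense(10, name="logits"),
])
model.build((None, 128))  # weights must exist before quantizing

# Quantize the large projection layers, but leave the logits layer in
# full precision, since output heads are often numerically sensitive.
for layer in model.layers:
    if isinstance(layer, keras.layers.Dense) and layer.name != "logits":
        layer.quantize("int4")  # or "int8" for a more conservative choice
```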

Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it, without changing your training code.
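
For example, a minimal sketch assuming an illustrative Gemma preset name (substitute whichever KerasHub preset you actually use):

```python
import keras_hub

# Illustrative preset name; any KerasHub LLM preset works the same way.
model = keras_hub.models.CausalLM.from_preset("gemma2_instruct_2b_en")

# One call produces the quantized variant; saving works as usual.
model.quantize("int8")
model.save("model_int8.keras")
```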

## Practical guidance

* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention); the sketch after this list shows the shape of such a call.
* Measure both **VRAM** and **throughput/latency**: memory savings are immediate, while speedups depend on the availability of fused low-precision kernels on your device.
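
The following sketch assumes a `GPTQConfig`-style entry point in `keras.quantizers` and a `model.quantize("gptq", ...)` call; the class name and parameter names shown are assumptions based on recent Keras releases, so verify them against the documentation for your Keras version.

```python
import keras
import keras_hub

# Assumed API: keras.quantizers.GPTQConfig + model.quantize("gptq", ...).
# Check your Keras version's docs before relying on these exact names.
model = keras_hub.models.CausalLM.from_preset("gemma2_instruct_2b_en")  # illustrative preset

calibration_texts = [
    "A few hundred task-relevant samples go here...",
    # ...drawn from the same domain you will serve at inference time.
]

config = keras.quantizers.GPTQConfig(
    dataset=calibration_texts,
    tokenizer=model.preprocessor.tokenizer,  # tokenizer matching the preset
    weight_bits=4,  # the guide lists 2/3/4/8-bit GPTQ
)
model.quantize("gptq", config=config)
```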