
Commit 14df39f

style fixes
1 parent c68b0ca commit 14df39f


3 files changed: +24 −25 lines changed


guides/ipynb/quantization/overview.ipynb

Lines changed: 7 additions & 7 deletions
@@ -65,7 +65,7 @@
 " * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.\n",
 " * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.\n",
 "\n",
-"**Implementation notes**\n",
+"### Implementation notes\n",
 "\n",
 "* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n",
 "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n",
@@ -95,21 +95,21 @@
 "import keras\n",
 "import numpy as np\n",
 "\n",
-"# Sample training data\n",
+"# Sample training data.\n",
 "x_train = keras.ops.array(np.random.rand(100, 10))\n",
 "y_train = keras.ops.array(np.random.rand(100, 1))\n",
 "\n",
-"# Build the model\n",
+"# Build the model.\n",
 "model = keras.Sequential([\n",
 "    keras.layers.Dense(32, activation=\"relu\", input_shape=(10,)),\n",
 "    keras.layers.Dense(1)\n",
 "])\n",
 "\n",
-"# Compile and fit the model\n",
+"# Compile and fit the model.\n",
 "model.compile(optimizer=\"adam\", loss=\"mean_squared_error\")\n",
 "model.fit(x_train, y_train, epochs=1, verbose=0)\n",
 "\n",
-"# Quantize the model\n",
+"# Quantize the model.\n",
 "model.quantize(\"int8\")"
 ]
 },
@@ -140,7 +140,7 @@
 "layer = layers.Dense(32, activation=\"relu\", input_shape=input_shape)\n",
 "layer.build(input_shape)\n",
 "\n",
-"layer.quantize(\"int4\")  # or \"int8\", \"float8\", etc."
+"layer.quantize(\"int4\")  # Or \"int8\", \"float8\", etc."
 ]
 },
 {
@@ -167,7 +167,7 @@
 "\n",
 "Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.\n",
 "\n",
-"**Practical guidance**\n",
+"## Practical guidance\n",
 "\n",
 "* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).\n",
 "* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device."

guides/md/quantization/overview.md

Lines changed: 10 additions & 11 deletions
@@ -5,7 +5,6 @@
 **Last modified:** 2025/10/09<br>
 **Description:** Overview of quantization in Keras (int8, float8, int4, GPTQ).
 
-
 <img class="k-inline-icon" src="https://colab.research.google.com/img/colab_favicon.ico"/> [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/quantization/overview.ipynb) <span class="k-dot">•</span><img class="k-inline-icon" src="https://github.com/favicon.ico"/> [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/guides/quantization/overview.py)
 
 ---
@@ -61,11 +60,11 @@ Keras currently focuses on the following numeric formats. Each mode can be appli
 * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
 * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.
 
-> **Implementation notes**
->
-> * For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
-> * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
-> * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
+### Implementation notes
+
+* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
+* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
 ---
 
@@ -122,7 +121,7 @@ layer.build(input_shape)
 layer.quantize("int4")  # or "int8", "float8", etc.
 ```
 
-**When to use layer-wise quantization**
+### When to use layer-wise quantization
 
 * To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers.
 * To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally.
@@ -143,7 +142,7 @@ Any composite layers that are built from the above (for example, `MultiHeadAtten
 
 Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.
 
-> **Practical guidance**
->
-> * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
-> * Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
+## Practical guidance
+
+* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
+* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device.
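
The "When to use layer-wise quantization" heading promoted in the third hunk is about mixing precisions per layer. A minimal sketch of that workflow, using only the `keras.layers.Dense` and `layer.quantize(...)` calls shown in the guide (the model structure and layer names are hypothetical):

```python
import keras

# A toy functional model standing in for a larger network (hypothetical architecture).
inputs = keras.Input(shape=(128,))
x = keras.layers.Dense(512, activation="relu", name="ffn_up")(inputs)
x = keras.layers.Dense(128, name="ffn_down")(x)
outputs = keras.layers.Dense(10, name="logits")(x)
model = keras.Model(inputs, outputs)

# Quantize the large projection layers; keep the logits layer at full precision,
# since it is the kind of numerically sensitive block the guidance above calls out.
for layer in model.layers:
    if isinstance(layer, keras.layers.Dense) and layer.name != "logits":
        layer.quantize("int8")  # or "int4" for the biggest layers
```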

guides/quantization/overview.py

Lines changed: 7 additions & 7 deletions
@@ -47,7 +47,7 @@
 
 * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.
 * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.
-* **What to expect:** ~ smaller than FP32 (~ vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
+* **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.
 
 * **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error.
 
@@ -57,7 +57,7 @@
 
 **Implementation notes**
 
-* For `int4`, Keras packs signed 4-bit values (range ≈ [8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
+* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.
 * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
 * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.
 
@@ -78,21 +78,21 @@
 import keras
 import numpy as np
 
-# Sample training data
+# Sample training data.
 x_train = keras.ops.array(np.random.rand(100, 10))
 y_train = keras.ops.array(np.random.rand(100, 1))
 
-# Build the model
+# Build the model.
 model = keras.Sequential([
     keras.layers.Dense(32, activation="relu", input_shape=(10,)),
     keras.layers.Dense(1)
 ])
 
-# Compile and fit the model
+# Compile and fit the model.
 model.compile(optimizer="adam", loss="mean_squared_error")
 model.fit(x_train, y_train, epochs=1, verbose=0)
 
-# Quantize the model
+# Quantize the model.
 model.quantize("int8")
 
 """
@@ -111,7 +111,7 @@
 layer = layers.Dense(32, activation="relu", input_shape=input_shape)
 layer.build(input_shape)
 
-layer.quantize("int4")  # or "int8", "float8", etc.
+layer.quantize("int4")  # Or "int8", "float8", etc.
 
 """
 **When to use layer-wise quantization**
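
The "~8x / ~4x" figures restored in the first hunk of this file are straightforward size arithmetic. A quick check for a single hypothetical 4096x4096 kernel (the kernel shape and the float16 dtype for per-channel scales are assumptions for illustration):

```python
n_in, n_out = 4096, 4096          # hypothetical Dense kernel shape
fp32 = n_in * n_out * 4           # 4 bytes per weight
fp16 = n_in * n_out * 2           # 2 bytes per weight
int4_packed = n_in * n_out // 2   # two 4-bit nibbles per byte
scales = n_out * 2                # one per-output-channel scale, assumed float16

print(fp32 / (int4_packed + scales))  # ~8.0x vs FP32
print(fp16 / (int4_packed + scales))  # ~4.0x vs FP16
```

The per-channel scale vector barely moves the ratio, which is why the guide quotes clean ~8x / ~4x numbers.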
