* **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
* **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.

### Implementation notes

* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels (see the sketch after this list).
* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
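To make these notes concrete, here is a minimal NumPy-only sketch of AbsMax calibration and int4 packing. It is illustrative rather than Keras' actual internal layout: the array names (`kernel`, `scale`, `packed`) and the row-pairwise packing scheme are assumptions made for this example; in a real layer the per-channel scale lives in variables such as `kernel_scale`.

```python
import numpy as np

# Toy float kernel of shape (input_dim, output_dim).
kernel = np.random.randn(8, 4).astype("float32")

# AbsMax calibration, per output channel: the quantization range is set by the
# maximum absolute value observed in each column.
abs_max = np.max(np.abs(kernel), axis=0)             # shape: (4,)
scale = abs_max / 7.0                                # map abs_max onto the int4 maximum

# Quantize to signed 4-bit values in [-8, 7], then pack two values per int8 byte.
q = np.clip(np.round(kernel / scale), -8, 7).astype(np.int8)
packed = (q[0::2] & 0x0F) | ((q[1::2] & 0x0F) << 4)  # half as many rows as `q`

# "Dequantization on the fly": unpack the nibbles and rescale per channel.
low = packed & 0x0F
low = np.where(low > 7, low - 16, low)               # restore the sign of the low nibble
high = packed >> 4                                   # arithmetic shift keeps the sign
unpacked = np.empty_like(q)
unpacked[0::2], unpacked[1::2] = low, high
dequantized = unpacked.astype("float32") * scale     # approximates the original kernel
```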
95 | 95 | "import keras\n",
|
96 | 96 | "import numpy as np\n",
|
97 | 97 | "\n",
|
98 |
| - "# Sample training data\n", |
| 98 | + "# Sample training data.\n", |
99 | 99 | "x_train = keras.ops.array(np.random.rand(100, 10))\n",
|
100 | 100 | "y_train = keras.ops.array(np.random.rand(100, 1))\n",
|
101 | 101 | "\n",
|
102 |
| - "# Build the model\n", |
| 102 | + "# Build the model.\n", |
103 | 103 | "model = keras.Sequential([\n",
|
104 | 104 | " keras.layers.Dense(32, activation=\"relu\", input_shape=(10,)),\n",
|
105 | 105 | " keras.layers.Dense(1)\n",
|
106 | 106 | "])\n",
|
107 | 107 | "\n",
|
108 |
| - "# Compile and fit the model\n", |
| 108 | + "# Compile and fit the model.\n", |
109 | 109 | "model.compile(optimizer=\"adam\", loss=\"mean_squared_error\")\n",
|
110 | 110 | "model.fit(x_train, y_train, epochs=1, verbose=0)\n",
|
111 | 111 | "\n",
|
112 |
| - "# Quantize the model\n", |
| 112 | + "# Quantize the model.\n", |
113 | 113 | "model.quantize(\"int8\")"
|
114 | 114 | ]
|
115 | 115 | },
|
|
```python
layer = layers.Dense(32, activation="relu", input_shape=input_shape)
layer.build(input_shape)

layer.quantize("int4")  # Or "int8", "float8", etc.
```
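After quantizing a layer, it can be instructive to look at what changed. The lines below are a suggested check, not part of the original guide; exact variable names and dtype-policy strings vary across Keras versions.

```python
# Inspect the variables the quantized layer now holds (for int4 this includes a
# packed kernel and per-channel scales such as `kernel_scale`), plus its dtype policy.
for variable in layer.weights:
    print(variable.path, variable.shape, variable.dtype)
print(layer.dtype_policy)
```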
Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it, without changing your training code.
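As a quick illustration, the snippet below loads a KerasHub preset, quantizes it in place, and saves the result. It is a sketch rather than part of the original guide: the preset name `"gpt2_base_en"`, the prompt, and the output filename are placeholders, and GPTQ specifically may require a calibration configuration object rather than a plain mode string.

```python
import keras_hub

# Load a pretrained causal LM from a preset (placeholder preset name).
model = keras_hub.models.CausalLM.from_preset("gpt2_base_en")

# One call performs post-training quantization in place; "int4" works the same way.
model.quantize("int8")

# The result is still an ordinary Keras model: generate with it, save it, or serve it.
print(model.generate("Quantization in Keras is", max_length=30))
model.save("gpt2_base_en_int8.keras")
```

Because quantization happens after `from_preset`, the same pattern applies to any preset and any supported mode without touching training or preprocessing code.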
## Practical guidance

* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device (see the timing sketch below).
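A minimal latency measurement could look like the sketch below. The helper name `measure_latency` is made up for this example, and it assumes the quantized KerasHub `model` from the snippet above; a real benchmark should also fix the generated length and track device memory with backend tools (for example `nvidia-smi`), since throughput and VRAM need to be measured separately.

```python
import time

def measure_latency(model, prompt, max_length=64, warmup=2, iters=10):
    """Return average seconds per `generate` call as a coarse latency proxy."""
    for _ in range(warmup):
        model.generate(prompt, max_length=max_length)  # warm-up: exclude compilation/caching
    start = time.perf_counter()
    for _ in range(iters):
        model.generate(prompt, max_length=max_length)
    return (time.perf_counter() - start) / iters

print(f"avg seconds per generate call: {measure_latency(model, 'Quantization in Keras is'):.3f}")
```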