From 4970e3a9cbbbccc7d917d29aae711b3a46fd35cc Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Thu, 9 Oct 2025 11:11:05 +0530 Subject: [PATCH 1/8] Adds quantization documentation --- guides/ipynb/quantization/overview.ipynb | 190 +++++++++++++++++++++++ guides/md/quantization/overview.md | 148 ++++++++++++++++++ guides/quantization/overview.py | 140 +++++++++++++++++ scripts/guides_master.py | 4 + 4 files changed, 482 insertions(+) create mode 100644 guides/ipynb/quantization/overview.ipynb create mode 100644 guides/md/quantization/overview.md create mode 100644 guides/quantization/overview.py diff --git a/guides/ipynb/quantization/overview.ipynb b/guides/ipynb/quantization/overview.ipynb new file mode 100644 index 0000000000..657f0b0b35 --- /dev/null +++ b/guides/ipynb/quantization/overview.ipynb @@ -0,0 +1,190 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "35a7da8b", + "metadata": {}, + "source": [ + "# Quantization in Keras\n", + "Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)\n", + "\n", + "Date created: 2025/10/09\n", + "\n", + "Last modified: 2025/10/09\n", + "\n", + "Description: Overview of quantization in Keras (int8, float8, int4, GPTQ).\n", + "\n", + "Accelerator: GPU\n", + "\n", + "## Introduction\n", + "\n", + "Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which:\n", + "\n", + "* Shrinks model size and VRAM/RAM footprint.\n", + "* Increases effective memory bandwidth (fewer bytes per value).\n", + "* Can improve throughput and sometimes latency on supporting hardware with low-precision kernels.\n", + "\n", + "Keras provides first-class **post-training quantization (PTQ)** workflows which support pretrained models and expose a uniform API at both the model and layer level.\n", + "\n", + "At a high level, Keras supports:\n", + "\n", + "* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.\n", + "* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).\n", + "\n", + "**Terminology**\n", + "* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.\n", + "* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.\n", + "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).\n", + "\n", + "\n", + "## Quantization Modes\n", + "\n", + "Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API.\n", + "\n", + "* **`int8` (8-bit integer)**: **joint weight + activation** PTQ.\n", + "\n", + " * **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below).\n", + " * **Why use it:** Good accuracy for many architectures; broad hardware support.\n", + " * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. 
Throughput gains depend on kernel availability and memory bandwidth.\n", + "\n", + "* **`float8` (FP8: E4M3 / E5M2 variants)**: Low-precision floating-point useful for training and inference on FP8-capable hardware.\n", + "\n", + " * **How it works:** Values are quantized to FP8 with a dynamic scale. Fused FP8 kernels on supported devices yield speedups.\n", + " * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int).\n", + " * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks.\n", + "\n", + "* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8).\n", + "\n", + " * **How it works:** Two signed 4-bit \"nibbles\" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.\n", + " * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.\n", + " * **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n", + "\n", + "* **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error.\n", + "\n", + " * **How it works (brief):** For each weight block (group), GPTQ solves a local least-squares problem using a Hessian approximation built from a small calibration set, then quantizes to low bit-width. The result is a packed weight tensor plus per-group parameters (e.g., scales).\n", + " * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.\n", + " * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.\n", + "\n", + "### Implementation notes\n", + "\n", + "* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n", + "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n", + "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n", + "\n", + "## Quantizing Keras Models\n", + "\n", + "Quantization is applied explicitly after layers or models are built. The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model.\n", + "\n", + "Typical workflow:\n", + "\n", + "1. **Build / load your FP model.** Train if needed. Ensure `build()` or a forward pass has materialized weights.\n", + "2. **(GPTQ only)** For GPTQ, Keras runs a short calibration pass to collect activation statistics. You will need to provide a small, representative dataset for this purpose.\n", + "3. 
**Invoke quantization.** Call `model.quantize(\"\")` or `layer.quantize(\"\")` with `\"int8\"`, `\"int4\"`, `\"float8\"`, or `\"gptq\"` (weight-only).\n", + "4. **Use or save.** Run inference, or `model.save(...)`. Quantization state (packed weights, scales, metadata) is preserved on save/load.\n", + "\n", + "### Model Quantization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9944077", + "metadata": {}, + "outputs": [], + "source": [ + "import keras\n", + "import numpy as np\n", + "\n", + "# Sample training data.\n", + "x_train = keras.ops.array(np.random.rand(100, 10))\n", + "y_train = keras.ops.array(np.random.rand(100, 1))\n", + "\n", + "# Build the model.\n", + "model = keras.Sequential([\n", + " keras.layers.Dense(32, activation=\"relu\", input_shape=(10,)),\n", + " keras.layers.Dense(1)\n", + "])\n", + "\n", + "# Compile and fit the model.\n", + "model.compile(optimizer=\"adam\", loss=\"mean_squared_error\")\n", + "model.fit(x_train, y_train, epochs=1, verbose=0)\n", + "\n", + "# Quantize the model.\n", + "model.quantize(\"int8\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9b1d974", + "metadata": {}, + "source": [ + "**What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales.\n", + "\n", + "**Note**: Throughput gains depend on backend/hardware kernels; in cases where kernels fall back to dequantized matmul, you still get memory savings but smaller speedups.\n", + "\n", + "### Layer-wise Quantization\n", + "\n", + "The Keras quantization framework allows you to quantize each layer separately, without having to quantize the entire model using the same unified API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0df2aa1a", + "metadata": {}, + "outputs": [], + "source": [ + "from keras import layers\n", + "\n", + "input_shape = (10,)\n", + "layer = layers.Dense(32, activation=\"relu\", input_shape=input_shape)\n", + "layer.build(input_shape)\n", + "\n", + "layer.quantize(\"int4\") # Or \"int8\", \"float8\", etc." + ] + }, + { + "cell_type": "markdown", + "id": "249deef4", + "metadata": {}, + "source": [ + "**When to use layer-wise quantization**\n", + "\n", + "* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers.\n", + "* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally.\n", + "* Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions.\n", + "\n", + "## Layer & model coverage\n", + "\n", + "Keras supports the following core layers in its quantization framework:\n", + "\n", + "* `Dense`\n", + "* `EinsumDense`\n", + "* `Embedding` (available in KerasHub)\n", + "* `ReversibleEmbedding` (available in KerasHub)\n", + "\n", + "Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures.\n", + "\n", + "Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. 
In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.\n", + "\n", + "## Practical guidance\n", + "\n", + "* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).\n", + "* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device." + ] + }, + { + "cell_type": "markdown", + "id": "cce23bb3", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/guides/md/quantization/overview.md b/guides/md/quantization/overview.md new file mode 100644 index 0000000000..2d82377dea --- /dev/null +++ b/guides/md/quantization/overview.md @@ -0,0 +1,148 @@ +# Quantization in Keras + +**Author:** [Jyotinder Singh](https://x.com/Jyotinder_Singh)
+**Date created:** 2025/10/09
+**Last modified:** 2025/10/09
+**Description:** Overview of quantization in Keras (int8, float8, int4, GPTQ). + + [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/quantization/overview.ipynb) [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/guides/quantization/overview.py) + +--- + +## Introduction + +Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which: + +* Shrinks model size and VRAM/RAM footprint. +* Increases effective memory bandwidth (fewer bytes per value). +* Can improve throughput and sometimes latency on supporting hardware with low-precision kernels. + +Keras provides first-class **post-training quantization (PTQ)** workflows which support pretrained models and expose a uniform API at both the model and layer level. + +At a high level, Keras supports: + +* Joint weight + activation PTQ in `int4`, `int8`, and `float8`. +* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs). + +> **Terminology** +> +> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. +> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. +> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). + +--- + +## Quantization Modes + +Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API. + +* **`int8` (8-bit integer)**: **joint weight + activation** PTQ. + + * **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below). + * **Why use it:** Good accuracy for many architectures; broad hardware support. + * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. Throughput gains depend on kernel availability and memory bandwidth. + +* **`float8` (FP8: E4M3 / E5M2 variants)**: Low-precision floating-point useful for training and inference on FP8-capable hardware. + + * **How it works:** Values are quantized to FP8 with a dynamic scale. Fused FP8 kernels on supported devices yield speedups. + * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int). + * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks. + +* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8). + + * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul. + * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling. 
+ * **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models. + +* **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error. + + * **How it works (brief):** For each weight block (group), GPTQ solves a local least-squares problem using a Hessian approximation built from a small calibration set, then quantizes to low bit-width. The result is a packed weight tensor plus per-group parameters (e.g., scales). + * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression. + * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples. + +### Implementation notes + +* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. +* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. + +--- + +## Quantizing Keras Models + +Quantization is applied explicitly after layers or models are built. The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model. + +Typical workflow: + +1. **Build / load your FP model.** Train if needed. Ensure `build()` or a forward pass has materialized weights. +2. **(GPTQ only)** For GPTQ, Keras runs a short calibration pass to collect activation statistics. You will need to provide a small, representative dataset for this purpose. +3. **Invoke quantization.** Call `model.quantize("")` or `layer.quantize("")` with `"int8"`, `"int4"`, `"float8"`, or `"gptq"` (weight-only). +4. **Use or save.** Run inference, or `model.save(...)`. Quantization state (packed weights, scales, metadata) is preserved on save/load. + +### Model Quantization + +```python +import keras +import numpy as np + +# Sample training data +x_train = keras.ops.array(np.random.rand(100, 10)) +y_train = keras.ops.array(np.random.rand(100, 1)) + +# Build the model +model = keras.Sequential([ + keras.layers.Dense(32, activation="relu", input_shape=(10,)), + keras.layers.Dense(1) +]) + +# Compile and fit the model +model.compile(optimizer="adam", loss="mean_squared_error") +model.fit(x_train, y_train, epochs=1, verbose=0) + +# Quantize the model +model.quantize("int8") +``` + +**What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales. + +**Note**: Throughput gains depend on backend/hardware kernels; in cases where kernels fall back to dequantized matmul, you still get memory savings but smaller speedups. + +### Layer-wise Quantization + +The Keras quantization framework allows you to quantize each layer separately, without having to quantize the entire model using the same unified API. 
+ +```python +from keras import layers + +input_shape = (10,) +layer = layers.Dense(32, activation="relu", input_shape=input_shape) +layer.build(input_shape) + +layer.quantize("int4") # or "int8", "float8", etc. +``` + +### When to use layer-wise quantization + +* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers. +* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally. +* Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions. + +--- + +## Layer & model coverage + +Keras supports the following core layers in its quantization framework: + +* `Dense` +* `EinsumDense` +* `Embedding` (available in KerasHub) +* `ReversibleEmbedding` (available in KerasHub) + +Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures. + +Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code. + +## Practical guidance + +* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). +* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device. diff --git a/guides/quantization/overview.py b/guides/quantization/overview.py new file mode 100644 index 0000000000..7092956201 --- /dev/null +++ b/guides/quantization/overview.py @@ -0,0 +1,140 @@ +""" +Title: Quantization in Keras +Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)
+Date created: 2025/10/09
+Last modified: 2025/10/09
+Description: Overview of quantization in Keras (int8, float8, int4, GPTQ). +Accelerator: GPU + +## Introduction + +Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which: + +* Shrinks model size and VRAM/RAM footprint. +* Increases effective memory bandwidth (fewer bytes per value). +* Can improve throughput and sometimes latency on supporting hardware with low-precision kernels. + +Keras provides first-class **post-training quantization (PTQ)** workflows which support pretrained models and expose a uniform API at both the model and layer level. + +At a high level, Keras supports: + +* Joint weight + activation PTQ in `int4`, `int8`, and `float8`. +* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs). + +**Terminology** +* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. +* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. +* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). + + +## Quantization Modes + +Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API. + +* **`int8` (8-bit integer)**: **joint weight + activation** PTQ. + + * **How it works:** Values are linearly mapped to 8-bit integers with per-channel scales. Activations are calibrated using dynamic quantization (see note below). + * **Why use it:** Good accuracy for many architectures; broad hardware support. + * **What to expect:** ~4x smaller than FP32 parameters (~2x vs FP16) and lower activation bandwidth, with small accuracy loss on many tasks. Throughput gains depend on kernel availability and memory bandwidth. + +* **`float8` (FP8: E4M3 / E5M2 variants)**: Low-precision floating-point useful for training and inference on FP8-capable hardware. + + * **How it works:** Values are quantized to FP8 with a dynamic scale. Fused FP8 kernels on supported devices yield speedups. + * **Why use it:** Mixed-precision training/inference with hardware acceleration while keeping floating-point semantics (since underflow/overflow characteristics differ from int). + * **What to expect:** Competitive speed and memory reductions where FP8 kernels are available; accuracy varies by model, but is usually acceptable for most tasks. + +* **`int4`**: Ultra-low-bit **weights** for aggressive compression; activations remain in higher precision (int8). + + * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul. + * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling. + * **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models. 
+ +* **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error. + + * **How it works (brief):** For each weight block (group), GPTQ solves a local least-squares problem using a Hessian approximation built from a small calibration set, then quantizes to low bit-width. The result is a packed weight tensor plus per-group parameters (e.g., scales). + * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression. + * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples. + +**Implementation notes** + +* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. +* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. + +## Quantizing Keras Models + +Quantization is applied explicitly after layers or models are built. The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model. + +Typical workflow: + +1. **Build / load your FP model.** Train if needed. Ensure `build()` or a forward pass has materialized weights. +2. **(GPTQ only)** For GPTQ, Keras runs a short calibration pass to collect activation statistics. You will need to provide a small, representative dataset for this purpose. +3. **Invoke quantization.** Call `model.quantize("")` or `layer.quantize("")` with `"int8"`, `"int4"`, `"float8"`, or `"gptq"` (weight-only). +4. **Use or save.** Run inference, or `model.save(...)`. Quantization state (packed weights, scales, metadata) is preserved on save/load. + +### Model Quantization +""" + +import keras +import numpy as np + +# Sample training data. +x_train = keras.ops.array(np.random.rand(100, 10)) +y_train = keras.ops.array(np.random.rand(100, 1)) + +# Build the model. +model = keras.Sequential([ + keras.layers.Dense(32, activation="relu", input_shape=(10,)), + keras.layers.Dense(1) +]) + +# Compile and fit the model. +model.compile(optimizer="adam", loss="mean_squared_error") +model.fit(x_train, y_train, epochs=1, verbose=0) + +# Quantize the model. +model.quantize("int8") + +""" +**What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales. + +**Note**: Throughput gains depend on backend/hardware kernels; in cases where kernels fall back to dequantized matmul, you still get memory savings but smaller speedups. + +### Layer-wise Quantization + +The Keras quantization framework allows you to quantize each layer separately, without having to quantize the entire model using the same unified API. +""" + +from keras import layers + +input_shape = (10,) +layer = layers.Dense(32, activation="relu", input_shape=input_shape) +layer.build(input_shape) + +layer.quantize("int4") # Or "int8", "float8", etc. 
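"""
As a quick sanity check (an illustrative addition, not something the API requires), you can run the
quantized layer on random inputs. The packed int4 kernel is dequantized on the fly inside the
matmul, so the call site looks identical to the float version; the input shape below is arbitrary.
"""

# Forward pass through the int4-quantized layer.
sample_inputs = keras.ops.array(np.random.rand(4, 10))
quantized_outputs = layer(sample_inputs)
print(quantized_outputs.shape)  # (4, 32)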
+ +""" +**When to use layer-wise quantization** + +* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers. +* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally. +* Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions. + +## Layer & model coverage + +Keras supports the following core layers in its quantization framework: + +* `Dense` +* `EinsumDense` +* `Embedding` (available in KerasHub) +* `ReversibleEmbedding` (available in KerasHub) + +Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures. + +Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code. + +**Practical guidance** + +* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). +* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device. +""" \ No newline at end of file diff --git a/scripts/guides_master.py b/scripts/guides_master.py index f7646d777c..62d5a697b5 100644 --- a/scripts/guides_master.py +++ b/scripts/guides_master.py @@ -123,6 +123,10 @@ "path": "orbax_checkpoint", "title": "Orbax Checkpointing in Keras", }, + { + "path": "quantization/overview", + "title": "Quantization in Keras", + }, # { # "path": "preprocessing_layers", # "title": "Working with preprocessing layers", From 23fba6191cc55f659a8b85b5dd00c337b61374c7 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Thu, 9 Oct 2025 17:10:01 +0530 Subject: [PATCH 2/8] Update guides/md/quantization/overview.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- guides/md/quantization/overview.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/guides/md/quantization/overview.md b/guides/md/quantization/overview.md index 2d82377dea..442dbb2828 100644 --- a/guides/md/quantization/overview.md +++ b/guides/md/quantization/overview.md @@ -85,21 +85,21 @@ Typical workflow: import keras import numpy as np -# Sample training data +# Sample training data. x_train = keras.ops.array(np.random.rand(100, 10)) y_train = keras.ops.array(np.random.rand(100, 1)) -# Build the model +# Build the model. model = keras.Sequential([ keras.layers.Dense(32, activation="relu", input_shape=(10,)), keras.layers.Dense(1) ]) -# Compile and fit the model +# Compile and fit the model. model.compile(optimizer="adam", loss="mean_squared_error") model.fit(x_train, y_train, epochs=1, verbose=0) -# Quantize the model +# Quantize the model. 
model.quantize("int8") ``` From a5c61412aa4df1421377afcfb175e19fa07366a2 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Thu, 9 Oct 2025 17:10:12 +0530 Subject: [PATCH 3/8] Update guides/md/quantization/overview.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- guides/md/quantization/overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/guides/md/quantization/overview.md b/guides/md/quantization/overview.md index 442dbb2828..966f8bf4db 100644 --- a/guides/md/quantization/overview.md +++ b/guides/md/quantization/overview.md @@ -118,7 +118,7 @@ input_shape = (10,) layer = layers.Dense(32, activation="relu", input_shape=input_shape) layer.build(input_shape) -layer.quantize("int4") # or "int8", "float8", etc. +layer.quantize("int4") # Or "int8", "float8", etc. ``` ### When to use layer-wise quantization From 90eb08a598faedeed0c3e425d55e9d56d551e302 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Thu, 9 Oct 2025 17:12:59 +0530 Subject: [PATCH 4/8] improves consistency --- guides/quantization/overview.py | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/guides/quantization/overview.py b/guides/quantization/overview.py index 7092956201..9ed0a4d28c 100644 --- a/guides/quantization/overview.py +++ b/guides/quantization/overview.py @@ -21,7 +21,8 @@ * Joint weight + activation PTQ in `int4`, `int8`, and `float8`. * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs). -**Terminology** +### Terminology + * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). @@ -55,7 +56,7 @@ * **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression. * **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples. -**Implementation notes** +### Implementation notes * For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. @@ -114,7 +115,7 @@ layer.quantize("int4") # Or "int8", "float8", etc. """ -**When to use layer-wise quantization** +### When to use layer-wise quantization * To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers. * To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally. @@ -133,7 +134,7 @@ Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. 
In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code. -**Practical guidance** +## Practical guidance * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). * Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device. From 2d61c85949ad6c770f66a4b8baa0b5ad9fe644c8 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Thu, 9 Oct 2025 17:20:46 +0530 Subject: [PATCH 5/8] improves consistency --- guides/ipynb/quantization/overview.ipynb | 4 ++-- guides/md/quantization/overview.md | 6 ++++-- guides/quantization/overview.py | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/guides/ipynb/quantization/overview.ipynb b/guides/ipynb/quantization/overview.ipynb index 657f0b0b35..e4c3aafc56 100644 --- a/guides/ipynb/quantization/overview.ipynb +++ b/guides/ipynb/quantization/overview.ipynb @@ -57,7 +57,7 @@ "\n", " * **How it works:** Two signed 4-bit \"nibbles\" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul.\n", " * **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling.\n", - " * **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n", + " * **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models.\n", "\n", "* **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error.\n", "\n", @@ -67,7 +67,7 @@ "\n", "### Implementation notes\n", "\n", - "* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n", + "* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n", "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n", "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n", "\n", diff --git a/guides/md/quantization/overview.md b/guides/md/quantization/overview.md index 966f8bf4db..6eb21d7eb5 100644 --- a/guides/md/quantization/overview.md +++ b/guides/md/quantization/overview.md @@ -52,7 +52,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli * **How it works:** Two signed 4-bit "nibbles" are packed per int8 byte. Keras uses symmetric per-output-channel scales to dequantize efficiently inside matmul. 
* **Why use it:** Significant VRAM/storage savings for LLMs with acceptable accuracy when combined with robust per-channel scaling. - * **What to expect:** ~8× smaller than FP32 (~4× vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models. + * **What to expect:** ~8x smaller than FP32 (~4x vs FP16) for weights; throughput gains depend on kernel availability and memory bandwidth. Competitive accuracy deltas for encoder-only architectures, may show larger regressions on decoder-only models. * **`GPTQ` (weight-only 2/3/4/8 bits)**: *Second-order, post-training* method minimizing layer output error. @@ -62,7 +62,7 @@ Keras currently focuses on the following numeric formats. Each mode can be appli ### Implementation notes -* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* For `int4`, Keras packs signed 4-bit values (range = [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. @@ -142,6 +142,8 @@ Any composite layers that are built from the above (for example, `MultiHeadAtten Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code. +--- + ## Practical guidance * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). diff --git a/guides/quantization/overview.py b/guides/quantization/overview.py index 9ed0a4d28c..b03e90fb2c 100644 --- a/guides/quantization/overview.py +++ b/guides/quantization/overview.py @@ -58,7 +58,7 @@ ### Implementation notes -* For `int4`, Keras packs signed 4-bit values (range ≈ [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. * Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. 
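To make the weight-side arithmetic concrete, here is a small NumPy sketch of symmetric per-channel AbsMax quantization followed by 4-bit nibble packing. It only illustrates the scheme described in the notes above; Keras's actual packing layout, kernels, and variable names (e.g., `kernel_scale`) may differ.

```python
import numpy as np

# Float kernel of shape (input_dim, units): one AbsMax scale per output channel,
# chosen so the largest magnitude maps to the edge of the signed 4-bit range [-8, 7].
kernel = np.random.randn(10, 32).astype("float32")
scale = np.abs(kernel).max(axis=0) / 7.0  # shape (32,), one scale per unit
q = np.clip(np.round(kernel / scale), -8, 7).astype(np.int8)

# Pack two signed 4-bit values ("nibbles") into each byte along the input dimension.
low = q[0::2].astype(np.uint8) & 0x0F
high = q[1::2].astype(np.uint8) & 0x0F
packed = low | (high << 4)  # half as many bytes as `q`

# Dequantize on the fly: unpack with sign extension, then apply the per-channel scales.
lo = (packed & 0x0F).astype(np.int8)
hi = (packed >> 4).astype(np.int8)
lo[lo > 7] -= 16
hi[hi > 7] -= 16
unpacked = np.empty_like(q)
unpacked[0::2] = lo
unpacked[1::2] = hi
dequantized = unpacked.astype("float32") * scale

# Round-trip error is bounded by half a quantization step per channel.
print(np.max(np.abs(dequantized - kernel)) <= 0.5 * scale.max())  # True
```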
From 791043fc1e29aec3fc81922ffb47174eb2f2a411 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Fri, 10 Oct 2025 14:00:28 +0530 Subject: [PATCH 6/8] address reviews --- .../overview.ipynb => quantization_overview.ipynb} | 2 +- .../md/{quantization/overview.md => quantization_overview.md} | 4 ++-- guides/{quantization/overview.py => quantization_overview.py} | 2 +- scripts/guides_master.py | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) rename guides/ipynb/{quantization/overview.ipynb => quantization_overview.ipynb} (99%) rename guides/md/{quantization/overview.md => quantization_overview.md} (98%) rename guides/{quantization/overview.py => quantization_overview.py} (99%) diff --git a/guides/ipynb/quantization/overview.ipynb b/guides/ipynb/quantization_overview.ipynb similarity index 99% rename from guides/ipynb/quantization/overview.ipynb rename to guides/ipynb/quantization_overview.ipynb index e4c3aafc56..c7ab05d45d 100644 --- a/guides/ipynb/quantization/overview.ipynb +++ b/guides/ipynb/quantization_overview.ipynb @@ -160,7 +160,7 @@ "\n", "* `Dense`\n", "* `EinsumDense`\n", - "* `Embedding` (available in KerasHub)\n", + "* `Embedding`\n", "* `ReversibleEmbedding` (available in KerasHub)\n", "\n", "Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures.\n", diff --git a/guides/md/quantization/overview.md b/guides/md/quantization_overview.md similarity index 98% rename from guides/md/quantization/overview.md rename to guides/md/quantization_overview.md index 6eb21d7eb5..0d676ad431 100644 --- a/guides/md/quantization/overview.md +++ b/guides/md/quantization_overview.md @@ -5,7 +5,7 @@ **Last modified:** 2025/10/09
**Description:** Overview of quantization in Keras (int8, float8, int4, GPTQ). - [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/quantization/overview.ipynb) [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/guides/quantization/overview.py) + [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/quantization_overview.ipynb) [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/guides/quantization_overview.py) --- @@ -135,7 +135,7 @@ Keras supports the following core layers in its quantization framework: * `Dense` * `EinsumDense` -* `Embedding` (available in KerasHub) +* `Embedding` * `ReversibleEmbedding` (available in KerasHub) Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures. diff --git a/guides/quantization/overview.py b/guides/quantization_overview.py similarity index 99% rename from guides/quantization/overview.py rename to guides/quantization_overview.py index b03e90fb2c..8fb09cad14 100644 --- a/guides/quantization/overview.py +++ b/guides/quantization_overview.py @@ -127,7 +127,7 @@ * `Dense` * `EinsumDense` -* `Embedding` (available in KerasHub) +* `Embedding` * `ReversibleEmbedding` (available in KerasHub) Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures. diff --git a/scripts/guides_master.py b/scripts/guides_master.py index 62d5a697b5..1b82d10d92 100644 --- a/scripts/guides_master.py +++ b/scripts/guides_master.py @@ -124,7 +124,7 @@ "title": "Orbax Checkpointing in Keras", }, { - "path": "quantization/overview", + "path": "quantization_overview", "title": "Quantization in Keras", }, # { From 6b9cc1839d408a463fe95d7626eeb717dd97f880 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Fri, 10 Oct 2025 14:19:33 +0530 Subject: [PATCH 7/8] improve formatting and add missing note --- guides/ipynb/quantization_overview.ipynb | 143 +++++++++++++++-------- guides/md/quantization_overview.md | 45 ++++--- guides/quantization_overview.py | 35 ++++-- 3 files changed, 147 insertions(+), 76 deletions(-) diff --git a/guides/ipynb/quantization_overview.ipynb b/guides/ipynb/quantization_overview.ipynb index c7ab05d45d..5d3f43da80 100644 --- a/guides/ipynb/quantization_overview.ipynb +++ b/guides/ipynb/quantization_overview.ipynb @@ -2,20 +2,24 @@ "cells": [ { "cell_type": "markdown", - "id": "35a7da8b", - "metadata": {}, + "metadata": { + "colab_type": "text" + }, "source": [ "# Quantization in Keras\n", - "Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)\n", - "\n", - "Date created: 2025/10/09\n", - "\n", - "Last modified: 2025/10/09\n", - "\n", - "Description: Overview of quantization in Keras (int8, float8, int4, GPTQ).\n", - "\n", - "Accelerator: GPU\n", "\n", + "**Author:** [Jyotinder Singh](https://x.com/Jyotinder_Singh)
\n", + "**Date created:** 2025/10/09
\n", + "**Last modified:** 2025/10/09
\n", + "**Description:** Overview of quantization in Keras (int8, float8, int4, GPTQ)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ "## Introduction\n", "\n", "Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which:\n", @@ -31,12 +35,19 @@ "* Joint weight + activation PTQ in `int4`, `int8`, and `float8`.\n", "* Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs).\n", "\n", - "**Terminology**\n", + "### Terminology\n", + "\n", "* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale.\n", "* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor.\n", - "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value).\n", - "\n", - "\n", + "* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ "## Quantization Modes\n", "\n", "Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API.\n", @@ -67,10 +78,18 @@ "\n", "### Implementation notes\n", "\n", - "* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n", - "* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n", - "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead.\n", - "\n", + "* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges.\n", + "* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels.\n", + "* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.\n", + "* Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ "## Quantizing Keras Models\n", "\n", "Quantization is applied explicitly after layers or models are built. 
The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model.\n", @@ -87,9 +106,10 @@ }, { "cell_type": "code", - "execution_count": null, - "id": "d9944077", - "metadata": {}, + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, "outputs": [], "source": [ "import keras\n", @@ -100,10 +120,13 @@ "y_train = keras.ops.array(np.random.rand(100, 1))\n", "\n", "# Build the model.\n", - "model = keras.Sequential([\n", - " keras.layers.Dense(32, activation=\"relu\", input_shape=(10,)),\n", - " keras.layers.Dense(1)\n", - "])\n", + "model = keras.Sequential(\n", + " [\n", + " keras.Input(shape=(10,)),\n", + " keras.layers.Dense(32, activation=\"relu\"),\n", + " keras.layers.Dense(1),\n", + " ]\n", + ")\n", "\n", "# Compile and fit the model.\n", "model.compile(optimizer=\"adam\", loss=\"mean_squared_error\")\n", @@ -115,8 +138,9 @@ }, { "cell_type": "markdown", - "id": "a9b1d974", - "metadata": {}, + "metadata": { + "colab_type": "text" + }, "source": [ "**What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales.\n", "\n", @@ -129,15 +153,16 @@ }, { "cell_type": "code", - "execution_count": null, - "id": "0df2aa1a", - "metadata": {}, + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, "outputs": [], "source": [ "from keras import layers\n", "\n", "input_shape = (10,)\n", - "layer = layers.Dense(32, activation=\"relu\", input_shape=input_shape)\n", + "layer = layers.Dense(32, activation=\"relu\")\n", "layer.build(input_shape)\n", "\n", "layer.quantize(\"int4\") # Or \"int8\", \"float8\", etc." @@ -145,15 +170,23 @@ }, { "cell_type": "markdown", - "id": "249deef4", - "metadata": {}, + "metadata": { + "colab_type": "text" + }, "source": [ - "**When to use layer-wise quantization**\n", + "### When to use layer-wise quantization\n", "\n", "* To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers.\n", "* To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally.\n", - "* Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions.\n", - "\n", + "* Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ "## Layer & model coverage\n", "\n", "Keras supports the following core layers in its quantization framework:\n", @@ -165,26 +198,42 @@ "\n", "Any composite layers that are built from the above (for example, `MultiHeadAttention`, `GroupedQueryAttention`, feed-forward blocks in Transformers) inherit quantization support by construction. This covers the majority of modern encoder-only and decoder-only Transformer architectures.\n", "\n", - "Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code.\n", + "Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. 
In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it\u2014without changing your training code.\n", "\n", "## Practical guidance\n", "\n", "* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).\n", "* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device." ] - }, - { - "cell_type": "markdown", - "id": "cce23bb3", - "metadata": {}, - "source": [] } ], "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "quantization_overview", + "private_outputs": false, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.0" } }, "nbformat": 4, - "nbformat_minor": 5 -} + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/guides/md/quantization_overview.md b/guides/md/quantization_overview.md index 0d676ad431..c08aa5518a 100644 --- a/guides/md/quantization_overview.md +++ b/guides/md/quantization_overview.md @@ -5,10 +5,12 @@ **Last modified:** 2025/10/09
**Description:** Overview of quantization in Keras (int8, float8, int4, GPTQ). + [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/quantization_overview.ipynb) [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/guides/quantization_overview.py) ---- + +--- ## Introduction Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which: @@ -24,14 +26,13 @@ At a high level, Keras supports: * Joint weight + activation PTQ in `int4`, `int8`, and `float8`. * Weight-only PTQ via **GPTQ** (2/3/4/8-bit) to maximize compression with minimal accuracy impact, especially for large language models (LLMs). -> **Terminology** -> -> * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. -> * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. -> * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). +### Terminology ---- +* *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. +* *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. +* *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). +--- ## Quantization Modes Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API. @@ -62,12 +63,12 @@ Keras currently focuses on the following numeric formats. Each mode can be appli ### Implementation notes -* For `int4`, Keras packs signed 4-bit values (range = [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. -* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. +* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges. +* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. --- - ## Quantizing Keras Models Quantization is applied explicitly after layers or models are built. 
The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model. @@ -81,6 +82,7 @@ Typical workflow: ### Model Quantization + ```python import keras import numpy as np @@ -90,10 +92,13 @@ x_train = keras.ops.array(np.random.rand(100, 10)) y_train = keras.ops.array(np.random.rand(100, 1)) # Build the model. -model = keras.Sequential([ - keras.layers.Dense(32, activation="relu", input_shape=(10,)), - keras.layers.Dense(1) -]) +model = keras.Sequential( + [ + keras.Input(shape=(10,)), + keras.layers.Dense(32, activation="relu"), + keras.layers.Dense(1), + ] +) # Compile and fit the model. model.compile(optimizer="adam", loss="mean_squared_error") @@ -103,6 +108,13 @@ model.fit(x_train, y_train, epochs=1, verbose=0) model.quantize("int8") ``` +
+``` +/Users/jyotindersingh/miniconda3/envs/keras-io-env-3.10/lib/python3.10/site-packages/keras/src/models/model.py:455: UserWarning: Layer InputLayer does not have a `quantize` method implemented. + warnings.warn(str(e)) +``` +
+ **What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales. **Note**: Throughput gains depend on backend/hardware kernels; in cases where kernels fall back to dequantized matmul, you still get memory savings but smaller speedups. @@ -111,11 +123,12 @@ model.quantize("int8") The Keras quantization framework allows you to quantize each layer separately, without having to quantize the entire model using the same unified API. + ```python from keras import layers input_shape = (10,) -layer = layers.Dense(32, activation="relu", input_shape=input_shape) +layer = layers.Dense(32, activation="relu") layer.build(input_shape) layer.quantize("int4") # Or "int8", "float8", etc. @@ -128,7 +141,6 @@ layer.quantize("int4") # Or "int8", "float8", etc. * Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions. --- - ## Layer & model coverage Keras supports the following core layers in its quantization framework: @@ -143,7 +155,6 @@ Any composite layers that are built from the above (for example, `MultiHeadAtten Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it—without changing your training code. --- - ## Practical guidance * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). diff --git a/guides/quantization_overview.py b/guides/quantization_overview.py index 8fb09cad14..2e95dff7f9 100644 --- a/guides/quantization_overview.py +++ b/guides/quantization_overview.py @@ -1,11 +1,13 @@ """ Title: Quantization in Keras -Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)
-Date created: 2025/10/09
-Last modified: 2025/10/09
+Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh) +Date created: 2025/10/09 +Last modified: 2025/10/09 Description: Overview of quantization in Keras (int8, float8, int4, GPTQ). Accelerator: GPU +""" +""" ## Introduction Modern large models are often **memory- and bandwidth-bound**: most inference time is spent moving tensors between memory and compute units rather than doing math. Quantization reduces the number of bits used to represent the model's weights and (optionally) activations, which: @@ -26,8 +28,9 @@ * *Scale / zero-point:* Quantization maps real values `x` to integers `q` using a scale (and optionally a zero-point). Symmetric schemes use only a scale. * *Per-channel vs per-tensor:* A separate scale per output channel (e.g., per hidden unit) usually preserves accuracy better than a single scale for the whole tensor. * *Calibration:* A short pass over sample data to estimate activation ranges (e.g., max absolute value). +""" - +""" ## Quantization Modes Keras currently focuses on the following numeric formats. Each mode can be applied selectively to layers or to the whole model via the same API. @@ -58,10 +61,13 @@ ### Implementation notes -* For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. -* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. +* **Dynamic activation quantization**: In the `int4`, `int8` PTQ path, activation scales are computed on-the-fly at runtime (per tensor and per batch) using an AbsMax estimator. This avoids maintaining a separate, fixed set of activation scales from a calibration pass and adapts to varying input ranges. +* **4-bit packing**: For `int4`, Keras packs signed 4-bit values (range = [-8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels. +* **Calibration Strategy**: Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases. * Per-channel scaling is the default for weights where supported, because it materially improves accuracy at negligible overhead. +""" +""" ## Quantizing Keras Models Quantization is applied explicitly after layers or models are built. The API is designed to be predictable: you call quantize, the graph is rewritten, the weights are replaced, and you can immediately run inference or save the model. @@ -84,10 +90,13 @@ y_train = keras.ops.array(np.random.rand(100, 1)) # Build the model. -model = keras.Sequential([ - keras.layers.Dense(32, activation="relu", input_shape=(10,)), - keras.layers.Dense(1) -]) +model = keras.Sequential( + [ + keras.Input(shape=(10,)), + keras.layers.Dense(32, activation="relu"), + keras.layers.Dense(1), + ] +) # Compile and fit the model. model.compile(optimizer="adam", loss="mean_squared_error") @@ -109,7 +118,7 @@ from keras import layers input_shape = (10,) -layer = layers.Dense(32, activation="relu", input_shape=input_shape) +layer = layers.Dense(32, activation="relu") layer.build(input_shape) layer.quantize("int4") # Or "int8", "float8", etc. 
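As a quick sanity check for the layer-wise path above, the quantized layer can be called directly on sample inputs. This is a minimal illustrative sketch (not part of the patched guide) that reuses the `layer` built in the preceding snippet:

```python
import keras
import numpy as np

# The int4-quantized Dense layer stores packed 4-bit weights plus per-channel
# scales; inputs stay in higher precision and are quantized dynamically at call time.
x = keras.ops.array(np.random.rand(8, 10))
y = layer(x)  # forward pass through the quantized kernel path
print(y.shape)  # expected: (8, 32)
```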
@@ -120,7 +129,9 @@ * To keep numerically sensitive blocks (e.g., small residual paths, logits) at higher precision while quantizing large projection layers. * To mix modes (e.g., attention projections in int4, feed-forward in int8) and measure trade-offs incrementally. * Always validate on a small eval set after each step; mixing precisions across residual connections can shift distributions. +""" +""" ## Layer & model coverage Keras supports the following core layers in its quantization framework: @@ -138,4 +149,4 @@ * For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention). * Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device. -""" \ No newline at end of file +""" From 42ae2a47205840bbfc6aa1d58c120e94c91cf090 Mon Sep 17 00:00:00 2001 From: Jyotinder Singh <33001894+JyotinderSingh@users.noreply.github.com> Date: Fri, 10 Oct 2025 14:48:41 +0530 Subject: [PATCH 8/8] remove warning block --- guides/md/quantization_overview.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/guides/md/quantization_overview.md b/guides/md/quantization_overview.md index c08aa5518a..a210273ea5 100644 --- a/guides/md/quantization_overview.md +++ b/guides/md/quantization_overview.md @@ -108,13 +108,6 @@ model.fit(x_train, y_train, epochs=1, verbose=0) model.quantize("int8") ``` -
-``` -/Users/jyotindersingh/miniconda3/envs/keras-io-env-3.10/lib/python3.10/site-packages/keras/src/models/model.py:455: UserWarning: Layer InputLayer does not have a `quantize` method implemented. - warnings.warn(str(e)) -``` -
- **What this does:** Quantizes the weights of the supported layers, and re-wires their forward paths to be compatible with the quantized kernels and quantization scales. **Note**: Throughput gains depend on backend/hardware kernels; in cases where kernels fall back to dequantized matmul, you still get memory savings but smaller speedups.
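For completeness, here is a hedged sketch (not part of the patched guide) of how one might verify the result of `model.quantize("int8")` from the example above; exact attribute values and policy strings may vary slightly across Keras versions:

```python
# 1) Which layers were rewritten? Quantized layers report a quantized dtype
#    policy (e.g. something like "int8_from_float32"); layers without a
#    quantized implementation keep their original policy.
for layer in model.layers:
    print(layer.name, layer.dtype_policy)

# 2) Rough weight-memory footprint: `get_weights()` returns NumPy arrays,
#    so `nbytes` reflects the storage dtype (1 byte per int8 value
#    instead of 4 bytes per float32 value).
total_bytes = sum(w.nbytes for w in model.get_weights())
print(f"Total weight size: {total_bytes / 1024:.1f} KiB")
```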