* **Why use it:** Strong accuracy retention at very low bit-widths without retraining; ideal for rapid LLM compression.
* **What to expect:** Large storage/VRAM savings with small perplexity/accuracy deltas on many decoder-only models when calibrated on task-relevant samples.

### Implementation notes

* For `int4`, Keras packs signed 4-bit values (range ≈ [−8, 7]) and stores per-channel scales such as `kernel_scale`. Dequantization happens on the fly, and matmuls use 8-bit (unpacked) kernels (see the sketch after this list).
* Activation scaling for `int4` / `int8` / `float8` uses **AbsMax calibration** by default (range set by the maximum absolute value observed). Alternative calibration methods (e.g., percentile) may be added in future releases.
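To make these notes concrete, here is a minimal NumPy-only sketch of AbsMax calibration and int4 packing. It is illustrative rather than Keras' actual internal layout: the array names (`kernel`, `scale`, `packed`) and the row-pairwise packing scheme are assumptions made for this example; in a real layer the per-channel scale lives in variables such as `kernel_scale`.

```python
import numpy as np

# Toy float kernel of shape (input_dim, output_dim).
kernel = np.random.randn(8, 4).astype("float32")

# AbsMax calibration, per output channel: the quantization range is set by the
# maximum absolute value observed in each column.
abs_max = np.max(np.abs(kernel), axis=0)             # shape: (4,)
scale = abs_max / 7.0                                # map abs_max onto the int4 maximum

# Quantize to signed 4-bit values in [-8, 7], then pack two values per int8 byte.
q = np.clip(np.round(kernel / scale), -8, 7).astype(np.int8)
packed = (q[0::2] & 0x0F) | ((q[1::2] & 0x0F) << 4)  # half as many rows as `q`

# "Dequantization on the fly": unpack the nibbles and rescale per channel.
low = packed & 0x0F
low = np.where(low > 7, low - 16, low)               # restore the sign of the low nibble
high = packed >> 4                                   # arithmetic shift keeps the sign
unpacked = np.empty_like(q)
unpacked[0::2], unpacked[1::2] = low, high
dequantized = unpacked.astype("float32") * scale     # approximates the original kernel
```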
95 | 95 | "import keras\n",
|
96 | 96 | "import numpy as np\n",
|
97 | 97 | "\n",
|
98 |
| - "# Sample training data\n", |
| 98 | + "# Sample training data.\n", |
99 | 99 | "x_train = keras.ops.array(np.random.rand(100, 10))\n",
|
100 | 100 | "y_train = keras.ops.array(np.random.rand(100, 1))\n",
|
101 | 101 | "\n",
|
102 |
| - "# Build the model\n", |
| 102 | + "# Build the model.\n", |
103 | 103 | "model = keras.Sequential([\n",
|
104 | 104 | " keras.layers.Dense(32, activation=\"relu\", input_shape=(10,)),\n",
|
105 | 105 | " keras.layers.Dense(1)\n",
|
106 | 106 | "])\n",
|
107 | 107 | "\n",
|
108 |
| - "# Compile and fit the model\n", |
| 108 | + "# Compile and fit the model.\n", |
109 | 109 | "model.compile(optimizer=\"adam\", loss=\"mean_squared_error\")\n",
|
110 | 110 | "model.fit(x_train, y_train, epochs=1, verbose=0)\n",
|
111 | 111 | "\n",
|
112 |
| - "# Quantize the model\n", |
| 112 | + "# Quantize the model.\n", |
113 | 113 | "model.quantize(\"int8\")"
|
114 | 114 | ]
|
115 | 115 | },
|
|
```python
layer = layers.Dense(32, activation="relu", input_shape=input_shape)
layer.build(input_shape)

layer.quantize("int4")  # Or "int8", "float8", etc.
```
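After quantizing a layer, it can be instructive to look at what changed. The lines below are a suggested check, not part of the original guide; exact variable names and dtype-policy strings vary across Keras versions.

```python
# Inspect the variables the quantized layer now holds (for int4 this includes a
# packed kernel and per-channel scales such as `kernel_scale`), plus its dtype policy.
for variable in layer.weights:
    print(variable.path, variable.shape, variable.dtype)
print(layer.dtype_policy)
```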
Since all KerasHub models subclass `keras.Model`, they automatically support the `model.quantize(...)` API. In practice, this means you can take a popular LLM preset, call a single function to obtain an int8/int4/GPTQ-quantized variant, and then save or serve it, without changing your training code.
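As a quick illustration, the snippet below loads a KerasHub preset, quantizes it in place, and saves the result. It is a sketch rather than part of the original guide: the preset name `"gpt2_base_en"`, the prompt, and the output filename are placeholders, and GPTQ specifically may require a calibration configuration object rather than a plain mode string.

```python
import keras_hub

# Load a pretrained causal LM from a preset (placeholder preset name).
model = keras_hub.models.CausalLM.from_preset("gpt2_base_en")

# One call performs post-training quantization in place; "int4" works the same way.
model.quantize("int8")

# The result is still an ordinary Keras model: generate with it, save it, or serve it.
print(model.generate("Quantization in Keras is", max_length=30))
model.save("gpt2_base_en_int8.keras")
```

Because quantization happens after `from_preset`, the same pattern applies to any preset and any supported mode without touching training or preprocessing code.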
## Practical guidance

* For GPTQ, use a calibration set that matches your inference domain (a few hundred to a few thousand tokens is often enough to see strong retention).
* Measure both **VRAM** and **throughput/latency**: memory savings are immediate; speedups depend on the availability of fused low-precision kernels on your device (see the timing sketch below).
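A minimal latency measurement could look like the sketch below. The helper name `measure_latency` is made up for this example, and it assumes the quantized KerasHub `model` from the snippet above; a real benchmark should also fix the generated length and track device memory with backend tools (for example `nvidia-smi`), since throughput and VRAM need to be measured separately.

```python
import time

def measure_latency(model, prompt, max_length=64, warmup=2, iters=10):
    """Return average seconds per `generate` call as a coarse latency proxy."""
    for _ in range(warmup):
        model.generate(prompt, max_length=max_length)  # warm-up: exclude compilation/caching
    start = time.perf_counter()
    for _ in range(iters):
        model.generate(prompt, max_length=max_length)
    return (time.perf_counter() - start) / iters

print(f"avg seconds per generate call: {measure_latency(model, 'Quantization in Keras is'):.3f}")
```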