|
12 | 12 | "Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and / or the activations with lower precision data types like 8-bit or 4-bit.\n" |
13 | 13 | ] |
14 | 14 | }, |
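| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To build intuition for what lower-precision storage means, the next cell is a small, purely illustrative NumPy sketch of per-tensor INT8 quantization. It is not part of the Optimum Intel workflow used in this notebook, and the tensor and variable names are invented for the example.\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "# Illustrative only: a toy per-tensor INT8 quantization, not what Optimum Intel applies internally.\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for an FP32 weight tensor\n",
| | + "scale = np.abs(weights).max() / 127  # map the largest magnitude onto the INT8 range\n",
| | + "int8_weights = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)\n",
| | + "\n",
| | + "# Dequantize to inspect the rounding error introduced by 8-bit storage.\n",
| | + "dequantized = int8_weights.astype(np.float32) * scale\n",
| | + "print('max abs error:', np.abs(weights - dequantized).max())\n"
| | + ]
| | + },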
| 15 | + { |
| 16 | + "cell_type": "markdown", |
| 17 | + "id": "b70eeef0", |
| 18 | + "metadata": { |
| 19 | + "vscode": { |
| 20 | + "languageId": "raw" |
| 21 | + } |
| 22 | + }, |
| 23 | + "source": [ |
| 24 | + "## Step 1: Installation and Setup\n", |
| 25 | + "\n", |
| 26 | + "First, let's install the required dependencies." |
| 27 | + ] |
| 28 | + }, |
15 | 29 | { |
16 | 30 | "cell_type": "markdown", |
17 | 31 | "id": "e8ebc847-8181-4c8a-9236-12cb23904773", |
|
33 | 47 | "#! pip install \"optimum-intel[openvino]\" datasets num2words" |
34 | 48 | ] |
35 | 49 | }, |
| 50 | + { |
| 51 | + "cell_type": "markdown", |
| 52 | + "id": "7a179812", |
| 53 | + "metadata": { |
| 54 | + "vscode": { |
| 55 | + "languageId": "raw" |
| 56 | + } |
| 57 | + }, |
| 58 | + "source": [ |
| 59 | + "## Step 2: Preparation\n", |
| 60 | + "\n", |
| 61 | + "Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.\n" |
| 62 | + ] |
| 63 | + }, |
36 | 72 | { |
37 | 73 | "cell_type": "markdown", |
38 | 74 | "id": "f253327b-af28-41de-b010-8edbec3c2c4a", |
|
82 | 118 | "print(img_url)" |
83 | 119 | ] |
84 | 120 | }, |
| 121 | + { |
| 122 | + "cell_type": "markdown", |
| 123 | + "id": "0c9c5734", |
| 124 | + "metadata": { |
| 125 | + "vscode": { |
| 126 | + "languageId": "raw" |
| 127 | + } |
| 128 | + }, |
| 129 | + "source": [ |
| 130 | + "## Step 3: Load Original Model and Test\n", |
| 131 | + "\n", |
| 132 | + "Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.\n" |
| 133 | + ] |
| 134 | + }, |
85 | 135 | { |
86 | 136 | "cell_type": "code", |
87 | 137 | "execution_count": 3, |
|
115 | 165 | "print(generated_texts[0])" |
116 | 166 | ] |
117 | 167 | }, |
| 168 | + { |
| 169 | + "cell_type": "markdown", |
| 170 | + "id": "1075a71e", |
| 171 | + "metadata": { |
| 172 | + "vscode": { |
| 173 | + "languageId": "raw" |
| 174 | + } |
| 175 | + }, |
| 176 | + "source": [ |
| 177 | + "## Step 4: Configure and Apply Quantization\n", |
| 178 | + "\n", |
| 179 | + "Now we'll configure the quantization settings and apply them to create an INT8 version of our model. We'll use weight-only quantization for size reduction with minimal accuracy loss. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization).\n" |
| 180 | + ] |
| 181 | + }, |
| 182 | + { |
| 183 | + "cell_type": "markdown", |
| 184 | + "id": "bfd08433", |
| 185 | + "metadata": { |
| 186 | + "vscode": { |
| 187 | + "languageId": "raw" |
| 188 | + } |
| 189 | + }, |
| 190 | + "source": [ |
| 191 | + "### Step 4a: Configure Quantization Settings\n" |
| 192 | + ] |
| 193 | + }, |
118 | 194 | { |
119 | 195 | "cell_type": "code", |
120 | 196 | "execution_count": 4, |
|
149 | 225 | ")\n" |
150 | 226 | ] |
151 | 227 | }, |
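| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As an aside, the weight-only settings configured above are only one of the options described in the Optimum Intel documentation linked in Step 4. The next cell is a hedged, illustrative sketch of an alternative 4-bit weight-only configuration; the specific argument values are assumptions chosen for the example, not settings used elsewhere in this notebook.\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from optimum.intel import OVWeightQuantizationConfig\n",
| | + "\n",
| | + "# Illustrative alternative: 4-bit weight-only quantization with grouped scales.\n",
| | + "# The bits / group_size / ratio values are example assumptions; see the linked\n",
| | + "# Optimum Intel OpenVINO optimization docs for the full set of supported options.\n",
| | + "int4_config = OVWeightQuantizationConfig(bits=4, group_size=64, ratio=0.8)\n"
| | + ]
| | + },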
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "id": "e159efa8", |
| 231 | + "metadata": { |
| 232 | + "vscode": { |
| 233 | + "languageId": "raw" |
| 234 | + } |
| 235 | + }, |
| 236 | + "source": [ |
| 237 | + "### Step 4b: Apply Quantization\n" |
| 238 | + ] |
| 239 | + }, |
152 | 240 | { |
153 | 241 | "cell_type": "code", |
154 | 242 | "execution_count": 5, |
|
317 | 405 | "q_model.save_pretrained(int8_model_path)" |
318 | 406 | ] |
319 | 407 | }, |
| 408 | + { |
| 409 | + "cell_type": "markdown", |
| 410 | + "id": "0558b3b8", |
| 411 | + "metadata": { |
| 412 | + "vscode": { |
| 413 | + "languageId": "raw" |
| 414 | + } |
| 415 | + }, |
| 416 | + "source": [ |
| 417 | + "## Step 5: Compare Results\n", |
| 418 | + "\n", |
| 419 | + "Let's test the quantized model and compare it with the original model in terms of both output quality and model size.\n" |
| 420 | + ] |
| 421 | + }, |
| 422 | + { |
| 423 | + "cell_type": "markdown", |
| 424 | + "id": "a52faa10", |
| 425 | + "metadata": { |
| 426 | + "vscode": { |
| 427 | + "languageId": "raw" |
| 428 | + } |
| 429 | + }, |
| 430 | + "source": [ |
| 431 | + "### Step 5a: Test Quantized Model Output\n" |
| 432 | + ] |
| 433 | + }, |
320 | 434 | { |
321 | 435 | "cell_type": "code", |
322 | 436 | "execution_count": 6, |
|
343 | 457 | "print(generated_texts[0])" |
344 | 458 | ] |
345 | 459 | }, |
| 460 | + { |
| 461 | + "cell_type": "markdown", |
| 462 | + "id": "5d7778bf", |
| 463 | + "metadata": { |
| 464 | + "vscode": { |
| 465 | + "languageId": "raw" |
| 466 | + } |
| 467 | + }, |
| 468 | + "source": [ |
| 469 | + "### Step 5b: Compare Model Sizes\n", |
| 470 | + "\n", |
| 471 | + "Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:\n" |
| 472 | + ] |
| 473 | + }, |
346 | 474 | { |
347 | 475 | "cell_type": "code", |
348 | 476 | "execution_count": 7, |
|
365 | 493 | }, |
366 | 494 | { |
367 | 495 | "cell_type": "code", |
368 | | - "execution_count": 8, |
369 | | - "id": "8fd53000-1bad-4058-83c7-252f92e6d966", |
| 496 | + "execution_count": null, |
| 497 | + "id": "3c862277", |
370 | 498 | "metadata": {}, |
371 | | - "outputs": [ |
372 | | - { |
373 | | - "name": "stdout", |
374 | | - "output_type": "stream", |
375 | | - "text": [ |
376 | | - "FP32 model size: 1028.25 MB\n", |
377 | | - "INT8 model size: 260.94 MB\n", |
378 | | - "INT8 size decrease: 3.94x\n" |
379 | | - ] |
380 | | - } |
381 | | - ], |
| 499 | + "outputs": [], |
382 | 500 | "source": [ |
383 | 501 | "fp32_model_size = get_model_size(fp32_model_path)\n", |
384 | 502 | "int8_model_size = get_model_size(int8_model_path)\n", |
385 | 503 | "print(f\"FP32 model size: {fp32_model_size:.2f} MB\")\n", |
386 | 504 | "print(f\"INT8 model size: {int8_model_size:.2f} MB\")\n", |
387 | 505 | "print(f\"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x\")" |
388 | 506 | ] |
| 507 | + }, |
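| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "For reference, the get_model_size helper used above is defined earlier in the notebook; an equivalent implementation might look like the illustrative sketch below (the actual definition may differ).\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from pathlib import Path\n",
| | + "\n",
| | + "# Illustrative sketch of a size helper: sum the sizes of all files under a\n",
| | + "# saved model directory and report the total in megabytes.\n",
| | + "def get_model_size(model_dir):\n",
| | + "    size_bytes = sum(f.stat().st_size for f in Path(model_dir).rglob('*') if f.is_file())\n",
| | + "    return size_bytes / (1024 * 1024)\n"
| | + ]
| | + },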
| 508 | + { |
| 509 | + "cell_type": "markdown", |
| 510 | + "id": "43531db0", |
| 511 | + "metadata": { |
| 512 | + "vscode": { |
| 513 | + "languageId": "raw" |
| 514 | + } |
| 515 | + }, |
| 516 | + "source": [ |
| 517 | + "## Conclusion\n", |
| 518 | + "\n", |
| 519 | + "Great! We've successfully quantized our VLM using Optimum Intel. The results show:\n",
| 520 | + "\n",
| 521 | + "1. **Quality**: The quantized model produces the same output as the original model\n",
| 522 | + "2. **Size**: We achieved approximately a 4x reduction in model size (from ~1GB to ~260MB)\n",
| 523 | + "3. **Efficiency**: Lower-precision INT8 weights reduce the memory cost of running inference while maintaining accuracy\n",
| 524 | + "\n",
| 525 | + "This demonstrates how quantization can significantly reduce model size while preserving accuracy for visual language tasks.\n"
| 526 | + ] |
389 | 527 | } |
390 | 528 | ], |
391 | 529 | "metadata": { |
392 | 530 | "kernelspec": { |
393 | | - "display_name": "Python 3 (ipykernel)", |
| 531 | + "display_name": "openvino_env", |
394 | 532 | "language": "python", |
395 | 533 | "name": "python3" |
396 | 534 | }, |
|
404 | 542 | "name": "python", |
405 | 543 | "nbconvert_exporter": "python", |
406 | 544 | "pygments_lexer": "ipython3", |
407 | | - "version": "3.9.18" |
| 545 | + "version": "3.12.7" |
408 | 546 | } |
409 | 547 | }, |
410 | 548 | "nbformat": 4, |
|