|
12 | 12 | "Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and / or the activations with lower precision data types like 8-bit or 4-bit.\n" |
13 | 13 | ] |
14 | 14 | }, |
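| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To build intuition for what lower-precision storage means, the next cell is a small, purely illustrative NumPy sketch of per-tensor INT8 quantization. It is not part of the Optimum Intel workflow used in this notebook, and the tensor and variable names are invented for the example.\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "# Illustrative only: a toy per-tensor INT8 quantization, not what Optimum Intel applies internally.\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for an FP32 weight tensor\n",
| | + "scale = np.abs(weights).max() / 127  # map the largest magnitude onto the INT8 range\n",
| | + "int8_weights = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)\n",
| | + "\n",
| | + "# Dequantize to inspect the rounding error introduced by 8-bit storage.\n",
| | + "dequantized = int8_weights.astype(np.float32) * scale\n",
| | + "print('max abs error:', np.abs(weights - dequantized).max())\n"
| | + ]
| | + },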
| 15 | + { |
| 16 | + "cell_type": "markdown", |
| 17 | + "id": "b70eeef0", |
| 18 | + "metadata": { |
| 19 | + "vscode": { |
| 20 | + "languageId": "raw" |
| 21 | + } |
| 22 | + }, |
| 23 | + "source": [ |
| 24 | + "## Step 1: Installation and Setup\n", |
| 25 | + "\n", |
| 26 | + "First, let's install the required dependencies." |
| 27 | + ] |
| 28 | + }, |
15 | 29 | { |
16 | 30 | "cell_type": "markdown", |
17 | 31 | "id": "e8ebc847-8181-4c8a-9236-12cb23904773", |
|
33 | 47 | "#! pip install \"optimum-intel[openvino]\" datasets num2words" |
34 | 48 | ] |
35 | 49 | }, |
| 50 | + { |
| 51 | + "cell_type": "markdown", |
| 52 | + "id": "7a179812", |
| 53 | + "metadata": { |
| 54 | + "vscode": { |
| 55 | + "languageId": "raw" |
| 56 | + } |
| 57 | + }, |
| 58 | + "source": [ |
| 59 | + "## Step 2: Preparation\n", |
| 60 | + "\n", |
| 61 | + "Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.\n" |
| 62 | + ] |
| 63 | + }, |
36 | 72 | { |
37 | 73 | "cell_type": "markdown", |
38 | 74 | "id": "f253327b-af28-41de-b010-8edbec3c2c4a", |
|
82 | 118 | "print(img_url)" |
83 | 119 | ] |
84 | 120 | }, |
| 121 | + { |
| 122 | + "cell_type": "markdown", |
| 123 | + "id": "0c9c5734", |
| 124 | + "metadata": { |
| 125 | + "vscode": { |
| 126 | + "languageId": "raw" |
| 127 | + } |
| 128 | + }, |
| 129 | + "source": [ |
| 130 | + "## Step 3: Load Original Model and Test\n", |
| 131 | + "\n", |
| 132 | + "Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.\n" |
| 133 | + ] |
| 134 | + }, |
85 | 135 | { |
86 | 136 | "cell_type": "code", |
87 | 137 | "execution_count": 3, |
|
115 | 165 | "print(generated_texts[0])" |
116 | 166 | ] |
117 | 167 | }, |
| 168 | + { |
| 169 | + "cell_type": "markdown", |
| 170 | + "id": "1075a71e", |
| 171 | + "metadata": { |
| 172 | + "vscode": { |
| 173 | + "languageId": "raw" |
| 174 | + } |
| 175 | + }, |
| 176 | + "source": [ |
| 177 | + "## Step 4: Configure and Apply Quantization\n", |
| 178 | + "\n", |
| 179 | + "Now we'll configure the quantization settings and apply them to create an INT8 version of our model. We'll use weight-only quantization for size reduction with minimal accuracy loss. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization).\n" |
| 180 | + ] |
| 181 | + }, |
| 182 | + { |
| 183 | + "cell_type": "markdown", |
| 184 | + "id": "bfd08433", |
| 185 | + "metadata": { |
| 186 | + "vscode": { |
| 187 | + "languageId": "raw" |
| 188 | + } |
| 189 | + }, |
| 190 | + "source": [ |
| 191 | + "### Step 4a: Configure Quantization Settings\n" |
| 192 | + ] |
| 193 | + }, |
118 | 194 | { |
119 | 195 | "cell_type": "code", |
120 | 196 | "execution_count": 4, |
|
149 | 225 | ")\n" |
150 | 226 | ] |
151 | 227 | }, |
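| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As an aside, the weight-only settings configured above are only one of the options described in the Optimum Intel documentation linked in Step 4. The next cell is a hedged, illustrative sketch of an alternative 4-bit weight-only configuration; the specific argument values are assumptions chosen for the example, not settings used elsewhere in this notebook.\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from optimum.intel import OVWeightQuantizationConfig\n",
| | + "\n",
| | + "# Illustrative alternative: 4-bit weight-only quantization with grouped scales.\n",
| | + "# The bits / group_size / ratio values are example assumptions; see the linked\n",
| | + "# Optimum Intel OpenVINO optimization docs for the full set of supported options.\n",
| | + "int4_config = OVWeightQuantizationConfig(bits=4, group_size=64, ratio=0.8)\n"
| | + ]
| | + },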
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "id": "e159efa8", |
| 231 | + "metadata": { |
| 232 | + "vscode": { |
| 233 | + "languageId": "raw" |
| 234 | + } |
| 235 | + }, |
| 236 | + "source": [ |
| 237 | + "### Step 4b: Apply Quantization\n" |
| 238 | + ] |
| 239 | + }, |
152 | 240 | { |
153 | 241 | "cell_type": "code", |
154 | 242 | "execution_count": 5, |
|
317 | 405 | "q_model.save_pretrained(int8_model_path)" |
318 | 406 | ] |
319 | 407 | }, |
| 408 | + { |
| 409 | + "cell_type": "markdown", |
| 410 | + "id": "0558b3b8", |
| 411 | + "metadata": { |
| 412 | + "vscode": { |
| 413 | + "languageId": "raw" |
| 414 | + } |
| 415 | + }, |
| 416 | + "source": [ |
| 417 | + "## Step 5: Compare Results\n", |
| 418 | + "\n", |
| 419 | + "Let's test the quantized model and compare it with the original model in terms of both output quality and model size.\n" |
| 420 | + ] |
| 421 | + }, |
| 422 | + { |
| 423 | + "cell_type": "markdown", |
| 424 | + "id": "a52faa10", |
| 425 | + "metadata": { |
| 426 | + "vscode": { |
| 427 | + "languageId": "raw" |
| 428 | + } |
| 429 | + }, |
| 430 | + "source": [ |
| 431 | + "### Step 5a: Test Quantized Model Output\n" |
| 432 | + ] |
| 433 | + }, |
320 | 434 | { |
321 | 435 | "cell_type": "code", |
322 | 436 | "execution_count": 6, |
|
343 | 457 | "print(generated_texts[0])" |
344 | 458 | ] |
345 | 459 | }, |
| 460 | + { |
| 461 | + "cell_type": "markdown", |
| 462 | + "id": "5d7778bf", |
| 463 | + "metadata": { |
| 464 | + "vscode": { |
| 465 | + "languageId": "raw" |
| 466 | + } |
| 467 | + }, |
| 468 | + "source": [ |
| 469 | + "### Step 5b: Compare Model Sizes\n", |
| 470 | + "\n", |
| 471 | + "Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:\n" |
| 472 | + ] |
| 473 | + }, |
346 | 474 | { |
347 | 475 | "cell_type": "code", |
348 | 476 | "execution_count": 7, |
|
365 | 493 | }, |
366 | 494 | { |
367 | 495 | "cell_type": "code", |
368 | | - "execution_count": 8, |
369 | | - "id": "8fd53000-1bad-4058-83c7-252f92e6d966", |
| 496 | + "execution_count": null, |
| 497 | + "id": "3c862277", |
370 | 498 | "metadata": {}, |
371 | | - "outputs": [ |
372 | | - { |
373 | | - "name": "stdout", |
374 | | - "output_type": "stream", |
375 | | - "text": [ |
376 | | - "FP32 model size: 1028.25 MB\n", |
377 | | - "INT8 model size: 260.94 MB\n", |
378 | | - "INT8 size decrease: 3.94x\n" |
379 | | - ] |
380 | | - } |
381 | | - ], |
| 499 | + "outputs": [], |
382 | 500 | "source": [ |
383 | 501 | "fp32_model_size = get_model_size(fp32_model_path)\n", |
384 | 502 | "int8_model_size = get_model_size(int8_model_path)\n", |
385 | 503 | "print(f\"FP32 model size: {fp32_model_size:.2f} MB\")\n", |
386 | 504 | "print(f\"INT8 model size: {int8_model_size:.2f} MB\")\n", |
387 | 505 | "print(f\"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x\")" |
388 | 506 | ] |
| 507 | + }, |
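| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "For reference, the get_model_size helper used above is defined earlier in the notebook; an equivalent implementation might look like the illustrative sketch below (the actual definition may differ).\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from pathlib import Path\n",
| | + "\n",
| | + "# Illustrative sketch of a size helper: sum the sizes of all files under a\n",
| | + "# saved model directory and report the total in megabytes.\n",
| | + "def get_model_size(model_dir):\n",
| | + "    size_bytes = sum(f.stat().st_size for f in Path(model_dir).rglob('*') if f.is_file())\n",
| | + "    return size_bytes / (1024 * 1024)\n"
| | + ]
| | + },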
| 508 | + { |
| 509 | + "cell_type": "markdown", |
| 510 | + "id": "43531db0", |
| 511 | + "metadata": { |
| 512 | + "vscode": { |
| 513 | + "languageId": "raw" |
| 514 | + } |
| 515 | + }, |
| 516 | + "source": [ |
| 517 | + "## Conclusion\n", |
| 518 | + "\n", |
| 519 | + "Great! We've successfully quantized our VLM using Optimum Intel. The results show:\n",
| 520 | + "\n",
| 521 | + "1. **Quality**: The quantized model produces the same output as the original model\n",
| 522 | + "2. **Size**: We achieved approximately a 4x reduction in model size (from ~1GB to ~260MB)\n",
| 523 | + "3. **Efficiency**: Lower-precision INT8 weights reduce the memory cost of running inference while maintaining accuracy\n",
| 524 | + "\n",
| 525 | + "This demonstrates how quantization can significantly reduce model size while preserving accuracy for visual language tasks.\n"
| 526 | + ] |
389 | 527 | } |
390 | 528 | ], |
391 | 529 | "metadata": { |
392 | 530 | "kernelspec": { |
393 | | - "display_name": "Python 3 (ipykernel)", |
| 531 | + "display_name": "openvino_env", |
394 | 532 | "language": "python", |
395 | 533 | "name": "python3" |
396 | 534 | }, |
|
404 | 542 | "name": "python", |
405 | 543 | "nbconvert_exporter": "python", |
406 | 544 | "pygments_lexer": "ipython3", |
407 | | - "version": "3.9.18" |
| 545 | + "version": "3.12.7" |
408 | 546 | } |
409 | 547 | }, |
410 | 548 | "nbformat": 4, |
|