diff --git a/notebooks/medasr-medical-asr/README.md b/notebooks/medasr-medical-asr/README.md new file mode 100644 index 00000000000..958f35a1dbd --- /dev/null +++ b/notebooks/medasr-medical-asr/README.md @@ -0,0 +1,55 @@ +# MedASR Medical Speech Recognition with OpenVINO + +This notebook demonstrates converting Google's MedASR (Medical Automatic Speech Recognition) model to OpenVINO format with FP16 and INT8 quantization for efficient medical speech-to-text transcription. + +## Overview + +MedASR is a specialized speech recognition model optimized for medical terminology. This tutorial shows how to: + +- Load the MedASR model from HuggingFace +- Convert it to OpenVINO IR format for optimal inference performance +- Apply INT8 quantization using NNCF for model compression +- Compare accuracy and performance across PyTorch, FP16, and INT8 versions + +## Key Features + +- **Model Compression**: 3.9x size reduction (402 MB → 103 MB) with INT8 quantization +- **High Accuracy**: 98.38% token-level accuracy maintained after INT8 quantization +- **Medical Terminology**: Optimized for accurate medical speech recognition + +## Tutorial Contents + +1. **Installation** - Install required packages (OpenVINO, NNCF, Transformers, etc.) +2. **Login to HuggingFace** - Authenticate to access the gated MedASR model +3. **Load Model** - Load Google's MedASR model from HuggingFace +4. **Prepare Audio Data** - Download and preprocess test audio (optimized for 10s chunks) +5. **PyTorch Inference** - Establish baseline accuracy with original model +6. **Convert to OpenVINO FP16** - Convert using torch.export and ov.convert_model +7. **INT8 Quantization** - Apply NNCF quantization with real audio calibration +8. **Accuracy Comparison** - Validate quantization quality across all versions +9. 
**Performance Benchmarking** - Measure inference speed on CPU and GPU + +## Results + +- **Model Size**: 402 MB (FP16) → 103 MB (INT8) = **3.9x compression** +- **Accuracy**: 98.38% token match between INT8 and PyTorch +- **Model Shape**: Static [1, 998, 128] optimized for 10-second audio chunks + +## Installation + +```bash +pip install -q "openvino>=2024.4.0" "nncf>=2.13.0" "torch>=2.1" "transformers>=5.4.0" "librosa" "soundfile" "huggingface_hub" +``` + +## Important Notes + +⚠️ **Gated Model Access**: The MedASR model is gated on HuggingFace. You must: +1. Request access at https://huggingface.co/google/medasr +2. Authenticate with your HuggingFace token before running the notebook + +## Use Cases + +- Medical transcription systems +- Clinical documentation automation +- Healthcare voice assistants +- Medical education and training platforms diff --git a/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb b/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb new file mode 100644 index 00000000000..51de91ce0f9 --- /dev/null +++ b/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb @@ -0,0 +1,948 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d2191f78", + "metadata": {}, + "source": [ + "# MedASR Medical Speech Recognition with OpenVINO\n", + "\n", + "This notebook demonstrates converting Google's MedASR (Medical Automatic Speech Recognition) model to OpenVINO format with FP16 and INT8 quantization.\n", + "\n", + "\n", + "**Table of Contents:**\n", + "1. [Installation](#installation)\n", + "2. [Login to HuggingFace](#login-huggingface)\n", + "3. [Load Model](#load-model)\n", + "4. [Prepare Audio Data](#prepare-audio)\n", + "5. [PyTorch Inference](#pytorch-inference)\n", + "6. [Convert to OpenVINO FP16](#convert-fp16)\n", + "7. [INT8 Quantization](#int8-quantization)\n", + "8. [Accuracy Comparison](#accuracy-comparison)\n", + "9. [Performance Benchmarking](#benchmarking)" + ] + }, + { + "cell_type": "markdown", + "id": "17b070bc", + "metadata": {}, + "source": [ + "## 1. 
Installation \n", + "\n", + "Install required packages for model conversion and optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "00e8dfeb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q \"openvino>=2024.4.0\" \"nncf>=2.13.0\" \"torch>=2.1\" \"transformers>=5.4.0\" \"librosa\" \"soundfile\" \"huggingface_hub\" \"matplotlib\" \"numpy\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Login to HuggingFace \n", + "\n", + "To run the model, you must be a registered user in \ud83e\udd17 [Hugging Face Hub](https://huggingface.co/). \n", + "\n", + "The MedASR model is gated and requires you to:\n", + "1. Visit the [MedASR model card](https://huggingface.co/google/medasr)\n", + "2. Carefully read the terms of usage\n", + "3. Click the accept button to agree to the license\n", + "\n", + "You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).\n", + "\n", + "You can login to Hugging Face Hub using the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Login to HuggingFace Hub to get access to the pretrained model\n", + "\n", + "from huggingface_hub import notebook_login, whoami\n", + "\n", + "try:\n", + " whoami()\n", + " print('Authorization token already provided')\n", + "except OSError:\n", + " notebook_login()" + ] + }, + { + "cell_type": "markdown", + "id": "9af6c1e8", + "metadata": {}, + "source": [ + "## 3. Load Model \n", + "\n", + "Load Google's MedASR model from HuggingFace. This is a CTC-based ASR model optimized for medical terminology." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "02634872", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading model: google/medasr\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b34851db22814704b3ef571ba747d64e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading weights: 0%| | 0/368 [00:00\n", + "\n", + "Download test audio and prepare it for model conversion. We use **10-second audio** for optimal GPU performance.\n", + "\n", + "- Creates model with shape `[1, 998, 128]`\n", + "- Longer audio can be processed via chunking" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "e1f5c284", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Full audio duration: 43.80 seconds\n", + "Optimized audio duration: 10.00 seconds\n", + "Sample rate: 16000 Hz\n", + "\n", + "\u2713 Input features shape: torch.Size([1, 998, 128])\n", + "\u2713 Attention mask shape: torch.Size([1, 998])\n", + "\u2713 Model will be created with static shape: [1, 998, 128]\n" + ] + } + ], + "source": [ + "# Download test audio from HuggingFace\n", + "audio_file = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')\n", + "speech_full, sample_rate = librosa.load(audio_file, sr=16000)\n", + "\n", + "print(f\"Full audio duration: {len(speech_full)/sample_rate:.2f} seconds\")\n", + "\n", + "# Use 10s audio for optimal model shape\n", + "OPTIMAL_DURATION = 10.0\n", + "speech_10s = speech_full[:int(OPTIMAL_DURATION * sample_rate)]\n", + "\n", + "print(f\"Optimized audio duration: {len(speech_10s)/sample_rate:.2f} seconds\")\n", + "print(f\"Sample rate: {sample_rate} Hz\")\n", + "\n", + "# Extract features for model conversion\n", + "inputs = feature_extractor(speech_10s, sampling_rate=sample_rate, return_tensors=\"pt\", \n", + " padding=True, return_attention_mask=True)\n", + "\n", 
+ "input_features = inputs.input_features\n", + "attention_mask = inputs.attention_mask.to(torch.float32)\n", + "\n", + "SEQ_LEN = input_features.shape[1]\n", + "FEATURE_DIM = input_features.shape[2]\n", + "\n", + "print(f\"\\n\u2713 Input features shape: {input_features.shape}\")\n", + "print(f\"\u2713 Attention mask shape: {attention_mask.shape}\")\n", + "print(f\"\u2713 Model will be created with static shape: [1, {SEQ_LEN}, {FEATURE_DIM}]\")" + ] + }, + { + "cell_type": "markdown", + "id": "15d8fbe5", + "metadata": {}, + "source": [ + "## 5. PyTorch Inference \n", + "\n", + "Run inference with PyTorch model to establish baseline accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "e69f2d24", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "PyTorch Inference Results:\n", + "Transcription: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n", + "Logits shape: (1, 247, 512)\n", + "Logits range: [-26.49, 24.95]\n" + ] + } + ], + "source": [ + "# PyTorch inference\n", + "model.eval()\n", + "with torch.no_grad():\n", + " pt_outputs = model(input_features, attention_mask=attention_mask.long())\n", + " pt_logits = pt_outputs.logits.numpy()\n", + " pt_ids = np.argmax(pt_logits, axis=-1)\n", + " pt_transcription = tokenizer.batch_decode(pt_ids)[0]\n", + "\n", + "print(\"PyTorch Inference Results:\")\n", + "print(f\"Transcription: {pt_transcription}\")\n", + "print(f\"Logits shape: {pt_logits.shape}\")\n", + "print(f\"Logits range: [{pt_logits.min():.2f}, {pt_logits.max():.2f}]\")" + ] + }, + { + "cell_type": "markdown", + "id": "52e069a7", + "metadata": {}, + "source": [ + "## 6. Convert to OpenVINO FP16 \n", + "\n", + "Convert the PyTorch model to OpenVINO IR format using `torch.export` and `ov.convert_model`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "6894c65f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Converting PyTorch model to OpenVINO IR...\n", + "Input shape: torch.Size([1, 998, 128])\n", + "\u2713 Model exported with torch.export\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/user/miniforge3/lib/python3.13/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.\n", + " return cls.__new__(cls, *args)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u2713 Model reshaped to static: [1, 998, 128]\n", + "\n", + "\u2713 FP16 model saved: medasr_fp16.xml\n", + "\u2713 Model size: 402.71 MB\n", + "\n", + "Model inputs:\n", + " input_features: [1,998,128]\n", + " attention_mask: [1,998]\n" + ] + } + ], + "source": [ + "import openvino as ov\n", + "import os\n", + "\n", + "FP16_MODEL_PATH = Path(\"medasr_fp16.xml\")\n", + "\n", + "# Create model wrapper for clean export\n", + "class MedASRWrapper(torch.nn.Module):\n", + " def __init__(self, model):\n", + " super().__init__()\n", + " self.model = model\n", + " \n", + " def forward(self, input_features, attention_mask):\n", + " outputs = self.model(input_features=input_features, attention_mask=attention_mask)\n", + " return outputs.logits\n", + "\n", + "wrapped_model = MedASRWrapper(model)\n", + "wrapped_model.eval()\n", + "\n", + "print(\"Converting PyTorch model to OpenVINO IR...\")\n", + "print(f\"Input shape: {input_features.shape}\")\n", + "\n", + "with torch.no_grad():\n", + " # Export using torch.export\n", + " exported = torch.export.export(\n", + " wrapped_model,\n", + " (input_features, attention_mask)\n", + " )\n", + " print(\"\u2713 Model exported with torch.export\")\n", + " \n", + " # Convert to OpenVINO\n", + " ov_model = ov.convert_model(exported)\n", + " \n", + " # 
Reshape to static shape for optimal GPU performance\n", + " ov_model.reshape({\n", + " 'input_features': [1, SEQ_LEN, FEATURE_DIM],\n", + " 'attention_mask': [1, SEQ_LEN]\n", + " })\n", + " print(f\"\u2713 Model reshaped to static: [1, {SEQ_LEN}, {FEATURE_DIM}]\")\n", + "\n", + "# Save FP16 model (without FP16 compression to avoid GPU numerical issues)\n", + "ov.save_model(ov_model, FP16_MODEL_PATH, compress_to_fp16=False)\n", + "\n", + "fp16_size = (os.path.getsize(FP16_MODEL_PATH) + os.path.getsize(FP16_MODEL_PATH.with_suffix('.bin'))) / 1024 / 1024\n", + "print(f\"\\n\u2713 FP16 model saved: {FP16_MODEL_PATH}\")\n", + "print(f\"\u2713 Model size: {fp16_size:.2f} MB\")\n", + "\n", + "# Verify model inputs\n", + "print(\"\\nModel inputs:\")\n", + "for inp in ov_model.inputs:\n", + " print(f\" {inp.get_any_name()}: {inp.partial_shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "58ecb7a5", + "metadata": {}, + "source": [ + "## 7. INT8 Quantization \n", + "\n", + "Quantize the model to INT8 using NNCF with **real audio data** for calibration.\n", + "\n", + "**Key Settings:**\n", + "- `ModelType.TRANSFORMER` - Optimized for transformer models\n", + "- Real audio calibration data - Better accuracy than random data\n", + "- `fast_bias_correction` - Faster quantization with good results" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "f53fdd7b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Preparing calibration data from real audio...\n", + "\u2713 Created 100 calibration samples\n", + "\n", + "Quantizing to INT8 with TRANSFORMER preset...\n", + "This may take a few minutes...\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b365a953a36c4e6eb426ab3a6391e10e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+            ],
+            "text/plain": []
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/vnd.jupyter.widget-view+json": {
+              "model_id": "0f684296f1e04d90b8291eec9f0e3cb2",
+              "version_major": 2,
+              "version_minor": 0
+            },
+            "text/plain": [
+              "Output()"
+            ]
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "text/html": [
+              "
\n"
+            ],
+            "text/plain": []
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/vnd.jupyter.widget-view+json": {
+              "model_id": "8599299331864fda878f1a3a6771a45b",
+              "version_major": 2,
+              "version_minor": 0
+            },
+            "text/plain": [
+              "Output()"
+            ]
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "text/html": [
+              "
\n"
+            ],
+            "text/plain": []
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/vnd.jupyter.widget-view+json": {
+              "model_id": "dddda89a90154b648c15889d1bebf978",
+              "version_major": 2,
+              "version_minor": 0
+            },
+            "text/plain": [
+              "Output()"
+            ]
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "text/html": [
+              "
\n"
+            ],
+            "text/plain": []
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "\u2713 Quantization complete!\n",
+            "\n",
+            "Quantized model inputs:\n",
+            "  input_features: [1,998,128]\n",
+            "  attention_mask: [1,998]\n",
+            "\n",
+            "\u2713 INT8 model saved: medasr_int8.xml\n",
+            "\u2713 Model size: 103.51 MB\n",
+            "\u2713 Compression ratio: 3.89x\n"
+          ]
+        }
+      ],
+      "source": [
+        "import nncf\n",
+        "from nncf import Dataset\n",
+        "\n",
+        "INT8_MODEL_PATH = Path(\"medasr_int8.xml\")\n",
+        "\n",
+        "print(\"Preparing calibration data from real audio...\")\n",
+        "\n",
+        "# Create calibration data from the test audio with variations\n",
+        "calibration_data = []\n",
+        "\n",
+        "# Use the real audio features as base\n",
+        "base_features = input_features.numpy().astype(np.float32)\n",
+        "base_mask = attention_mask.numpy().astype(np.float32)\n",
+        "\n",
+        "# Add the original sample\n",
+        "calibration_data.append({\n",
+        "    'input_features': base_features,\n",
+        "    'attention_mask': base_mask\n",
+        "})\n",
+        "\n",
+        "# Create variations with realistic audio augmentations\n",
+        "np.random.seed(42)\n",
+        "for i in range(99):  # Total 100 calibration samples\n",
+        "    # Add small realistic noise (simulates different recording conditions)\n",
+        "    noise_level = np.random.uniform(0.001, 0.02)\n",
+        "    noisy_features = base_features + np.random.randn(*base_features.shape).astype(np.float32) * noise_level\n",
+        "    \n",
+        "    # Slight volume variation\n",
+        "    volume_scale = np.random.uniform(0.8, 1.2)\n",
+        "    noisy_features = noisy_features * volume_scale\n",
+        "    \n",
+        "    calibration_data.append({\n",
+        "        'input_features': noisy_features,\n",
+        "        'attention_mask': base_mask.copy()\n",
+        "    })\n",
+        "\n",
+        "print(f\"\u2713 Created {len(calibration_data)} calibration samples\")\n",
+        "\n",
+        "# Create NNCF dataset\n",
+        "def transform_fn(data_item):\n",
+        "    return {\n",
+        "        'input_features': data_item['input_features'],\n",
+        "        'attention_mask': data_item['attention_mask']\n",
+        "    }\n",
+        "\n",
+        "calibration_dataset = Dataset(calibration_data, transform_fn)\n",
+        "\n",
+        "print(\"\\nQuantizing to INT8 with TRANSFORMER preset...\")\n",
+        "print(\"This may take a few minutes...\")\n",
+        "\n",
+        "quantized_model = nncf.quantize(\n",
+        "    model=ov_model,\n",
+        "    calibration_dataset=calibration_dataset,\n",
+        "    subset_size=min(100, len(calibration_data)),\n",
+        "    model_type=nncf.ModelType.TRANSFORMER,\n",
+        "    fast_bias_correction=True\n",
+        ")\n",
+        "\n",
+        "print(\"\u2713 Quantization complete!\")\n",
+        "\n",
+        "# Verify INT8 model inputs\n",
+        "print(\"\\nQuantized model inputs:\")\n",
+        "for inp in quantized_model.inputs:\n",
+        "    print(f\"  {inp.get_any_name()}: {inp.partial_shape}\")\n",
+        "\n",
+        "# Save INT8 model\n",
+        "ov.save_model(quantized_model, INT8_MODEL_PATH, compress_to_fp16=False)\n",
+        "\n",
+        "int8_size = (os.path.getsize(INT8_MODEL_PATH) + os.path.getsize(INT8_MODEL_PATH.with_suffix('.bin'))) / 1024 / 1024\n",
+        "print(f\"\\n\u2713 INT8 model saved: {INT8_MODEL_PATH}\")\n",
+        "print(f\"\u2713 Model size: {int8_size:.2f} MB\")\n",
+        "print(f\"\u2713 Compression ratio: {fp16_size/int8_size:.2f}x\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 51,
+      "id": "2a72831c",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Quantized model statistics:\n",
+            "  FakeQuantize ops: 192\n",
+            "  Convolution ops: 37\n",
+            "  MatMul ops: 138\n",
+            "  Total ops: 4053\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Display quantization statistics\n",
+        "op_types = {}\n",
+        "for op in quantized_model.get_ops():\n",
+        "    op_type = op.get_type_name()\n",
+        "    op_types[op_type] = op_types.get(op_type, 0) + 1\n",
+        "\n",
+        "print(\"Quantized model statistics:\")\n",
+        "print(f\"  FakeQuantize ops: {op_types.get('FakeQuantize', 0)}\")\n",
+        "print(f\"  Convolution ops: {op_types.get('Convolution', 0)}\")\n",
+        "print(f\"  MatMul ops: {op_types.get('MatMul', 0)}\")\n",
+        "print(f\"  Total ops: {sum(op_types.values())}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "d706d283",
+      "metadata": {},
+      "source": [
+        "## 8. Accuracy Comparison \n",
+        "\n",
+        "Compare accuracy of PyTorch, FP16, and INT8 models to ensure quantization quality."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 52,
+      "id": "4cd877eb",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "======================================================================\n",
+            "ACCURACY COMPARISON: PyTorch vs FP16 vs INT8\n",
+            "======================================================================\n",
+            "\n",
+            "Compiling models for GPU...\n",
+            "\n",
+            "--- Transcriptions ---\n",
+            "PyTorch: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n",
+            "FP16:    [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n",
+            "INT8:    [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNiQe\n",
+            "\n",
+            "--- Token Match Accuracy ---\n",
+            "FP16 vs PyTorch: 100.00%\n",
+            "INT8 vs PyTorch: 98.38%\n",
+            "INT8 vs FP16:    98.38%\n",
+            "\n",
+            "--- Logit Correlation ---\n",
+            "FP16 vs PyTorch: 1.000000\n",
+            "INT8 vs PyTorch: 0.996360\n",
+            "\n",
+            "======================================================================\n",
+            "\u2713 ACCURACY CHECK PASSED\n",
+            "======================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "import openvino as ov\n",
+        "\n",
+        "print(\"=\"*70)\n",
+        "print(\"ACCURACY COMPARISON: PyTorch vs FP16 vs INT8\")\n",
+        "print(\"=\"*70)\n",
+        "\n",
+        "core = ov.Core()\n",
+        "\n",
+        "# Prepare input data\n",
+        "np_features = input_features.numpy().astype(np.float32)\n",
+        "np_mask = attention_mask.numpy().astype(np.float32)\n",
+        "\n",
+        "# Compile models for GPU with an f32 precision hint for an accurate comparison\n",
+        "print(\"\\nCompiling models for GPU...\")\n",
+        "fp16_compiled = core.compile_model(FP16_MODEL_PATH, \"GPU\", {\"PERFORMANCE_HINT\": \"LATENCY\", \"INFERENCE_PRECISION_HINT\": \"f32\"})\n",
+        "int8_compiled = core.compile_model(INT8_MODEL_PATH, \"GPU\", {\"PERFORMANCE_HINT\": \"LATENCY\", \"INFERENCE_PRECISION_HINT\": \"f32\"})\n",
+        "\n",
+        "# FP16 inference\n",
+        "fp16_out = fp16_compiled({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "fp16_logits = fp16_out[0]\n",
+        "fp16_ids = np.argmax(fp16_logits, axis=-1)\n",
+        "fp16_text = tokenizer.batch_decode(fp16_ids)[0]\n",
+        "\n",
+        "# INT8 inference\n",
+        "int8_out = int8_compiled({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "int8_logits = int8_out[0]\n",
+        "int8_ids = np.argmax(int8_logits, axis=-1)\n",
+        "int8_text = tokenizer.batch_decode(int8_ids)[0]\n",
+        "\n",
+        "print(\"\\n--- Transcriptions ---\")\n",
+        "print(f\"PyTorch: {pt_transcription}\")\n",
+        "print(f\"FP16:    {fp16_text}\")\n",
+        "print(f\"INT8:    {int8_text}\")\n",
+        "\n",
+        "# Calculate accuracy metrics\n",
+        "def calculate_accuracy(ref_ids, hyp_ids):\n",
+        "    return np.mean(ref_ids == hyp_ids) * 100\n",
+        "\n",
+        "fp16_vs_pytorch = calculate_accuracy(pt_ids, fp16_ids)\n",
+        "int8_vs_pytorch = calculate_accuracy(pt_ids, int8_ids)\n",
+        "int8_vs_fp16 = calculate_accuracy(fp16_ids, int8_ids)\n",
+        "\n",
+        "print(\"\\n--- Token Match Accuracy ---\")\n",
+        "print(f\"FP16 vs PyTorch: {fp16_vs_pytorch:.2f}%\")\n",
+        "print(f\"INT8 vs PyTorch: {int8_vs_pytorch:.2f}%\")\n",
+        "print(f\"INT8 vs FP16:    {int8_vs_fp16:.2f}%\")\n",
+        "\n",
+        "# Logit correlation\n",
+        "fp16_corr = np.corrcoef(pt_logits.flatten(), fp16_logits.flatten())[0, 1]\n",
+        "int8_corr = np.corrcoef(pt_logits.flatten(), int8_logits.flatten())[0, 1]\n",
+        "\n",
+        "print(\"\\n--- Logit Correlation ---\")\n",
+        "print(f\"FP16 vs PyTorch: {fp16_corr:.6f}\")\n",
+        "print(f\"INT8 vs PyTorch: {int8_corr:.6f}\")\n",
+        "\n",
+        "print(\"\\n\" + \"=\"*70)\n",
+        "if fp16_vs_pytorch >= 99.0 and int8_vs_pytorch >= 95.0:\n",
+        "    print(\"\u2713 ACCURACY CHECK PASSED\")\n",
+        "else:\n",
+        "    print(\"\u26a0 ACCURACY CHECK: Review results above\")\n",
+        "print(\"=\"*70)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "2f345564",
+      "metadata": {},
+      "source": [
+        "## 9. Performance Benchmarking \n",
+        "\n",
+        "Benchmark FP16 and INT8 models on GPU and CPU.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 53,
+      "id": "2a34bc9a",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "======================================================================\n",
+            "PERFORMANCE BENCHMARKING\n",
+            "======================================================================\n",
+            "Available devices: ['CPU', 'GPU', 'NPU']\n",
+            "\n",
+            "--- GPU Benchmarks ---\n",
+            "FP16: 38.78ms (min: 37.15ms)\n",
+            "INT8: 6.57ms (min: 6.41ms)\n",
+            "Speedup: 5.90x\n",
+            "\n",
+            "--- CPU Benchmarks ---\n",
+            "FP16: 140.30ms (min: 138.57ms)\n",
+            "INT8: 45.81ms (min: 45.39ms)\n",
+            "Speedup: 3.06x\n",
+            "\n",
+            "======================================================================\n",
+            "SUMMARY\n",
+            "======================================================================\n",
+            "\n",
+            "Model sizes:\n",
+            "  FP16: 402.71 MB\n",
+            "  INT8: 103.51 MB\n",
+            "  Compression: 3.89x\n",
+            "\n",
+            "Accuracy (vs PyTorch):\n",
+            "  FP16: 100.00%\n",
+            "  INT8: 98.38%\n",
+            "======================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(\"=\"*70)\n",
+        "print(\"PERFORMANCE BENCHMARKING\")\n",
+        "print(\"=\"*70)\n",
+        "\n",
+        "core = ov.Core()\n",
+        "available_devices = core.available_devices\n",
+        "print(f\"Available devices: {available_devices}\")\n",
+        "\n",
+        "results = {}\n",
+        "\n",
+        "# Benchmark configurations\n",
+        "devices_to_test = [\"GPU\", \"CPU\"] if \"GPU\" in available_devices else [\"CPU\"]\n",
+        "\n",
+        "for device in devices_to_test:\n",
+        "    print(f\"\\n--- {device} Benchmarks ---\")\n",
+        "    \n",
+        "    # Device-specific config\n",
+        "    \n",
+        "    config = {\"PERFORMANCE_HINT\": \"LATENCY\"}\n",
+        "    if device == \"GPU\":\n",
+        "        config[\"INFERENCE_PRECISION_HINT\"] = \"f32\"\n",
+        "        \n",
+        "    \n",
+        "    # FP16 benchmark\n",
+        "    fp16_model = core.compile_model(FP16_MODEL_PATH, device, config)\n",
+        "    \n",
+        "    # Warmup\n",
+        "    for _ in range(10):\n",
+        "        fp16_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "    \n",
+        "    # Benchmark\n",
+        "    fp16_latencies = []\n",
+        "    for _ in range(100):\n",
+        "        start = time.time()\n",
+        "        fp16_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "        fp16_latencies.append((time.time() - start) * 1000)\n",
+        "    \n",
+        "    fp16_median = np.median(fp16_latencies)\n",
+        "    fp16_min = np.min(fp16_latencies)\n",
+        "    \n",
+        "    # INT8 benchmark\n",
+        "    int8_model = core.compile_model(INT8_MODEL_PATH, device, config)\n",
+        "    \n",
+        "    # Warmup\n",
+        "    for _ in range(10):\n",
+        "        int8_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "    \n",
+        "    # Benchmark\n",
+        "    int8_latencies = []\n",
+        "    for _ in range(100):\n",
+        "        start = time.time()\n",
+        "        int8_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n",
+        "        int8_latencies.append((time.time() - start) * 1000)\n",
+        "    \n",
+        "    int8_median = np.median(int8_latencies)\n",
+        "    int8_min = np.min(int8_latencies)\n",
+        "    \n",
+        "    speedup = fp16_median / int8_median\n",
+        "    \n",
+        "    print(f\"FP16: {fp16_median:.2f}ms (min: {fp16_min:.2f}ms)\")\n",
+        "    print(f\"INT8: {int8_median:.2f}ms (min: {int8_min:.2f}ms)\")\n",
+        "    print(f\"Speedup: {speedup:.2f}x\")\n",
+        "    \n",
+        "    results[device] = {\n",
+        "        \"fp16_median_ms\": fp16_median,\n",
+        "        \"int8_median_ms\": int8_median,\n",
+        "        \"speedup\": speedup\n",
+        "    }\n",
+        "\n",
+        "print(\"\\n\" + \"=\"*70)\n",
+        "print(\"SUMMARY\")\n",
+        "print(\"=\"*70)\n",
+        "print(\"\\nModel sizes:\")\n",
+        "print(f\"  FP16: {fp16_size:.2f} MB\")\n",
+        "print(f\"  INT8: {int8_size:.2f} MB\")\n",
+        "print(f\"  Compression: {fp16_size/int8_size:.2f}x\")\n",
+        "\n",
+        "print(\"\\nAccuracy (vs PyTorch):\")\n",
+        "print(f\"  FP16: {fp16_vs_pytorch:.2f}%\")\n",
+        "print(f\"  INT8: {int8_vs_pytorch:.2f}%\")\n",
+        "\n",
+        "\n",
+        "print(\"=\"*70)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 54,
+      "id": "7b255478",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Saving test data for benchmark scripts...\n",
+            "\u2713 10s data: (1, 998, 128)\n",
+            "\u2713 20s data: (1, 1996, 128)\n",
+            "\u2713 30s data: (1, 2994, 128)\n",
+            "\n",
+            "Files saved for benchmark_medasr_durations.py\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Save test data for benchmark script\n",
+        "print(\"Saving test data for benchmark scripts...\")\n",
+        "\n",
+        "np.save('medasr_input_features_10s.npy', np_features)\n",
+        "np.save('medasr_attention_mask_10s.npy', np_mask)\n",
+        "\n",
+        "# Create 20s and 30s test data by padding\n",
+        "features_20s = np.pad(np_features, ((0,0), (0, SEQ_LEN), (0,0)), mode='edge')\n",
+        "mask_20s = np.pad(np_mask, ((0,0), (0, SEQ_LEN)), mode='constant', constant_values=0)\n",
+        "np.save('medasr_input_features_20s.npy', features_20s)\n",
+        "np.save('medasr_attention_mask_20s.npy', mask_20s)\n",
+        "\n",
+        "features_30s = np.pad(np_features, ((0,0), (0, SEQ_LEN*2), (0,0)), mode='edge')\n",
+        "mask_30s = np.pad(np_mask, ((0,0), (0, SEQ_LEN*2)), mode='constant', constant_values=0)\n",
+        "np.save('medasr_input_features_30s.npy', features_30s)\n",
+        "np.save('medasr_attention_mask_30s.npy', mask_30s)\n",
+        "\n",
+        "print(f\"\u2713 10s data: {np_features.shape}\")\n",
+        "print(f\"\u2713 20s data: {features_20s.shape}\")  \n",
+        "print(f\"\u2713 30s data: {features_30s.shape}\")\n",
+        "print(\"\\nFiles saved for benchmark_medasr_durations.py\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "03d57cf9",
+      "metadata": {},
+      "source": [
+        "## Summary\n",
+        "\n",
+        "This notebook created optimized OpenVINO models for MedASR:\n",
+        "\n",
+        "**Generated Models:**\n",
+        "- `medasr_fp16.xml` - FP16 model for CPU/GPU inference\n",
+        "- `medasr_int8.xml` - INT8 quantized model with ~3.9x compression\n",
+        "\n",
+        "**Key Results:**\n",
+        "- Static model shape: `[1, 998, 128]` (optimized for 10s audio)\n",
+        "- INT8 quantization using real audio calibration data\n",
+        "- GPU acceleration with LATENCY performance hint\n",
+        "\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "base",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.13.12"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
\ No newline at end of file