diff --git a/notebooks/medasr-medical-asr/README.md b/notebooks/medasr-medical-asr/README.md new file mode 100644 index 00000000000..958f35a1dbd --- /dev/null +++ b/notebooks/medasr-medical-asr/README.md @@ -0,0 +1,54 @@ +# MedASR Medical Speech Recognition with OpenVINO + +This notebook demonstrates converting Google's MedASR (Medical Automatic Speech Recognition) model to OpenVINO format with FP16 and INT8 quantization for efficient medical speech-to-text transcription. + +## Overview + +MedASR is a specialized speech recognition model optimized for medical terminology. This tutorial shows how to: + +- Load the MedASR model from HuggingFace +- Convert it to OpenVINO IR format for optimal inference performance +- Apply INT8 quantization using NNCF for model compression +- Compare accuracy and performance across PyTorch, FP16, and INT8 versions + +## Key Features + +- **Model Compression**: 3.9x size reduction (402.7 MB → 103.5 MB) with INT8 quantization +- **High Accuracy**: 98.38% token-level accuracy maintained after INT8 quantization +- **Medical Terminology**: Optimized for accurate medical speech recognition + +## Tutorial Contents + +1. **Installation** - Install required packages (OpenVINO, NNCF, Transformers, etc.) +2. **Load Model** - Load Google's MedASR model from HuggingFace +3. **Prepare Audio Data** - Download and preprocess test audio (optimized for 10s chunks) +4. **PyTorch Inference** - Establish baseline accuracy with the original model +5. **Convert to OpenVINO FP16** - Convert using torch.export and ov.convert_model +6. **INT8 Quantization** - Apply NNCF quantization with real audio calibration +7. **Accuracy Comparison** - Validate quantization quality across all versions +8. **Performance Benchmarking** - Measure inference speed on CPU and GPU + +## Results + +- **Model Size**: 402.7 MB (FP16) → 103.5 MB (INT8) = **3.9x compression** +- **Accuracy**: 98.38% token match between INT8 and PyTorch +- **Model Shape**: Static [1, 998, 128] optimized for 10-second audio chunks + +## Installation + +```bash +pip install -q "openvino>=2024.4.0" "nncf>=2.13.0" "torch>=2.1" "transformers>=5.4.0" "librosa" "soundfile" "huggingface_hub" +``` + +## Important Notes + +⚠️ **Gated Model Access**: The MedASR model is gated on HuggingFace. You must: +1. Request access at https://huggingface.co/google/medasr +2. Authenticate with your HuggingFace token before running the notebook + +## Use Cases + +- Medical transcription systems +- Clinical documentation automation +- Healthcare voice assistants +- Medical education and training platforms
diff --git a/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb b/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb new file mode 100644 index 00000000000..51de91ce0f9 --- /dev/null +++ b/notebooks/medasr-medical-asr/medasr-medical-asr.ipynb @@ -0,0 +1,948 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d2191f78", + "metadata": {}, + "source": [ + "# MedASR Medical Speech Recognition with OpenVINO\n", + "\n", + "This notebook demonstrates converting Google's MedASR (Medical Automatic Speech Recognition) model to OpenVINO format with FP16 and INT8 quantization.\n", + "\n", + "\n", + "**Table of Contents:**\n", + "1. [Installation](#installation)\n", + "2. [Login to HuggingFace](#login-huggingface)\n", + "3. [Load Model](#load-model)\n", + "4. [Prepare Audio Data](#prepare-audio)\n", + "5. [PyTorch Inference](#pytorch-inference)\n", + "6. [Convert to OpenVINO FP16](#convert-fp16)\n", + "7. [INT8 Quantization](#int8-quantization)\n", + "8. [Accuracy Comparison](#accuracy-comparison)\n", + "9. [Performance Benchmarking](#benchmarking)" + ] + }, + { + "cell_type": "markdown", + "id": "17b070bc", + "metadata": {}, + "source": [ + "## 1. Installation \n", + "\n", + "Install required packages for model conversion and optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "00e8dfeb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q \"openvino>=2024.4.0\" \"nncf>=2.13.0\" \"torch>=2.1\" \"transformers>=5.4.0\" \"librosa\" \"soundfile\" \"huggingface_hub\" \"matplotlib\" \"numpy\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Login to HuggingFace \n", + "\n", + "To run the model, you must be a registered user of the \ud83e\udd17 [Hugging Face Hub](https://huggingface.co/). \n", + "\n", + "The MedASR model is gated and requires you to:\n", + "1. Visit the [MedASR model card](https://huggingface.co/google/medasr)\n", + "2. Carefully read the terms of usage\n", + "3. Click the accept button to agree to the license\n", + "\n", + "You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).\n", + "\n", + "You can log in to the Hugging Face Hub using the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Login to HuggingFace Hub to get access to the pretrained model\n", + "\n", + "from huggingface_hub import notebook_login, whoami\n", + "\n", + "try:\n", + "    whoami()\n", + "    print('Authorization token already provided')\n", + "except OSError:\n", + "    notebook_login()" + ] + }, + { + "cell_type": "markdown", + "id": "9af6c1e8", + "metadata": {}, + "source": [ + "## 3. Load Model \n", + "\n", + "Load Google's MedASR model from HuggingFace. This is a CTC-based ASR model optimized for medical terminology."
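, + "\n", + "\n", + "Because this is a CTC model, it emits per-frame token logits; a greedy transcription takes the argmax token at each frame and lets the tokenizer's decoder collapse repeats and blanks (the usual behavior of Hugging Face CTC tokenizers). A minimal sketch, assuming a 16 kHz waveform `speech` and the components loaded in the cell below:\n", + "\n", + "```python\n", + "# Hypothetical greedy CTC decode; later sections run these same steps explicitly\n", + "with torch.no_grad():\n", + "    logits = model(**feature_extractor(speech, sampling_rate=16000, return_tensors=\"pt\")).logits\n", + "text = tokenizer.batch_decode(logits.argmax(dim=-1))[0]\n", + "```"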
+ ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "02634872", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading model: google/medasr\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b34851db22814704b3ef571ba747d64e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading weights: 0%| | 0/368 [00:00, ?it/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u2713 Model loaded: LasrForCTC\n", + "\u2713 Feature extractor: LasrFeatureExtractor\n", + "\u2713 Tokenizer vocab size: 512\n" + ] + } + ], + "source": [ + "from transformers import pipeline\n", + "import huggingface_hub\n", + "import librosa\n", + "import numpy as np\n", + "import torch\n", + "from pathlib import Path\n", + "import time\n", + "\n", + "MODEL_ID = \"google/medasr\"\n", + "print(f\"Loading model: {MODEL_ID}\")\n", + "\n", + "# Load model using pipeline\n", + "pipe = pipeline(\"automatic-speech-recognition\", model=MODEL_ID, trust_remote_code=True)\n", + "\n", + "# Extract model components\n", + "model = pipe.model\n", + "feature_extractor = pipe.feature_extractor\n", + "tokenizer = pipe.tokenizer\n", + "\n", + "print(f\"\u2713 Model loaded: {type(model).__name__}\")\n", + "print(f\"\u2713 Feature extractor: {type(feature_extractor).__name__}\")\n", + "print(f\"\u2713 Tokenizer vocab size: {tokenizer.vocab_size}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d1b7f2e7", + "metadata": {}, + "source": [ + "## 4. Prepare Audio Data \n", + "\n", + "Download test audio and prepare it for model conversion. We use **10-second audio** for optimal GPU performance.\n", + "\n", + "- Creates model with shape `[1, 998, 128]`\n", + "- Longer audio can be processed via chunking" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "e1f5c284", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Full audio duration: 43.80 seconds\n", + "Optimized audio duration: 10.00 seconds\n", + "Sample rate: 16000 Hz\n", + "\n", + "\u2713 Input features shape: torch.Size([1, 998, 128])\n", + "\u2713 Attention mask shape: torch.Size([1, 998])\n", + "\u2713 Model will be created with static shape: [1, 998, 128]\n" + ] + } + ], + "source": [ + "# Download test audio from HuggingFace\n", + "audio_file = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')\n", + "speech_full, sample_rate = librosa.load(audio_file, sr=16000)\n", + "\n", + "print(f\"Full audio duration: {len(speech_full)/sample_rate:.2f} seconds\")\n", + "\n", + "# Use 10s audio for optimal model shape\n", + "OPTIMAL_DURATION = 10.0\n", + "speech_10s = speech_full[:int(OPTIMAL_DURATION * sample_rate)]\n", + "\n", + "print(f\"Optimized audio duration: {len(speech_10s)/sample_rate:.2f} seconds\")\n", + "print(f\"Sample rate: {sample_rate} Hz\")\n", + "\n", + "# Extract features for model conversion\n", + "inputs = feature_extractor(speech_10s, sampling_rate=sample_rate, return_tensors=\"pt\", \n", + " padding=True, return_attention_mask=True)\n", + "\n", + "input_features = inputs.input_features\n", + "attention_mask = inputs.attention_mask.to(torch.float32)\n", + "\n", + "SEQ_LEN = input_features.shape[1]\n", + "FEATURE_DIM = input_features.shape[2]\n", + "\n", + "print(f\"\\n\u2713 Input features shape: {input_features.shape}\")\n", + "print(f\"\u2713 Attention mask shape: 
{attention_mask.shape}\")\n", + "print(f\"\u2713 Model will be created with static shape: [1, {SEQ_LEN}, {FEATURE_DIM}]\")" + ] + }, + { + "cell_type": "markdown", + "id": "15d8fbe5", + "metadata": {}, + "source": [ + "## 5. PyTorch Inference \n", + "\n", + "Run inference with PyTorch model to establish baseline accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "e69f2d24", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "PyTorch Inference Results:\n", + "Transcription: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n", + "Logits shape: (1, 247, 512)\n", + "Logits range: [-26.49, 24.95]\n" + ] + } + ], + "source": [ + "# PyTorch inference\n", + "model.eval()\n", + "with torch.no_grad():\n", + " pt_outputs = model(input_features, attention_mask=attention_mask.long())\n", + " pt_logits = pt_outputs.logits.numpy()\n", + " pt_ids = np.argmax(pt_logits, axis=-1)\n", + " pt_transcription = tokenizer.batch_decode(pt_ids)[0]\n", + "\n", + "print(\"PyTorch Inference Results:\")\n", + "print(f\"Transcription: {pt_transcription}\")\n", + "print(f\"Logits shape: {pt_logits.shape}\")\n", + "print(f\"Logits range: [{pt_logits.min():.2f}, {pt_logits.max():.2f}]\")" + ] + }, + { + "cell_type": "markdown", + "id": "52e069a7", + "metadata": {}, + "source": [ + "## 6. Convert to OpenVINO FP16 \n", + "\n", + "Convert the PyTorch model to OpenVINO IR format using `torch.export` and `ov.convert_model`." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "6894c65f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Converting PyTorch model to OpenVINO IR...\n", + "Input shape: torch.Size([1, 998, 128])\n", + "\u2713 Model exported with torch.export\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/user/miniforge3/lib/python3.13/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.\n", + " return cls.__new__(cls, *args)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u2713 Model reshaped to static: [1, 998, 128]\n", + "\n", + "\u2713 FP16 model saved: medasr_fp16.xml\n", + "\u2713 Model size: 402.71 MB\n", + "\n", + "Model inputs:\n", + " input_features: [1,998,128]\n", + " attention_mask: [1,998]\n" + ] + } + ], + "source": [ + "import openvino as ov\n", + "import os\n", + "\n", + "FP16_MODEL_PATH = Path(\"medasr_fp16.xml\")\n", + "\n", + "# Create model wrapper for clean export\n", + "class MedASRWrapper(torch.nn.Module):\n", + " def __init__(self, model):\n", + " super().__init__()\n", + " self.model = model\n", + " \n", + " def forward(self, input_features, attention_mask):\n", + " outputs = self.model(input_features=input_features, attention_mask=attention_mask)\n", + " return outputs.logits\n", + "\n", + "wrapped_model = MedASRWrapper(model)\n", + "wrapped_model.eval()\n", + "\n", + "print(\"Converting PyTorch model to OpenVINO IR...\")\n", + "print(f\"Input shape: {input_features.shape}\")\n", + "\n", + "with torch.no_grad():\n", + " # Export using torch.export\n", + " exported = torch.export.export(\n", + " wrapped_model,\n", + " (input_features, attention_mask)\n", + " )\n", + " print(\"\u2713 Model exported with torch.export\")\n", + " \n", + " # Convert to OpenVINO\n", + " ov_model = ov.convert_model(exported)\n", + " 
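\n", + "    # NOTE: torch.export traces with the example tensors above, so the exported\n", + "    # graph is already specialized to their static shapes; the reshape below just\n", + "    # makes the [1, 998, 128] static shape explicit in the OpenVINO model\n", + "    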
\n", + " # Reshape to static shape for optimal GPU performance\n", + " ov_model.reshape({\n", + " 'input_features': [1, SEQ_LEN, FEATURE_DIM],\n", + " 'attention_mask': [1, SEQ_LEN]\n", + " })\n", + " print(f\"\u2713 Model reshaped to static: [1, {SEQ_LEN}, {FEATURE_DIM}]\")\n", + "\n", + "# Save FP16 model (without FP16 compression to avoid GPU numerical issues)\n", + "ov.save_model(ov_model, FP16_MODEL_PATH, compress_to_fp16=False)\n", + "\n", + "fp16_size = (os.path.getsize(FP16_MODEL_PATH) + os.path.getsize(FP16_MODEL_PATH.with_suffix('.bin'))) / 1024 / 1024\n", + "print(f\"\\n\u2713 FP16 model saved: {FP16_MODEL_PATH}\")\n", + "print(f\"\u2713 Model size: {fp16_size:.2f} MB\")\n", + "\n", + "# Verify model inputs\n", + "print(\"\\nModel inputs:\")\n", + "for inp in ov_model.inputs:\n", + " print(f\" {inp.get_any_name()}: {inp.partial_shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "58ecb7a5", + "metadata": {}, + "source": [ + "## 7. INT8 Quantization \n", + "\n", + "Quantize the model to INT8 using NNCF with **real audio data** for calibration.\n", + "\n", + "**Key Settings:**\n", + "- `ModelType.TRANSFORMER` - Optimized for transformer models\n", + "- Real audio calibration data - Better accuracy than random data\n", + "- `fast_bias_correction` - Faster quantization with good results" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "f53fdd7b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Preparing calibration data from real audio...\n", + "\u2713 Created 100 calibration samples\n", + "\n", + "Quantizing to INT8 with TRANSFORMER preset...\n", + "This may take a few minutes...\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b365a953a36c4e6eb426ab3a6391e10e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0f684296f1e04d90b8291eec9f0e3cb2", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "8599299331864fda878f1a3a6771a45b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "dddda89a90154b648c15889d1bebf978", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u2713 Quantization complete!\n", + "\n", + "Quantized model inputs:\n", + " input_features: [1,998,128]\n", + " attention_mask: [1,998]\n", + "\n", + "\u2713 INT8 model saved: medasr_int8.xml\n", + "\u2713 Model size: 103.51 MB\n", + "\u2713 Compression ratio: 3.89x\n" + ] + } + ], + "source": [ + "import nncf\n", + "from nncf import Dataset\n", + "\n", + "INT8_MODEL_PATH = Path(\"medasr_int8.xml\")\n", + "\n", + "print(\"Preparing calibration data from real audio...\")\n", + "\n", + "# Create calibration data from the test audio with variations\n", + "calibration_data = []\n", + "\n", + "# Use the real audio features as base\n", + "base_features = input_features.numpy().astype(np.float32)\n", + "base_mask = attention_mask.numpy().astype(np.float32)\n", + "\n", + "# Add the original sample\n", + "calibration_data.append({\n", + " 'input_features': base_features,\n", + " 'attention_mask': base_mask\n", + "})\n", + "\n", + "# Create variations with realistic audio augmentations\n", + "np.random.seed(42)\n", + "for i in range(99): # Total 100 calibration samples\n", + " # Add small realistic noise (simulates different recording conditions)\n", + " noise_level = np.random.uniform(0.001, 0.02)\n", + " noisy_features = base_features + np.random.randn(*base_features.shape).astype(np.float32) * noise_level\n", + " \n", + " # Slight volume variation\n", + " volume_scale = np.random.uniform(0.8, 1.2)\n", + " noisy_features = noisy_features * volume_scale\n", + " \n", + " calibration_data.append({\n", + " 'input_features': noisy_features,\n", + " 'attention_mask': base_mask.copy()\n", + " })\n", + "\n", + "print(f\"\u2713 Created {len(calibration_data)} calibration samples\")\n", + "\n", + "# Create NNCF dataset\n", + "def transform_fn(data_item):\n", + " return {\n", + " 'input_features': data_item['input_features'],\n", + " 'attention_mask': data_item['attention_mask']\n", + " }\n", + "\n", + "calibration_dataset = Dataset(calibration_data, transform_fn)\n", + "\n", + "print(\"\\nQuantizing to INT8 with TRANSFORMER preset...\")\n", + "print(\"This may take a few minutes...\")\n", + "\n", + "quantized_model = nncf.quantize(\n", + " model=ov_model,\n", + " calibration_dataset=calibration_dataset,\n", + " subset_size=min(100, 
len(calibration_data)),\n", + " model_type=nncf.ModelType.TRANSFORMER,\n", + " fast_bias_correction=True\n", + ")\n", + "\n", + "print(\"\u2713 Quantization complete!\")\n", + "\n", + "# Verify INT8 model inputs\n", + "print(\"\\nQuantized model inputs:\")\n", + "for inp in quantized_model.inputs:\n", + " print(f\" {inp.get_any_name()}: {inp.partial_shape}\")\n", + "\n", + "# Save INT8 model\n", + "ov.save_model(quantized_model, INT8_MODEL_PATH, compress_to_fp16=False)\n", + "\n", + "int8_size = (os.path.getsize(INT8_MODEL_PATH) + os.path.getsize(INT8_MODEL_PATH.with_suffix('.bin'))) / 1024 / 1024\n", + "print(f\"\\n\u2713 INT8 model saved: {INT8_MODEL_PATH}\")\n", + "print(f\"\u2713 Model size: {int8_size:.2f} MB\")\n", + "print(f\"\u2713 Compression ratio: {fp16_size/int8_size:.2f}x\")" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "2a72831c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Quantized model statistics:\n", + " FakeQuantize ops: 192\n", + " Convolution ops: 37\n", + " MatMul ops: 138\n", + " Total ops: 4053\n" + ] + } + ], + "source": [ + "# Display quantization statistics\n", + "op_types = {}\n", + "for op in quantized_model.get_ops():\n", + " op_type = op.get_type_name()\n", + " op_types[op_type] = op_types.get(op_type, 0) + 1\n", + "\n", + "print(\"Quantized model statistics:\")\n", + "print(f\" FakeQuantize ops: {op_types.get('FakeQuantize', 0)}\")\n", + "print(f\" Convolution ops: {op_types.get('Convolution', 0)}\")\n", + "print(f\" MatMul ops: {op_types.get('MatMul', 0)}\")\n", + "print(f\" Total ops: {sum(op_types.values())}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d706d283", + "metadata": {}, + "source": [ + "## 8. Accuracy Comparison \n", + "\n", + "Compare accuracy of PyTorch, FP16, and INT8 models to ensure quantization quality." 
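, + "\n", + "\n", + "The token-match metric used below is frame-level agreement of the argmax token IDs (before CTC collapsing), so it is stricter than a word-level score. As a complementary sanity check, a word error rate could also be computed on the decoded strings; a sketch, assuming the `jiwer` package (not installed by this notebook) and run after the cell below:\n", + "\n", + "```python\n", + "# Hypothetical WER check on the transcriptions produced by the comparison cell\n", + "import jiwer\n", + "print(f\"INT8 WER vs PyTorch: {jiwer.wer(pt_transcription, int8_text):.4f}\")\n", + "```"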
+ ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "4cd877eb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "ACCURACY COMPARISON: PyTorch vs FP16 vs INT8\n", + "======================================================================\n", + "\n", + "Compiling models for GPU...\n", + "\n", + "--- Transcriptions ---\n", + "PyTorch: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n", + "FP16: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNIQe\n", + "INT8: [EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period}TECchHNiQe\n", + "\n", + "--- Token Match Accuracy ---\n", + "FP16 vs PyTorch: 100.00%\n", + "INT8 vs PyTorch: 98.38%\n", + "INT8 vs FP16: 98.38%\n", + "\n", + "--- Logit Correlation ---\n", + "FP16 vs PyTorch: 1.000000\n", + "INT8 vs PyTorch: 0.996360\n", + "\n", + "======================================================================\n", + "\u2713 ACCURACY CHECK PASSED\n", + "======================================================================\n" + ] + } + ], + "source": [ + "import openvino as ov\n", + "\n", + "print(\"=\"*70)\n", + "print(\"ACCURACY COMPARISON: PyTorch vs FP16 vs INT8\")\n", + "print(\"=\"*70)\n", + "\n", + "core = ov.Core()\n", + "\n", + "# Prepare input data\n", + "np_features = input_features.numpy().astype(np.float32)\n", + "np_mask = attention_mask.numpy().astype(np.float32)\n", + "\n", + "# Compile on GPU when available (an f32 precision hint keeps the comparison numerically fair), else fall back to CPU\n", + "device = \"GPU\" if \"GPU\" in core.available_devices else \"CPU\"\n", + "print(f\"\\nCompiling models for {device}...\")\n", + "fp16_compiled = core.compile_model(FP16_MODEL_PATH, device, {\"PERFORMANCE_HINT\": \"LATENCY\", \"INFERENCE_PRECISION_HINT\": \"f32\"})\n", + "int8_compiled = core.compile_model(INT8_MODEL_PATH, device, {\"PERFORMANCE_HINT\": \"LATENCY\", \"INFERENCE_PRECISION_HINT\": \"f32\"})\n", + "\n", + "# FP16 inference\n", + "fp16_out = fp16_compiled({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + "fp16_logits = fp16_out[0]\n", + "fp16_ids = np.argmax(fp16_logits, axis=-1)\n", + "fp16_text = tokenizer.batch_decode(fp16_ids)[0]\n", + "\n", + "# INT8 inference\n", + "int8_out = int8_compiled({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + "int8_logits = int8_out[0]\n", + "int8_ids = np.argmax(int8_logits, axis=-1)\n", + "int8_text = tokenizer.batch_decode(int8_ids)[0]\n", + "\n", + "print(\"\\n--- Transcriptions ---\")\n", + "print(f\"PyTorch: {pt_transcription}\")\n", + "print(f\"FP16: {fp16_text}\")\n", + "print(f\"INT8: {int8_text}\")\n", + "\n", + "# Calculate accuracy metrics\n", + "def calculate_accuracy(ref_ids, hyp_ids):\n", + "    return np.mean(ref_ids == hyp_ids) * 100\n", + "\n", + "fp16_vs_pytorch = calculate_accuracy(pt_ids, fp16_ids)\n", + "int8_vs_pytorch = calculate_accuracy(pt_ids, int8_ids)\n", + "int8_vs_fp16 = calculate_accuracy(fp16_ids, int8_ids)\n", + "\n", + "print(\"\\n--- Token Match Accuracy ---\")\n", + "print(f\"FP16 vs PyTorch: {fp16_vs_pytorch:.2f}%\")\n", + "print(f\"INT8 vs PyTorch: {int8_vs_pytorch:.2f}%\")\n", + "print(f\"INT8 vs FP16: {int8_vs_fp16:.2f}%\")\n", + "\n", + "# Logit correlation\n", + "fp16_corr = np.corrcoef(pt_logits.flatten(), fp16_logits.flatten())[0, 1]\n", + "int8_corr = np.corrcoef(pt_logits.flatten(), 
int8_logits.flatten())[0, 1]\n", + "\n", + "print(\"\\n--- Logit Correlation ---\")\n", + "print(f\"FP16 vs PyTorch: {fp16_corr:.6f}\")\n", + "print(f\"INT8 vs PyTorch: {int8_corr:.6f}\")\n", + "\n", + "print(\"\\n\" + \"=\"*70)\n", + "if fp16_vs_pytorch >= 99.0 and int8_vs_pytorch >= 95.0:\n", + " print(\"\u2713 ACCURACY CHECK PASSED\")\n", + "else:\n", + " print(\"\u26a0 ACCURACY CHECK: Review results above\")\n", + "print(\"=\"*70)" + ] + }, + { + "cell_type": "markdown", + "id": "2f345564", + "metadata": {}, + "source": [ + "## 9. Performance Benchmarking \n", + "\n", + "Benchmark FP16 and INT8 models on GPU and CPU.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "2a34bc9a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "PERFORMANCE BENCHMARKING\n", + "======================================================================\n", + "Available devices: ['CPU', 'GPU', 'NPU']\n", + "\n", + "--- GPU Benchmarks ---\n", + "FP16: 38.78ms (min: 37.15ms)\n", + "INT8: 6.57ms (min: 6.41ms)\n", + "Speedup: 5.90x\n", + "\n", + "--- CPU Benchmarks ---\n", + "FP16: 140.30ms (min: 138.57ms)\n", + "INT8: 45.81ms (min: 45.39ms)\n", + "Speedup: 3.06x\n", + "\n", + "======================================================================\n", + "SUMMARY\n", + "======================================================================\n", + "\n", + "Model sizes:\n", + " FP16: 402.71 MB\n", + " INT8: 103.51 MB\n", + " Compression: 3.89x\n", + "\n", + "Accuracy (vs PyTorch):\n", + " FP16: 100.00%\n", + " INT8: 98.38%\n", + "======================================================================\n" + ] + } + ], + "source": [ + "print(\"=\"*70)\n", + "print(\"PERFORMANCE BENCHMARKING\")\n", + "print(\"=\"*70)\n", + "\n", + "core = ov.Core()\n", + "available_devices = core.available_devices\n", + "print(f\"Available devices: {available_devices}\")\n", + "\n", + "results = {}\n", + "\n", + "# Benchmark configurations\n", + "devices_to_test = [\"GPU\", \"CPU\"] if \"GPU\" in available_devices else [\"CPU\"]\n", + "\n", + "for device in devices_to_test:\n", + " print(f\"\\n--- {device} Benchmarks ---\")\n", + " \n", + " # Device-specific config\n", + " \n", + " config = {\"PERFORMANCE_HINT\": \"LATENCY\"}\n", + " if device == \"GPU\":\n", + " config[\"INFERENCE_PRECISION_HINT\"] = \"f32\"\n", + " \n", + " \n", + " # FP16 benchmark\n", + " fp16_model = core.compile_model(FP16_MODEL_PATH, device, config)\n", + " \n", + " # Warmup\n", + " for _ in range(10):\n", + " fp16_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + " \n", + " # Benchmark\n", + " fp16_latencies = []\n", + " for _ in range(100):\n", + " start = time.time()\n", + " fp16_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + " fp16_latencies.append((time.time() - start) * 1000)\n", + " \n", + " fp16_median = np.median(fp16_latencies)\n", + " fp16_min = np.min(fp16_latencies)\n", + " \n", + " # INT8 benchmark\n", + " int8_model = core.compile_model(INT8_MODEL_PATH, device, config)\n", + " \n", + " # Warmup\n", + " for _ in range(10):\n", + " int8_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + " \n", + " # Benchmark\n", + " int8_latencies = []\n", + " for _ in range(100):\n", + " start = time.time()\n", + " int8_model({\"input_features\": np_features, \"attention_mask\": np_mask})\n", + " int8_latencies.append((time.time() - start) * 
1000)\n", + "    \n", + "    int8_median = np.median(int8_latencies)\n", + "    int8_min = np.min(int8_latencies)\n", + "    \n", + "    speedup = fp16_median / int8_median\n", + "    \n", + "    print(f\"FP16: {fp16_median:.2f}ms (min: {fp16_min:.2f}ms)\")\n", + "    print(f\"INT8: {int8_median:.2f}ms (min: {int8_min:.2f}ms)\")\n", + "    print(f\"Speedup: {speedup:.2f}x\")\n", + "    \n", + "    results[device] = {\n", + "        \"fp16_median_ms\": fp16_median,\n", + "        \"int8_median_ms\": int8_median,\n", + "        \"speedup\": speedup\n", + "    }\n", + "\n", + "print(\"\\n\" + \"=\"*70)\n", + "print(\"SUMMARY\")\n", + "print(\"=\"*70)\n", + "print(f\"\\nModel sizes:\")\n", + "print(f\"  FP16: {fp16_size:.2f} MB\")\n", + "print(f\"  INT8: {int8_size:.2f} MB\")\n", + "print(f\"  Compression: {fp16_size/int8_size:.2f}x\")\n", + "\n", + "print(f\"\\nAccuracy (vs PyTorch):\")\n", + "print(f\"  FP16: {fp16_vs_pytorch:.2f}%\")\n", + "print(f\"  INT8: {int8_vs_pytorch:.2f}%\")\n", + "\n", + "\n", + "print(\"=\"*70)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "7b255478", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saving test data for benchmark scripts...\n", + "\u2713 10s data: (1, 998, 128)\n", + "\u2713 20s data: (1, 1996, 128)\n", + "\u2713 30s data: (1, 2994, 128)\n", + "\n", + "Files saved for benchmark_medasr_durations.py\n" + ] + } + ], + "source": [ + "# Save test data for benchmark script\n", + "print(\"Saving test data for benchmark scripts...\")\n", + "\n", + "np.save('medasr_input_features_10s.npy', np_features)\n", + "np.save('medasr_attention_mask_10s.npy', np_mask)\n", + "\n", + "# Create 20s and 30s test data by padding\n", + "features_20s = np.pad(np_features, ((0,0), (0, SEQ_LEN), (0,0)), mode='edge')\n", + "mask_20s = np.pad(np_mask, ((0,0), (0, SEQ_LEN)), mode='constant', constant_values=0)\n", + "np.save('medasr_input_features_20s.npy', features_20s)\n", + "np.save('medasr_attention_mask_20s.npy', mask_20s)\n", + "\n", + "features_30s = np.pad(np_features, ((0,0), (0, SEQ_LEN*2), (0,0)), mode='edge')\n", + "mask_30s = np.pad(np_mask, ((0,0), (0, SEQ_LEN*2)), mode='constant', constant_values=0)\n", + "np.save('medasr_input_features_30s.npy', features_30s)\n", + "np.save('medasr_attention_mask_30s.npy', mask_30s)\n", + "\n", + "print(f\"\u2713 10s data: {np_features.shape}\")\n", + "print(f\"\u2713 20s data: {features_20s.shape}\") \n", + "print(f\"\u2713 30s data: {features_30s.shape}\")\n", + "print(\"\\nFiles saved for benchmark_medasr_durations.py\")" + ] + }, + { + "cell_type": "markdown", + "id": "03d57cf9", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "This notebook created optimized OpenVINO models for MedASR:\n", + "\n", + "**Generated Models:**\n", + "- `medasr_fp16.xml` - FP16 model for CPU/GPU inference\n", + "- `medasr_int8.xml` - INT8 quantized model with ~3.9x compression\n", + "\n", + "**Key Results:**\n", + "- Static model shape: `[1, 998, 128]` (optimized for 10s audio)\n", + "- INT8 quantization using real audio calibration data\n", + "- GPU acceleration with LATENCY performance hint\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No 
newline at end of file