---
# User change
title: "Set up the Whisper Model"

weight: 2

# Do not modify these elements
layout: "learningpathall"
---

## Before you begin

This Learning Path demonstrates how to run the whisper-large-v3-turbo model as an application that takes audio input and produces a text transcript of it. The instructions in this Learning Path are designed for Arm servers running Ubuntu 24.04 LTS. You need an Arm server instance with 32 cores, at least 8GB of RAM, and 32GB of disk space to run this example. The instructions have been tested on an AWS c8g.8xlarge instance.

## Overview

OpenAI Whisper is an open-source Automatic Speech Recognition (ASR) model trained on multilingual and multitask data, which enables transcript generation in multiple languages as well as translation from those languages into English. This Learning Path explores the foundational aspects of speech-to-text transcription applications, focusing on running OpenAI's Whisper on an Arm CPU, and discusses the implementation and performance considerations required to deploy Whisper efficiently using the Hugging Face Transformers framework.

## Install dependencies

Install the following packages on your Arm-based server instance:

```bash
sudo apt update
sudo apt install python3-pip python3-venv ffmpeg -y
```

## Install Python dependencies

Create a Python virtual environment:

```bash
python3 -m venv whisper-env
```

Activate the virtual environment:

```bash
source whisper-env/bin/activate
```

Install the required libraries using pip:

```bash
pip install torch transformers accelerate
```
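
Before moving on, you can optionally confirm that the packages resolved in the active environment. This is a minimal sketch; it only checks that each module can be found by the interpreter, not that it is fully functional:

```python
import importlib.util

# Check that each pip-installed package can be resolved from this
# interpreter; the names are the modules' import names
for name in ("torch", "transformers", "accelerate"):
    spec = importlib.util.find_spec(name)
    print(f"{name}: {'found' if spec is not None else 'MISSING'}")
```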

## Download the sample audio file

Download a sample audio file, an approximately 33-second clip in .wav format, or use your own audio file:

```bash
wget https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
```
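
The `8k` in the filename indicates an 8 kHz sample rate. If you want to confirm the properties of a clip before transcribing it, Python's standard `wave` module can read the header. The following is a small optional sketch; the commented filename matches the file downloaded above:

```python
import wave

def wav_info(path):
    # Read sample rate, channel count, and duration from a PCM .wav header
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        frames = w.getnframes()
    return rate, channels, frames / rate

# Example, after the download above:
# rate, channels, seconds = wav_info("OSR_us_000_0010_8k.wav")
# print(f"{rate} Hz, {channels} channel(s), {seconds:.1f} s")
```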

## Create a Python script for audio-to-text transcription

Create a Python file:

```bash
vim whisper-application.py
```

Write the following code in the `whisper-application.py` file:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import time

# Set the device to CPU and specify the torch data type
device = "cpu"
torch_dtype = torch.float32

# Specify the model name
model_id = "openai/whisper-large-v3-turbo"

# Load the model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)

# Move the model to the specified device
model.to(device)

# Load the processor for the model
processor = AutoProcessor.from_pretrained(model_id)

# Create a pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True
)

# Record the start time of the inference
start_time = time.time()

# Perform speech recognition on the audio file
result = pipe("OSR_us_000_0010_8k.wav")

# Record the end time of the inference
end_time = time.time()

# Print the transcribed text
print(f'\n{result["text"]}\n')

# Calculate and print the duration of the inference
duration = end_time - start_time
hours = duration // 3600
minutes = (duration - (hours * 3600)) // 60
seconds = duration - ((hours * 3600) + (minutes * 60))
msg = f'\nInferencing elapsed time: {int(hours)}h {int(minutes)}m {seconds:4.2f}s\n'

print(msg)
```
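
As a side note, the hour/minute/second arithmetic at the end of the script can be expressed more compactly with Python's built-in `divmod`; this standalone sketch shows the equivalent computation:

```python
def format_elapsed(duration):
    # Split a duration in seconds into hours, minutes, and remaining seconds
    hours, remainder = divmod(duration, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours)}h {int(minutes)}m {seconds:4.2f}s"

print(format_elapsed(3723.5))  # → 1h 2m 3.50s
```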

## Use the Arm-specific flags

Set the following environment variables before running Whisper on Arm machines. They enable the BF16 fast math GEMM kernels, turn on Linux Transparent Huge Page (THP) allocations, set the LRU cache capacity, pin the number of OpenMP threads, and enable verbose logs so you can confirm which kernels are used:

```bash
export DNNL_DEFAULT_FPMATH_MODE=BF16
export THP_MEM_ALLOC_ENABLE=1
export LRU_CACHE_CAPACITY=1024
export OMP_NUM_THREADS=32
export DNNL_VERBOSE=1
```
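
`OMP_NUM_THREADS=32` matches the core count of the c8g.8xlarge instance used here. On a different instance size you can derive the value from the visible core count instead; a possible sketch, assuming GNU coreutils' `nproc` is available:

```bash
# Match the OpenMP thread count to the cores visible to this shell
export OMP_NUM_THREADS=$(nproc)
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```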
{{% notice Note %}}
BF16 support is included in PyTorch versions later than 2.3.0.
{{% /notice %}}