Commit 5121458

Merge pull request #1646 from annietllnd/whisper-lp

Technical review of Whisper LP

2 parents eca0118 + b871e58 commit 5121458

5 files changed

+61
-39
lines changed


content/learning-paths/servers-and-cloud-computing/whisper/_index.md

Lines changed: 15 additions & 8 deletions
@@ -3,25 +3,24 @@ title: Run OpenAI Whisper Audio Model efficiently on Arm with Hugging Face Trans
 
 minutes_to_complete: 15
 
-who_is_this_for: This Learning Path is for software developers, ML engineers, and those looking to run Whisper ASR Model on Arm Neoverse based CPUs efficiently and build speech transcription based applications around it.
+who_is_this_for: This Learning Path is for software developers looking to run the Whisper automatic speech recognition (ASR) model efficiently. You will use an Arm-based cloud instance to run and build speech transcription-based applications.
 
 learning_objectives:
 - Install the dependencies to run the Whisper Model
-- Run the OpenAI Whisper model using Hugging Face Transformers framework.
-- Run the whisper-large-v3-turbo model on Arm CPU efficiently.
-- Perform the audio to text transcription with Whisper.
-- Observe the total time taken to generate transcript with Whisper.
+- Run the OpenAI Whisper model using Hugging Face Transformers.
+- Enable performance-enhancing features for running the model on Arm CPUs.
+- Compare the total time taken to generate a transcript with Whisper.
 
 prerequisites:
-- Amazon Graviton4 (or other Arm) compute instance with 32 cores, 8GB of RAM, and 32GB disk space.
+- An [Arm-based compute instance](/learning-paths/servers-and-cloud-computing/intro/) with 32 cores, 8GB of RAM, and 32GB of disk space, running Ubuntu.
 - Basic understanding of Python and ML concepts.
 - Understanding of Whisper ASR Model fundamentals.
 
 author: Nobel Chowdary Mandepudi
 
 ### Tags
-skilllevels: Intermediate
+skilllevels: Introductory
 armips:
 - Neoverse
 subjects: ML
@@ -30,7 +29,15 @@ operatingsystems:
 tools_software_languages:
 - Python
 - Whisper
-- AWS Graviton
+cloud_service_providers: AWS
+
+further_reading:
+    - resource:
+        title: Hugging Face Transformers documentation
+        link: https://huggingface.co/transformers/v4.11.3/index.html
+        type: documentation
 
### FIXED, DO NOT MODIFY
3643
# ================================================================================

content/learning-paths/servers-and-cloud-computing/whisper/whisper.md

Lines changed: 22 additions & 23 deletions
@@ -10,19 +10,23 @@ layout: "learningpathall"
 
 ## Before you begin
 
-This Learning Path demonstrates how to run the whisper-large-v3-turbo model as an application that takes the audio input and computes out the text transcript of it. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 24.04 LTS. You need an Arm server instance with 32 cores, atleast 8GB of RAM and 32GB disk to run this example. The instructions have been tested on a AWS c8g.8xlarge instance.
+This Learning Path demonstrates how to run the [whisper-large-v3-turbo model](https://huggingface.co/openai/whisper-large-v3-turbo) as an application that takes an audio input and computes the text transcript of it. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 24.04 LTS. You need an Arm server instance with 32 cores, at least 8GB of RAM, and 32GB of disk space to run this example. The instructions have been tested on an AWS Graviton4 `c8g.8xlarge` instance.
 
 ## Overview
 
-OpenAI Whisper is an open-source Automatic Speech Recognition (ASR) model trained on the multilingual and multitask data, which enables the transcript generation in multiple languages and translations from different languages to English. We will explore the foundational aspects of speech-to-text transcription applications, specifically focusing on running OpenAI’s Whisper on an Arm CPU. We will discuss the implementation and performance considerations required to efficiently deploy Whisper using Hugging Face Transformers framework.
+OpenAI Whisper is an open-source Automatic Speech Recognition (ASR) model trained on multilingual and multitask data, which enables transcript generation in multiple languages and translation from different languages into English. You will learn about the foundational aspects of speech-to-text transcription applications, specifically focusing on running OpenAI’s Whisper on an Arm CPU. You will then explore the implementation and performance considerations required to deploy Whisper efficiently using the Hugging Face Transformers framework.
+
+### Speech-to-text ML applications
+
+Speech-to-text (STT) transcription applications transform spoken language into written text, enabling voice-driven interfaces, accessibility tools, and real-time communication services. Audio is first cleaned and converted into a format suitable for processing, then passed through a deep learning model trained to recognize speech patterns. Advanced language models help refine the output, improving accuracy by predicting likely word sequences based on context. Whether running on cloud servers or on local devices, STT applications must balance accuracy, latency, and computational efficiency to meet the needs of diverse use cases.
 
 ## Install dependencies
 
 Install the following packages on your Arm-based server instance:
 
 ```bash
 sudo apt update
-sudo apt install python3-pip python3-venv ffmpeg -y
+sudo apt install python3-pip python3-venv ffmpeg wget -y
 ```
 
 ## Install Python Dependencies
@@ -47,21 +51,18 @@ pip install torch transformers accelerate
 
 ## Download the sample audio file
 
-Download a sample audio file, which is about 33sec audio in .wav format or use your own audio file:
+Download a sample audio file, which is about 33 seconds of audio in .wav format. You can use any .wav sound file if you'd like to try other examples.
+
 ```bash
 wget https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
 ```
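Before transcribing, you can sanity-check a `.wav` file's format with Python's built-in `wave` module. This is a small sketch, not part of the Learning Path's script: the `wav_info` helper name is our own, and the demo inspects a generated one-second silent file rather than the downloaded sample (whose `_8k` suffix suggests an 8 kHz sample rate):

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getnframes() / w.getframerate()

# Demo: write a one-second silent mono file at 8 kHz, then inspect it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)

print(wav_info("demo.wav"))    # (8000, 1, 1.0)
```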
 
 ## Create a python script for audio to text transcription
 
-Create a python file:
+You will use the Hugging Face `transformers` framework to help process the audio. It contains classes that configure the model and prepare it for inference. `pipeline` is an end-to-end function for NLP tasks. In the code below, it is configured to run the pre- and post-processing of the sample in this example, as well as the actual inference.
 
-```bash
-vim whisper-application.py
-```
+Using a file editor of your choice, create a Python file named `whisper-application.py` with the content shown below:
 
-Write the following code in the `whisper-application.py` file:
-```python
+```python { file_name="whisper-application.py" }
 import torch
 from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
 import time
@@ -115,20 +116,18 @@ seconds = (duration - ((hours * 3600) + (minutes * 60)))
 msg = f'\nInferencing elapsed time: {seconds:4.2f} seconds\n'
 
 print(msg)
-
 ```
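The elapsed-time arithmetic at the end of the script splits a duration in seconds into whole hours, whole minutes, and the remaining seconds. The same computation can be sketched as a small self-contained helper (the `split_elapsed` name is our own, not from the script):

```python
def split_elapsed(duration: float) -> tuple[int, int, float]:
    """Split a duration in seconds into whole hours, whole minutes,
    and the remaining seconds, mirroring the arithmetic shown above."""
    hours = int(duration // 3600)
    minutes = int((duration - hours * 3600) // 60)
    seconds = duration - ((hours * 3600) + (minutes * 60))
    return hours, minutes, seconds

# Example: 3725.5 seconds is 1 hour, 2 minutes, and 5.5 seconds.
print(split_elapsed(3725.5))   # (1, 2, 5.5)
```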
 
-## Use the Arm specific flags:
-
-Use the following flags to enable fast math GEMM kernels, Linux Transparent Huge Page (THP) allocations, logs to confirm kernel and set LRU cache capacity and OMP_NUM_THREADS to run the Whisper efficiently on Arm machines.
+Enable verbose mode for the output and run the script:
 
 ```bash
-export DNNL_DEFAULT_FPMATH_MODE=BF16
-export THP_MEM_ALLOC_ENABLE=1
-export LRU_CACHE_CAPACITY=1024
-export OMP_NUM_THREADS=32
-export DNNL_VERBOSE=1
+export DNNL_VERBOSE=1
+python3 whisper-application.py
 ```
-{{% notice Note %}}
-BF16 support is merged into PyTorch versions greater than 2.3.0.
-{{% /notice %}}
+
+You should see output similar to the image below, with a log output, the transcript of the audio, and the `Inference elapsed time`:
+
+![frontend](whisper_output_no_flags.png)
+
+You've now run the Whisper model successfully on your Arm-based CPU. Continue to the next section to configure flags that can increase the performance of your running model.

content/learning-paths/servers-and-cloud-computing/whisper/whisper_deploy.md

Lines changed: 24 additions & 8 deletions
@@ -5,24 +5,40 @@ weight: 4
 layout: learningpathall
 ---
 
+## Setting environment variables that impact performance
+
+Speech-to-text applications often process large amounts of audio data in real time, requiring efficient computation to balance accuracy and speed. Low-level implementations of the kernels in the neural network enhance performance by reducing processing overhead. When tailored for specific hardware architectures, such as Arm CPUs, these kernels accelerate key tasks like feature extraction and neural network inference. Optimized kernels ensure that speech models like OpenAI’s Whisper can run efficiently, making high-quality transcription more accessible across various server applications.
+
+Other settings help use memory more efficiently: allocating additional memory or threads to a task can increase performance. By enabling these hardware-aware options, applications achieve lower latency, reduced power consumption, and smoother real-time transcription.
+
+Use the following environment variables to enable fast math BFloat16 (BF16) GEMM kernels and Linux Transparent Huge Page (THP) allocations, and to set the LRU cache capacity and `OMP_NUM_THREADS`, so that Whisper runs efficiently on Arm machines:
+
+```bash
+export DNNL_DEFAULT_FPMATH_MODE=BF16
+export THP_MEM_ALLOC_ENABLE=1
+export LRU_CACHE_CAPACITY=1024
+export OMP_NUM_THREADS=32
+```
+
+{{% notice Note %}}
+BF16 support is merged into PyTorch versions greater than 2.3.0.
+{{% /notice %}}
+
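If you prefer to keep this configuration inside the script itself, the same variables can be set from Python. This is a sketch under one assumption: the variables must be set before `torch` (and oneDNN underneath it) is imported, since they are read at library load time, so these lines belong at the very top of the file:

```python
import os

# Performance-related environment variables from this Learning Path.
# Set them before importing torch so oneDNN and OpenMP pick them up.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"
os.environ["OMP_NUM_THREADS"] = "32"

print(os.environ["OMP_NUM_THREADS"])   # 32
```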
 ## Run Whisper File
-After installing the dependencies and enabling the Arm specific flags in the previous step, now lets run the Whisper model and analyze it.
+After setting the environment variables in the previous step, you can now run the Whisper model again and analyze the performance impact.
 
 Run the `whisper-application.py` file:
 
 ```bash
 python3 whisper-application.py
 ```

-## Output
+## Analyze output
 
-You should see output similar to the image below with the log since we enabled verbose, transcript of the audio and the audio transcription time:
-![frontend](whisper_output.png)
+You should now observe that the processing time has gone down compared to the last run:
 
-## Analyze
+![frontend](whisper_output.png)
 
 The output in the above image has the log containing `attr-fpmath:bf16`, which confirms that fast math BF16 kernels are used in the compute process to improve the performance.
 
-It also generated the text transcript of the audio and the `Inference elapsed time`.
-
-By enabling the Arm specific flags as described in the learning path you can see the performance upliftment with the Whisper using Hugging Face Transformers framework on Arm.
+By enabling the environment variables as described in this Learning Path, you can see the performance uplift when running Whisper with the Hugging Face Transformers framework on Arm.
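To spot the `attr-fpmath:bf16` marker in a long verbose log without scanning it by eye, you can filter the output with `grep`. The sketch below is illustrative only: the sample log line is a stand-in (the exact field layout of oneDNN verbose output varies by version), and in practice you would pipe the real script output instead, for example `python3 whisper-application.py 2>&1 | grep attr-fpmath`:

```shell
# Stand-in for one line of real DNNL_VERBOSE output (field layout illustrative).
sample_log='onednn_verbose,exec,cpu,matmul,...,attr-fpmath:bf16,...'

# Extract just the fast-math marker, as you would from the real log.
echo "$sample_log" | grep -o 'attr-fpmath:bf16'
```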