This example demonstrates how you can use an ONNX model to transcribe audio. Post-processing by a Python program converts the tokens that the model outputs into words.
NOTE: Use this example with SAS Event Stream Processing 2025.04 and later.
For more information about how to install and use example projects, see Using the Examples.
Whisper is an open-weights model released in 2022 by OpenAI. Whisper transcribes accurately, even with background noise or multiple speakers. It has an extremely large vocabulary, even correctly transcribing proper nouns like "GitLab", file names like "requirements.txt", and acronyms like "ESP".
The Whisper model can be converted to ONNX and run in SAS Event Stream Processing with minimal Python window code, avoiding the need to install additional packages.
The example uses the following files:
- `rik_customwindow_16k.wav` is the audio file used as the input for this example. When you have installed the example, you can access this audio in the Project Package pane in SAS Event Stream Processing Studio. The file is located in the `test_files` folder. You can replace this file with your own WAV file, but it must have a sample rate of 16K.
- `model.xml` is the project associated with this example.
- `whisper_sm_int8_cpu.onnx` (within the project package ZIP file) is an open-source ONNX model for voice transcription.
- `bpe_decoder.py` contains the mapping of tokens to words.
- `warmup-whisper.safetensors` contains values used for model warm-up.
- `savetensor-whisper.py` contains Python code to generate the warm-up tensors file.
For more information about the source of the above files, see Source of the Model and Video Files.
The example is configured to use the CPU as the execution provider, but could be configured to use CUDA (Compute Unified Device Architecture) as the execution provider. For more information, see Specifying Execution Providers.
This example is configured for batch transcription of WAV files. For live transcription from a microphone, you must use an ESP server that is running on an edge server and configure the audio connector.
The following figure shows the diagram of the project:
- w_audio is a Source window. This is where individual amplitude values from a WAV file or microphone are streamed in.
- w_preproc is a Python window. The Python code converts individual amplitude values into a tensor suitable for ONNX runtime.
- w_reader is a Model Reader window. This window reads the ONNX model and passes it to the w_score window.
- w_score is a Score window. This window executes the ONNX model when data passes through the window. The events from the w_reader window are pre-processed and converted to a tensor format. A tensor is an n-dimensional array that contains the same type of data. The output from the w_score window is another tensor.
- w_postproc is a Python window. The Python code converts the model output (tensor format) into numpy format, and then converts the tokens to text using a BPE (byte-pair encoding) decoder.
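To make the tensor terminology concrete, here is a minimal sketch in NumPy (not part of the project code) showing what "an n-dimensional array that contains the same type of data" means, and how a leading batch dimension is added, which is the layout the ONNX runtime expects:

```python
import numpy as np

# A tensor is an n-dimensional array whose elements share one data type.
# Rank 1: a short window of audio amplitudes.
audio = np.array([0.0, 0.25, -0.5, 0.125], dtype=np.float32)
print(audio.ndim, audio.shape)      # rank and shape of the 1-D tensor

# Rank 2: the same audio with a leading batch dimension, the layout
# that ONNX models such as Whisper expect for their inputs.
batched = np.expand_dims(audio, axis=0)
print(batched.ndim, batched.shape)  # now 2-D, shape (1, 4)

# Every element has the same dtype; a tensor cannot mix element types.
print(batched.dtype)
```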
Explore the settings for this window:
- Click the w_audio window.
- In the right pane, expand State and Event Type.
The window is stateless and the index type is `pi_EMPTY`. This means the window acts as a pass-through for all incoming events. The `pi_EMPTY` index does not store events.
- To examine the window's output schema, on the right toolbar, click the schema button. Observe the following fields:
  - `id`: Each event that the window receives is assigned an ID, and this field holds that ID. This is a key field.
  - `audio`: This field holds the audio amplitude values.
- Click the properties button.
- Expand Input Data (Publisher) Connectors. Observe that there are two connectors, and only one is selected. The `python` connector reads batch WAV files. The `audio` connector, which is disabled in this example, streams data from the microphone. This requires an ESP server that runs on an edge device.
- Double-click the `python` connector. The Connector Configuration window appears. The Code field shows that the window runs the `publish_fun_ac` generator, which is defined in `files/esp_whisper_funs.py`.
- Click Cancel to close the Connector Configuration window.
You can replace the WAV file with your own audio file.
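Before substituting your own recording, it can be useful to confirm that the file meets the example's 16K sample-rate requirement. The following sketch uses only the Python standard library; it generates a one-second 16 kHz tone so that it is self-contained, whereas in practice you would open your own WAV file:

```python
import math
import os
import struct
import tempfile
import wave

# Write a short 16 kHz mono WAV file (one second of a 440 Hz tone) so the
# sketch is self-contained; with a real recording, skip this step.
path = os.path.join(tempfile.mkdtemp(), "tone_16k.wav")
rate = 16000
samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / rate))
           for n in range(rate)]
with wave.open(path, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit PCM amplitudes
    w.setframerate(rate)   # the 16K sample rate the example requires
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Check the properties that matter before publishing the file to w_audio.
with wave.open(path, "rb") as w:
    framerate = w.getframerate()
    nframes = w.getnframes()
print(framerate, nframes)   # expect 16000 16000
```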
Explore the settings for this window:
- Click the w_preproc window.
- In the right pane, expand Python Settings.
- Scroll down to the Code source field and click View file.
This window calls the `preproc` function from the `files/esp_whisper_funs.py` module. Here is the relevant function:
```python
import numpy as np  # required by this module

blockid = 0

def preproc(data, foresp=True):
    """
    Processes the audio data, scales it, and converts it to the tensor format.
    When foresp is False, assume data is coming as a numpy array.
    When foresp is True, assume data came as a dictionary containing a list
    (the event block) and convert that to a numpy array.
    """
    global blockid
    if foresp:
        aud = np.array([d["audio"] for d in data])
    else:
        aud = data
    audio = aud / (2**15)
    audio = np.expand_dims(audio, axis=0)
    # get_model_inputs is defined elsewhere in this module
    esp_in = dict(id=blockid) | get_model_inputs(audio, foresp=foresp)
    blockid += 1
    return esp_in
```

This function converts the data to numpy, scales it, and calls another function to add the ONNX parameters required by the Whisper model. The `foresp` keyword allows this function to be used in regular Python by disabling calls to functions in `esp_utils`, which convert SAS Event Stream Processing blobs to ONNX tensors. These functions are not needed outside SAS Event Stream Processing.
- Click Close.
- To examine the window's output schema, on the right toolbar, click the schema button. Observe the following fields, which can be read from the model file directly using the popular open-source tool Netron, or in Python after importing the model with ONNX Runtime:
  - `audio_pcm`: The audio data as a tensor blob.
  - There are six parameter fields: `max_length`, `min_length`, `num_beams`, `num_return_sequences`, `length_penalty`, and `repetition_penalty`. These parameters use the default values supplied when the model was exported with Olive. For more information, see Source of the Model and Video Files below.
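As a rough illustration of how these six fields travel alongside `audio_pcm`, the following sketch builds a dictionary of small parameter tensors. The values and dtypes below are illustrative placeholders, not the actual defaults that Olive wrote into the exported model:

```python
import numpy as np

# Placeholder values only -- the project uses the defaults supplied when
# the model was exported with Olive, which may differ from these.
params = {
    "max_length":           np.array([200], dtype=np.int32),
    "min_length":           np.array([0],   dtype=np.int32),
    "num_beams":            np.array([2],   dtype=np.int32),
    "num_return_sequences": np.array([1],   dtype=np.int32),
    "length_penalty":       np.array([1.0], dtype=np.float32),
    "repetition_penalty":   np.array([1.0], dtype=np.float32),
}

# Each parameter is itself a tiny tensor of shape (1,), so it can pass
# through the same blob-based schema as the audio tensor.
print(sorted(params))
```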
Explore the settings for this window:
- Click the w_reader window.
- Click the properties button to go back to Properties.
- In the right pane, expand Settings.
- The `whisper_sm_int8_cpu.onnx` file is the model that is used to transcribe the audio. You could replace this model with another Whisper model like `whisper-large-v3-turbo`. However, changing the model would require you to also adjust other parts of the example project, including Python code. This is because the larger Whisper model has different BPE decoder mappings.
- The selected execution provider is CPU, which means that hardware acceleration is not used. If you have a GPU, you can switch this to CUDA for a substantial performance improvement.
- Expand Model Warm-up. The default warm-up tensor type is specified as Random tensor and the number of warm-up steps is set to 2. For more information, see [XML Language Elements Relevant to Model Warm-up](https://go.documentation.sas.com/doc/en/espcdc/default/espxmllang/n0tippzgzce4uzn1kin6j79h732v.htm).
- Custom tensors from the file `warmup-whisper.safetensors` are provided for this model because the model inputs are complex and cannot be randomly generated. The Python code to generate these warm-up tensors is provided in the file `savetensor-whisper.py`.
- The custom tensor contains a properly sized audio tensor and parameter tensors matching what would be generated by the w_preproc Python window.
Explore the settings for this window:
- Click the w_score window.
- In the right pane, expand Settings. The Model source field is set to Offline, which allows the Model type field to be set to ONNX. The ONNX model is not specified here: the model is selected in the w_reader window and passed to the w_score window.
- Click the schema button to display the window's schema. The schema differs from the schema of the w_audio window in that the schema for the w_score window also contains a field called `output`, with a field type of `blob`. The data for this field contains the scored events.
- Click the properties button.
- Expand Input Map. The input map specifies the properties of the data to be scored. That is, input data is mapped to the variables that the ONNX model is expecting.
- Expand Output Map. The output map specifies the properties of the data that have been scored. The `output` field specified in the schema appears here.
Explore the settings for this window:
- Click the w_postproc window.
- In the right pane, expand Python Settings. The Events function field is set to `postproc`. This function is called each time the window receives a new event.
- Scroll down to the Code source field and click View file. This window calls the `postproc` function from the `files/esp_whisper_funs.py` module. Here is the relevant function:
```python
import datetime  # required by this module

def postproc(data, context):
    event = {}
    # postproc_audio_tensor is defined elsewhere in this module
    raw_words = postproc_audio_tensor(data["generated_ids"])
    event["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    event["words"] = raw_words
    return event
```

- In the first line of the function, the `event` variable is initialized as an empty dictionary (dict).
- Next, the tokens are processed with `postproc_audio_tensor`, which calls the BPE decoder.
- A timestamp is created.
- Finally, the raw words are returned to SAS Event Stream Processing as a string.
- Click Close.
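To illustrate what the BPE decoding step in `bpe_decoder.py` amounts to, here is a toy sketch: token IDs are looked up in a vocabulary of subword strings and concatenated. The token IDs and vocabulary below are invented for the example; the real decoder ships the Whisper model's actual token map:

```python
# Invented vocabulary -- the real BPE decoder uses Whisper's token map,
# where leading spaces are encoded into the subword strings themselves.
TOKEN_MAP = {
    501: "Event",
    502: " Stream",
    503: " Process",
    504: "ing",
}

def decode(token_ids):
    """Look each token ID up in the vocabulary and join the pieces."""
    return "".join(TOKEN_MAP[t] for t in token_ids)

print(decode([501, 502, 503, 504]))  # Event Stream Processing
```

Note how "Processing" is assembled from two subword tokens: a large vocabulary of subwords is what lets Whisper spell out proper nouns and file names it has never seen as whole words.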
As discussed in the w_audio section, the w_audio window includes a publisher connector that is configured to read incoming events from an audio file.
When you enter test mode in SAS Event Stream Processing Studio, complete the following steps:
- In the left pane, select the check boxes for the windows whose events you want to examine. For example, select the w_postproc window to view the postprocessed output of the model.
- Select Configure and Run Test.
- In the Load and Start Project in Cluster window, click Edit deployment settings.
- In the Deployment Settings window, adjust the settings as shown in the following image. Make sure that you request 4 GB of memory, although the project might run with only 2 GB.
- Click OK to close the Deployment Settings window.
- Click OK to close the Load and Start Project in Cluster window.
The results for each window appear in separate tabs in test mode, as shown below. The transcription appears in the w_postproc tab.
You can upload different WAV files to transcribe. The names of the files to transcribe are specified in the Python file esp_whisper_funs.py. Scroll to the bottom, find the function publish_fun_ac, and edit the path that contains test_files/rik_customwindow_16k.wav. You can use * as a wildcard to match multiple files.
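A quick sketch of how such a wildcard pattern resolves, using Python's standard `glob` module (a temporary stand-in directory is created so the example is self-contained):

```python
import glob
import os
import tempfile

# Create a stand-in test_files directory with a few files.
folder = os.path.join(tempfile.mkdtemp(), "test_files")
os.makedirs(folder)
for name in ("clip1_16k.wav", "clip2_16k.wav", "notes.txt"):
    open(os.path.join(folder, name), "w").close()

# A pattern like the one in publish_fun_ac: * matches every WAV file,
# and files with other extensions are ignored.
matches = sorted(glob.glob(os.path.join(folder, "*.wav")))
print([os.path.basename(m) for m in matches])
# ['clip1_16k.wav', 'clip2_16k.wav']
```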
If you want to transcribe live audio, you need to deploy SAS Event Stream Processing as an edge server on a device with an attached microphone, such as an NVIDIA Jetson device. After setting that up, deactivate the Python connector and activate the audio connector to listen to the microphone.
If you require higher model accuracy, try using the larger model whisper-large-v3-turbo. You need to use a GPU for reasonable performance with this model. Edit the lines in ONNX Export Instructions to refer to whisper-large-v3-turbo and to use gpu_fp16 instead of cpu_int8. You also need to modify the bpe_decoder.py file to use the BPE encoding of the larger model. You can extract the token map from the model after you create it: before deleting the BPEDecoder node in the Python code described below, copy that node's parameters into the bpe_decoder.py function. The parameter names match exactly because that node is the original source of the parameters.
- Model Family: Whisper
- Model Name: whisper-sm.en
- File Name: whisper_sm_int8_cpu.onnx
- Provider: https://github.com/openai/whisper
- License: MIT
The ONNX file for the example is provided, but it can be re-created as follows. Install Olive (ONNX Live) by following its instructions. Next, see the Whisper example here. After cloning the Olive repo at that specific Git commit, navigate to the Whisper directory and run:
```shell
python prepare_whisper_configs.py --model_name openai/whisper-small --no_audio_decoder
olive run --config whisper_cpu_int8.json --setup
olive run --config whisper_cpu_int8.json
```
This downloads the Whisper model from Hugging Face and converts it to ONNX, using INT8 quantization optimized for the CPU.
Next, you need to remove the BPEDecoder op from the ONNX graph because it requires a custom library that is not shipped with SAS Event Stream Processing. Install the onnxconverter-common package and complete the following steps:
- Run `python -m onnxconverter_common.onnx2py whisper.onnx out_whisper.py`.
- Remove the last node, which is BPEDecoder. Edit the output name, output dims, and type. The output name is `generated_ids` with type `INT64`.
- You need to implement BPEDecoder yourself in Python later.
- Run `python out_whisper.py whisper_cpu_sm.onnx` to get the ONNX model for SAS Event Stream Processing.
At this point, you have the final ONNX model and can upload it to SAS Event Stream Processing Studio.
| File Name | Copyright | Notes |
|---|---|---|
| rik_customwindow_16k.wav | © 2025 SAS Institute Inc. All Rights Reserved. | To be used only in the context of this demo. |
The audio provided in this example is to be used only with the project provided. Using or altering this audio beyond the example for any other purpose is prohibited.

