Commit 7a80f23

Merge pull request #2425 from madeline-underwood/squeeze
Squeeze_JA to sign off
2 parents 7a24f8d + 7d5c8fa commit 7a80f23

6 files changed: +151 -90 lines changed

content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md

Lines changed: 7 additions & 11 deletions
@@ -1,23 +1,19 @@
 ---
 title: Deploy SqueezeNet 1.0 INT8 model with ONNX Runtime on Azure Cobalt 100

-draft: true
-cascade:
-draft: true
-
+
 minutes_to_complete: 60

-who_is_this_for: This Learning Path introduces ONNX deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers deploying ONNX-based applications on Arm-based machines.
+who_is_this_for: This Learning Path is for developers deploying ONNX-based applications on Arm-based machines.

 learning_objectives:
-- Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image.
-- Deploy ONNX on the Ubuntu Pro virtual machine.
-- Perform ONNX baseline testing and benchmarking on Arm64 virtual machines.
+- Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image
+- Perform ONNX baseline testing and benchmarking on Arm64 virtual machines

 prerequisites:
-- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
-- Basic understanding of Python and machine learning concepts.
-- Familiarity with [ONNX Runtime](https://onnxruntime.ai/docs/) and Azure cloud services.
+- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
+- Basic understanding of Python and machine learning concepts
+- Familiarity with [ONNX Runtime](https://onnxruntime.ai/docs/) and Azure cloud services

 author: Pareena Verma

content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md

Lines changed: 25 additions & 6 deletions
@@ -8,14 +8,33 @@ layout: "learningpathall"
 
 ## Azure Cobalt 100 Arm-based processor

-Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.

-To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor, the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, it is a 64-bit CPU that delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads.
+
+You can use Cobalt 100 for:
+
+- Web and application servers
+- Data analytics
+- Open-source databases
+- Caching systems
+- Many other scale-out workloads
+
+Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance. You can learn more about Cobalt 100 in the Microsoft blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

 ## ONNX
-ONNX (Open Neural Network Exchange) is an open-source format designed for representing machine learning models.
-It provides interoperability between different deep learning frameworks, enabling models trained in one framework (such as PyTorch or TensorFlow) to be deployed and run in another.

-ONNX models are serialized into a standardized format that can be executed by the ONNX Runtime, a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators. This separation of model training and inference allows developers to build flexible, portable, and production-ready AI workflows.
+ONNX (Open Neural Network Exchange) is an open-source format designed for representing machine learning models.
+
+You can use ONNX to:
+
+- Move models between different deep learning frameworks, such as PyTorch and TensorFlow
+- Deploy models trained in one framework to run in another
+- Build flexible, portable, and production-ready AI workflows
+
+ONNX models are serialized into a standardized format that you can execute with ONNX Runtime - a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators. This separation of model training and inference lets you deploy models efficiently across cloud, edge, and mobile environments.
+
+To learn more, see the [ONNX official website](https://onnx.ai/) and the [ONNX Runtime documentation](https://onnxruntime.ai/docs/).
+
+## Next steps for ONNX on Azure Cobalt 100

-ONNX is widely used in cloud, edge, and mobile environments to deliver efficient and scalable inference for deep learning models. Learn more from the [ONNX official website](https://onnx.ai/) and the [ONNX Runtime documentation](https://onnxruntime.ai/docs/).
+Now that you understand the basics of Azure Cobalt 100 and ONNX Runtime, you are ready to deploy and benchmark ONNX models on Arm-based Azure virtual machines. This Learning Path will guide you step by step through setting up an Azure Cobalt 100 VM, installing ONNX Runtime, and running machine learning inference on Arm64 infrastructure.
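The framework interoperability described in the new text is easiest to see in code. Here is a minimal sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime; it assumes the `torch`, `torchvision`, `onnxruntime`, and `numpy` packages are installed, and it illustrates the general workflow rather than a step from the Learning Path itself.

```python
import numpy as np
import onnxruntime as ort
import torch
import torchvision

# Build (or load) a model in one framework: SqueezeNet from torchvision.
model = torchvision.models.squeezenet1_0(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Serialize the model to the ONNX format.
torch.onnx.export(model, dummy_input, "squeezenet.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with ONNX Runtime on the CPU execution provider.
session = ort.InferenceSession("squeezenet.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)  # (1, 1000) class scores
```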

content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md

Lines changed: 12 additions & 7 deletions
@@ -36,19 +36,24 @@ python3 baseline.py
 You should see output similar to:
 ```output
 Inference time: 0.0026061534881591797
-```
-{{% notice Note %}}Inference time is the amount of time it takes for a trained machine learning model to make a prediction (i.e., produce output) after receiving input data.
-input tensor of shape (1, 3, 224, 224):
-- 1: batch size
-- 3: color channels (RGB)
-- 224 x 224: image resolution (common for models like SqueezeNet)
+{{% notice Note %}}
+Inference time is how long it takes for a trained machine learning model to make a prediction after it receives input data.
+
+The input tensor shape `(1, 3, 224, 224)` means:
+- `1`: One image is processed at a time (batch size)
+- `3`: Three color channels (red, green, blue)
+- `224 x 224`: Each image is 224 pixels wide and 224 pixels tall (standard for SqueezeNet)
 {{% /notice %}}

 This indicates the model successfully executed a single forward pass through the SqueezeNet INT8 ONNX model and returned results.

-#### Output summary:
+## Output summary:

 Single inference latency(0.00260 sec): This is the time required for the model to process one input image and produce an output. The first run includes graph loading, memory allocation, and model initialization overhead.
 Subsequent inferences are usually faster due to caching and optimized execution.

 This demonstrates that the setup is fully working, and ONNX Runtime efficiently executes quantized models on Arm64.
+
+Great job! You've completed your first ONNX Runtime inference on Arm-based Azure infrastructure. This baseline test confirms your environment is set up correctly and ready for more advanced benchmarking.
+
+Next, you'll use a dedicated benchmarking tool to capture more detailed performance statistics and further optimize your deployment.
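The hunk above refers to `baseline.py` without showing its contents. As a rough sketch of what a minimal timing script along these lines could look like (an illustration only, not the actual file from the Learning Path), assuming `squeezenet-int8.onnx` sits in the working directory and accepts float32 input of shape (1, 3, 224, 224):

```python
import time

import numpy as np
import onnxruntime as ort

# Load the quantized SqueezeNet model with the CPU execution provider.
session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])

# Build a random input matching the expected (batch, channels, height, width) shape.
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Time a single forward pass; the first run includes session warm-up overhead.
start = time.time()
session.run(None, {input_name: x})
print("Inference time:", time.time() - start)
```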

content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md

Lines changed: 45 additions & 29 deletions
@@ -1,19 +1,25 @@
 ---
-title: Benchmarking via onnxruntime_perf_test
+title: Benchmark ONNX runtime performance with onnxruntime_perf_test
 weight: 6

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-Now that you have validated ONNX Runtime with Python-based timing (e.g., SqueezeNet baseline test), you can move to using a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models, allowing you to capture more detailed statistics than simple Python timing.
-This helps evaluate the ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances and other x86_64 instances. architectures.
+## Benchmark ONNX model inference on Azure Cobalt 100
+Now that you have validated ONNX Runtime with Python-based timing (for example, the SqueezeNet baseline test), you can move to using a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models, allowing you to capture more detailed statistics than simple Python timing.
+
+This approach helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances and compare results with other architectures if needed.
+
+You are ready to run benchmarks, which is a key skill for optimizing real-world deployments.
+

 ## Run the performance tests using onnxruntime_perf_test
-The `onnxruntime_perf_test` is a performance benchmarking tool included in the ONNX Runtime source code. It is used to measure the inference performance of ONNX models and supports multiple execution providers (like CPU, GPU, or other execution providers). on Arm64 VMs, CPU execution is the focus.
+The `onnxruntime_perf_test` tool is included in the ONNX Runtime source code. You can use it to measure the inference performance of ONNX models and compare different execution providers (such as CPU or GPU). On Arm64 VMs, CPU execution is the focus.

-### Install Required Build Tools
-Before building or running `onnxruntime_perf_test`, you will need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and handling model serialization via Protocol Buffers.
+
+## Install required build tools
+Before building or running `onnxruntime_perf_test`, you need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and handling model serialization via Protocol Buffers.

 ```console
 sudo apt update
@@ -29,35 +35,48 @@ You should see output similar to:
 ```output
 libprotoc 3.21.12
 ```
-### Build ONNX Runtime from Source:
+## Build ONNX Runtime from source

-The benchmarking tool `onnxruntime_perf_test`, isn’t available as a pre-built binary for any platform. So, you will have to build it from the source, which is expected to take around 40 minutes.
+The benchmarking tool `onnxruntime_perf_test` isn’t available as a pre-built binary for any platform, so you will need to build it from source. This process can take up to 40 minutes.

-Clone onnxruntime repo:
+Clone the ONNX Runtime repository:
 ```console
 git clone --recursive https://github.com/microsoft/onnxruntime
 cd onnxruntime
 ```
+
 Now, build the benchmark tool:

 ```console
 ./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
 ```
-You should see the executable at:
+If the build completes successfully, you should see the executable at:
 ```output
 ./build/Linux/Release/onnxruntime_perf_test
 ```

-### Run the benchmark
+
+## Run the benchmark
 Now that you have built the benchmarking tool, you can run inference benchmarks on the SqueezeNet INT8 model:

 ```console
 ./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
 ```
+
 Breakdown of the flags:
--e cpu → Use the CPU execution provider.
--r 100 → Run 100 inference passes for statistical reliability.
--m times → Run in “repeat N times” mode. Useful for latency-focused measurement.
+
+- `-e cpu`: use the CPU execution provider.
+- `-r 100`: run 100 inference passes for statistical reliability.
+- `-m times`: run in “repeat N times” mode for latency-focused measurement.
+- `-s`: print summary statistics after the run.
+- `-Z`: disable memory arena for more consistent timing.
+- `-I ../squeezenet-int8.onnx`: path to your ONNX model file.
+
+You should see output with latency and throughput statistics. If you encounter build errors, check that you have enough memory (at least 8 GB recommended) and all dependencies are installed. For missing dependencies, review the installation steps above.
+
+If the benchmark runs successfully, you are ready to analyze and optimize your ONNX model performance on Arm-based Azure infrastructure.
+
+Well done! You have completed a full benchmarking workflow. Continue to the next section to explore further optimizations or advanced deployment scenarios.
 -s → Show detailed per-run statistics (latency distribution).
 -Z → Disable intra-op thread spinning. Reduces CPU waste when idle between runs, especially on high-core systems like Cobalt 100.
 -I → Input the ONNX model path directly, skipping pre-generated test data.
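The flags above configure `onnxruntime_perf_test` from the command line. If you want to experiment with similar knobs from Python instead, ONNX Runtime exposes session-level options. The sketch below is an illustration rather than a one-to-one mapping of the CLI flags: it disables the CPU memory arena, fixes the intra-op thread count, and repeats the inference 100 times, mirroring the `-r 100` run count.

```python
import time

import numpy as np
import onnxruntime as ort

# Session-level tuning knobs, roughly analogous to perf_test options.
opts = ort.SessionOptions()
opts.enable_cpu_mem_arena = False   # disable the CPU memory arena
opts.intra_op_num_threads = 4       # pin the intra-op thread pool size

session = ort.InferenceSession("squeezenet-int8.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Repeat the inference 100 times and collect per-run latencies.
latencies = []
for _ in range(100):
    start = time.perf_counter()
    session.run(None, {input_name: x})
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {sum(latencies) / len(latencies):.6f} s")
```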
@@ -86,17 +105,17 @@ P95 Latency: 0.00187393 s
 P99 Latency: 0.00190312 s
 P999 Latency: 0.00190312 s
 ```
-### Benchmark Metrics Explained
+## Benchmark Metrics Explained

-* Average Inference Time: The mean time taken to process a single inference request across all runs. Lower values indicate faster model execution.
-* Throughput: The number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently.
-* CPU Utilization: The percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking.
-* Peak Memory Usage: The maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments.
-* P50 Latency (Median Latency): The time below which 50% of inference requests complete. Represents typical latency under normal load.
-* Latency Consistency: Describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter.
+* Average inference time: the mean time taken to process a single inference request across all runs. Lower values indicate faster model execution.
+* Throughput: the number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently.
+* CPU utilization: the percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking.
+* Peak Memory Usage: the maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments.
+* P50 Latency (Median Latency): the time below which 50% of inference requests complete. Represents typical latency under normal load.
+* Latency Consistency: describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter.

-### Benchmark summary on Arm64:
-Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.
+## Benchmark summary on Arm64:
+Here is a summary of benchmark results collected on an Arm64 D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine.

 | **Metric** | **Value** |
 |----------------------------|-------------------------------|
@@ -113,12 +132,9 @@ Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pr
 | **Latency Consistency** | Consistent |


-### Highlights from Benchmarking on Azure Cobalt 100 Arm64 VMs
+## Highlights from Benchmarking on Azure Cobalt 100 Arm64 VMs
+

-The results on Arm64 virtual machines demonstrate:
-- Low-Latency Inference: Achieved consistent average inference times of ~1.86 ms on Arm64.
-- Strong and Stable Throughput: Sustained throughput of over 538 inferences/sec using the `squeezenet-int8.onnx` model on D4ps_v6 instances.
-- Lightweight Resource Footprint: Peak memory usage stayed below 37 MB, with CPU utilization around 96%, ideal for efficient edge or cloud inference.
-- Consistent Performance: P50, P95, and Max latency remained tightly bound, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure.
+These results on Arm64 virtual machines demonstrate low-latency inference, with consistent average inference times of approximately 1.86 ms. Throughput remains strong and stable, sustaining over 538 inferences per second using the `squeezenet-int8.onnx` model on D4ps_v6 instances. The resource footprint is lightweight, as peak memory usage stays below 37 MB and CPU utilization is around 96%, making this setup ideal for efficient edge or cloud inference. Performance is also consistent, with P50, P95, and maximum latency values tightly grouped, showcasing reliable results on Azure Cobalt 100 Arm-based infrastructure.

 You have now successfully benchmarked inference time of ONNX models on an Azure Cobalt 100 Arm64 virtual machine.
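The metric definitions and the summary table above correspond to simple statistics over the per-run latencies. As a quick illustration, assuming `latencies` holds per-inference times in seconds (for example, collected with a loop like the earlier Python sketch), you could compute comparable figures with NumPy:

```python
import numpy as np

# Example latencies in seconds; substitute the values collected from your own runs.
latencies = np.array([0.00185, 0.00186, 0.00184, 0.00187, 0.00190])

average = latencies.mean()    # average inference time across all runs
throughput = 1.0 / average    # inferences per second for sequential, single-stream runs
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

print(f"Average Inference Time: {average:.6f} s")
print(f"Throughput: {throughput:.2f} inferences/s")
print(f"P50 Latency: {p50:.6f} s")
print(f"P95 Latency: {p95:.6f} s")
print(f"P99 Latency: {p99:.6f} s")
```

Throughput here is the reciprocal of the mean latency, which matches sequential single-stream execution; concurrent or batched runs would need a different calculation.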
