weight: 6
layout: learningpathall
---

Now that you have validated ONNX Runtime with a Python-based timing baseline (for example, the SqueezeNet test), you can move on to a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models and captures more detailed statistics than simple Python timing.

This helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances as well as on comparable x86_64-based virtual machines.
## Run the performance tests using onnxruntime_perf_test

`onnxruntime_perf_test` is a performance benchmarking tool included in the ONNX Runtime source code. It measures the inference performance of ONNX models and supports multiple execution providers, such as CPU and GPU. On Arm64 VMs, CPU execution is the focus.

### Install Required Build Tools
Before building or running `onnxruntime_perf_test`, you will need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and handling model serialization via Protocol Buffers.
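
The exact package list may vary; as a minimal sketch (the package names below are an assumption for Ubuntu 24.04, not the canonical list for this Learning Path), a typical setup looks like:

```console
# Illustrative prerequisites for building ONNX Runtime from source on Ubuntu.
# Adjust package names and versions to match your environment.
sudo apt-get update
sudo apt-get install -y build-essential cmake git python3 python3-pip python3-dev \
  libprotobuf-dev protobuf-compiler
```
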
The benchmarking tool `onnxruntime_perf_test` isn't available as a pre-built binary for any platform, so you have to build it from source, which is expected to take around 40 minutes.
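
As a rough sketch of the build step (the repository URL and flags below are the upstream ONNX Runtime defaults, shown as an assumption rather than the exact commands used in this Learning Path):

```console
# Illustrative only: clone ONNX Runtime and build the Release configuration,
# which also produces the onnxruntime_perf_test binary.
git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config Release --parallel
```
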
This will build the benchmark tool inside `./build/Linux/Release/`.

You should see the executable at:

```output
./build/Linux/Release/onnxruntime_perf_test
```
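
As a quick sanity check that the build succeeded, you can confirm the binary exists and is executable:

```console
# Verify that the benchmark binary was produced by the build
ls -lh ./build/Linux/Release/onnxruntime_perf_test
```
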
### Run the benchmark

Now that you have built the benchmarking tool, you can run inference benchmarks on the SqueezeNet INT8 model:

```console
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
```

Breakdown of the flags:

- `-e cpu`: use the CPU execution provider.
- `-r 100`: run 100 inference passes for statistical reliability.
- `-m times`: run in "repeat N times" mode, which is useful for latency-focused measurement.
- `-s`: show detailed per-run statistics (latency distribution).
- `-Z`: disable intra-op thread spinning, which reduces CPU waste when the tool is idle between runs, especially on high-core-count systems like Cobalt 100.
- `-I`: pass the ONNX model path directly, skipping pre-generated input/output test data.

You should see output similar to:

```output
Disabling intra-op thread spinning between runs
...
P999 Latency: 0.00190312 s
```
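
Optionally, you can rerun the benchmark with an explicit intra-op thread count to see how threading affects latency on Cobalt 100. The `-x` flag below comes from the tool's help output; treat the exact spelling as an assumption and verify it against your build:

```console
# Hedged example: pin intra-op parallelism to the 4 vCPUs of a D4ps_v6 instance
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -x 4 -I ../squeezenet-int8.onnx
```
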
### Benchmark Metrics Explained

- **Average Inference Time**: the mean time taken to process a single inference request across all runs. Lower values indicate faster model execution.
- **Throughput**: the number of inference requests processed per second. Higher throughput reflects the model's ability to handle larger workloads efficiently.
- **CPU Utilization**: the percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking.
- **Peak Memory Usage**: the maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments.
- **P50 Latency (Median Latency)**: the time below which 50% of inference requests complete; this represents typical latency under normal load.
- **Latency Consistency**: the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter.
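
The latency statistics come directly from `onnxruntime_perf_test`. If you also want to capture peak memory and CPU utilization, one option is to wrap the run in GNU time; this is a hedged sketch, not necessarily how the figures in the table below were collected:

```console
# /usr/bin/time -v reports "Maximum resident set size" (peak memory)
# and "Percent of CPU this job got" alongside the tool's latency statistics.
/usr/bin/time -v ./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
```
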

### Benchmark summary on Arm64

Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.
|**Latency Consistency**| Consistent |

### Highlights from Benchmarking on Azure Cobalt 100 Arm64 VMs

The results on Arm64 virtual machines demonstrate:

- **Low-Latency Inference:** achieved consistent average inference times of ~1.86 ms on Arm64.
- **Strong and Stable Throughput:** sustained throughput of over 538 inferences/sec using the `squeezenet-int8.onnx` model on D4ps_v6 instances.
- **Lightweight Resource Footprint:** peak memory usage stayed below 37 MB, with CPU utilization around 96%, ideal for efficient edge or cloud inference.
- **Consistent Performance:** P50, P95, and Max latency remained tightly bound, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure.

You have now benchmarked ONNX on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.