Commit 8e58ba0

TorchBench LP
1 parent 8387689 commit 8e58ba0

3 files changed

+224
-0
lines changed

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
---
2+
title: Accelerate and measure PyTorch Inference on Arm servers
3+
4+
minutes_to_complete: 20
5+
6+
who_is_this_for: This is an introductory topic for software developers who want to learn how to measure and accelerate the performance of Natural Language Processing (NLP), vision and recommender PyTorch models on Arm-based servers.
7+
8+
learning_objectives:
9+
- Download and install the PyTorch Benchmarks suite.
10+
- Evaluate the performance of PyTorch model inference running on your Arm-based server using the PyTorch Benchmarks suite.
11+
- Measure the performance of these models using eager and torch.compile modes in PyTorch.
12+
13+
prerequisites:
14+
- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
15+
16+
author_primary: Pareena Verma
17+
18+
### Tags
19+
skilllevels: Introductory
20+
subjects: ML
21+
armips:
22+
- Neoverse
23+
operatingsystems:
24+
- Linux
25+
tools_software_languages:
26+
- Python
27+
- PyTorch
28+
29+
### FIXED, DO NOT MODIFY
30+
# ================================================================================
31+
weight: 1 # _index.md always has weight of 1 to order correctly
32+
layout: "learningpathall" # All files under learning paths have this same wrapper
33+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
34+
---
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
---
2+
next_step_guidance: >
3+
Thank you for completing this Learning Path on how to measure your PyTorch inference performance on Arm-based servers. You might be interested in learning how to use Keras Core with TensorFlow, PyTorch, and JAX backends.
4+
5+
recommended_path: "/learning-paths/servers-and-cloud-computing/keras-core/"
6+
7+
further_reading:
8+
- resource:
9+
title: PyTorch Benchmarks
10+
link: https://github.com/pytorch/benchmark
11+
type: website
12+
- resource:
13+
title: PyTorch Inference Performance Tuning on AWS Graviton Processors
14+
link: https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html
15+
type: documentation
16+
- resource:
17+
title: ML inference on Graviton CPUs with PyTorch
18+
link: https://github.com/aws/aws-graviton-getting-started/blob/main/machinelearning/pytorch.md
19+
type: documentation
20+
- resource:
21+
title: PyTorch Documentation
22+
link: https://pytorch.org/docs/stable/index.html
23+
type: documentation
24+
25+
26+
# ================================================================================
27+
# FIXED, DO NOT MODIFY
28+
# ================================================================================
29+
weight: 21 # set to always be larger than the content in this path, and one more than 'review'
30+
title: "Next Steps" # Always the same
31+
layout: "learningpathall" # All files under learning paths have this same wrapper
32+
---
Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
1+
---
2+
title: Measure and accelerate the inference performance of PyTorch models on Arm servers
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Before you begin
10+
The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. For this example, you need an Arm server instance with at least four cores and 8 GB of RAM. The instructions have been tested on AWS Graviton3 (c7g.4xlarge) instances.
11+
12+
## Overview
13+
PyTorch is a widely used machine learning framework for Python. In this Learning Path, you will measure the inference time of PyTorch models running on your Arm-based server using [PyTorch Benchmarks](https://github.com/pytorch/benchmark), a collection of open-source benchmarks designed to evaluate PyTorch performance. Understanding model inference latency is crucial for optimizing machine learning applications, especially in production environments where performance can significantly impact user experience and resource utilization. You will learn how to install the PyTorch Benchmarks suite and compare inference performance using PyTorch's two modes of execution: eager mode and `torch.compile` mode.
14+
15+
To begin, you need to set up your environment by installing the necessary dependencies and PyTorch. Follow these steps:
16+
17+
## Set Up the Environment
18+
19+
First, install Python and the required system packages:
20+
21+
```bash
sudo apt update
sudo apt install python-is-python3 python3-pip python3-venv -y
sudo apt-get install -y libgl1-mesa-glx
```
26+
27+
Next, use a virtual environment to manage your Python packages. Create and activate a virtual environment:
28+
29+
```bash
python3 -m venv venv
source venv/bin/activate
```
33+
34+
With your virtual environment active, install PyTorch and its related libraries:
35+
36+
```bash
pip install torch torchvision torchaudio
```
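
Before moving on, it can help to confirm what the virtual environment actually contains. The following is a minimal sketch and not part of the tested instructions: the helper name `describe_environment` is hypothetical, and it relies only on the Python standard library, so it also runs when PyTorch is missing and simply reports that.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("torch", "torchvision", "torchaudio")):
    """Collect interpreter, CPU architecture, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "machine": platform.machine(),  # typically "aarch64" on an Arm Linux server
        "in_venv": sys.prefix != sys.base_prefix,
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A value of `None` for any of the three packages suggests the `pip install` step did not run inside the active virtual environment.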
39+
40+
## Clone the PyTorch Benchmark Repository
41+
42+
Clone the PyTorch Benchmark repository and check out a specific commit you will use for performance evaluation:
43+
44+
```bash
git clone https://github.com/pytorch/benchmark.git
cd benchmark
git checkout 9a5e4137299741e1b6fb7aa7f5a6a853e5dd2295
```
49+
Install the PyTorch models you would like to benchmark. The following command installs a variety of NLP, computer vision, and recommender models:
50+
51+
```bash
python3 install.py alexnet BERT_pytorch dlrm hf_Albert hf_Bart hf_Bert hf_Bert_large hf_BigBird hf_DistilBert hf_GPT2 hf_Longformer hf_Reformer hf_T5 mobilenet_v2 mobilenet_v3_large resnet152 resnet18 resnet50 timm_vision_transformer
```
54+
55+
If you don't provide a model list to `install.py`, the script will download all the models included in the benchmark suite.
56+
57+
Before running the benchmarks, configure your AWS Graviton3 instance to use the available PyTorch inference optimizations. This includes settings to:
58+
* Enable bfloat16 GEMM kernel support to accelerate fp32 inference.
59+
* Set LRU cache capacity to an optimal value to avoid redundant primitive creation latency overhead.
60+
* Enable Linux Transparent Huge Page (THP) allocations, reducing the latency for tensor memory allocation.
61+
* Set the number of threads to use to match the number of cores on your system.
62+
63+
```bash
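export DNNL_DEFAULT_FPMATH_MODE=BF16   # bfloat16 GEMM kernels for fp32 inference
export THP_MEM_ALLOC_ENABLE=1          # Transparent Huge Page allocations
export LRU_CACHE_CAPACITY=1024         # primitive cache capacity
export OMP_NUM_THREADS=16              # match your core count; c7g.4xlarge has 16 vCPUs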
```
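
If you launch the benchmarks from a Python script rather than an interactive shell, the same settings can be passed through the child process environment. The following is a minimal sketch, assuming you want one OpenMP thread per CPU reported by the operating system. The helper name `graviton_env` is hypothetical; the variable names and values are taken from the export commands above.

```python
import os


def graviton_env(num_threads=None, base=None):
    """Return a copy of the environment with the tuning variables set."""
    env = dict(os.environ if base is None else base)
    if num_threads is None:
        # Assumption: one OpenMP thread per CPU reported by the operating system.
        num_threads = os.cpu_count() or 1
    env.update(
        {
            "DNNL_DEFAULT_FPMATH_MODE": "BF16",
            "THP_MEM_ALLOC_ENABLE": "1",
            "LRU_CACHE_CAPACITY": "1024",
            "OMP_NUM_THREADS": str(num_threads),
        }
    )
    return env


if __name__ == "__main__":
    settings = graviton_env()
    print("OMP_NUM_THREADS =", settings["OMP_NUM_THREADS"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting `run_benchmark.py`.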
69+
70+
With the environment set up and models installed, you can now run the benchmarks to measure your model inference performance.
71+
72+
Starting with PyTorch 2.0, there are two main execution modes: eager mode and `torch.compile` mode. Eager mode is the default; operations execute immediately as they are defined. With `torch.compile`, PyTorch code is transformed into graphs that can be executed more efficiently. This mode can improve inference performance, especially for models with repetitive computations.
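
To make the two modes concrete, here is a minimal sketch on a toy model. It is an illustration only, not how the benchmark suite measures latency. The two-layer model, the batch size, the helper names `median_latency_ms` and `compare_modes`, and the choice of the median are all assumptions made for this example. `compare_modes` imports PyTorch only when called, and the warm-up calls keep one-time compilation cost out of the timed samples.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=10):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up; for a compiled model the first call triggers compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare_modes():
    """Time a toy model in eager mode and in torch.compile mode."""
    import torch  # imported here so the timing helper works without PyTorch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()
    compiled_model = torch.compile(model)  # uses the default backend
    batch = torch.randn(32, 256)

    with torch.no_grad():
        eager_ms = median_latency_ms(lambda: model(batch))
        compiled_ms = median_latency_ms(lambda: compiled_model(batch))
    return eager_ms, compiled_ms
```

Calling `compare_modes()` inside the virtual environment returns both latencies. On a model this small, compilation may bring little or no benefit, so treat the numbers as a demonstration of the workflow rather than a performance result.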
73+
74+
Using the scripts included in the PyTorch Benchmarks suite, you will now measure the model inference latencies in both eager and `torch.compile` modes to compare their performance.
75+
76+
### Measure Eager Mode Performance
77+
78+
Run the following command to collect performance data in eager mode for the suite of models you downloaded:
79+
80+
```bash
python3 run_benchmark.py cpu --model alexnet,BERT_pytorch,dlrm,hf_Albert,hf_Bart,hf_Bert,hf_Bert_large,hf_BigBird,hf_DistilBert,hf_GPT2,hf_Longformer,hf_Reformer,hf_T5,mobilenet_v2,mobilenet_v3_large,resnet152,resnet18,resnet50,timm_vision_transformer --test eval --metrics="latencies"
```
83+
The results for all the models are stored in the `.userbenchmark/cpu/` directory. The `cpu` user benchmark creates a folder named `cpu-YYmmddHHMMSS` for the test and aggregates all test results into a JSON file named `metrics-YYmmddHHMMSS.json`, where `YYmmddHHMMSS` is the time you started the test. The metrics file shows the inference latency, in milliseconds, for each model you downloaded and ran. The results in eager mode should look like:
84+
85+
```output
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "2236df1770800ffea5697b11b0bb0d910b2e59e1"
    },
    "metrics": {
        "mobilenet_v3_large-eval_latency": 115.3942605,
        "mobilenet_v2-eval_latency": 99.127155,
        "resnet152-eval_latency": 1115.0839365,
        "hf_Albert-eval_latency": 134.34109999999998,
        "hf_Bert_large-eval_latency": 295.00577799999996,
        "hf_Bart-eval_latency": 149.313368,
        "resnet50-eval_latency": 469.561532,
        "hf_GPT2-eval_latency": 185.68859650000002,
        "hf_Longformer-eval_latency": 215.187826,
        "hf_DistilBert-eval_latency": 72.3893025,
        "dlrm-eval_latency": 21.344289500000002,
        "hf_BigBird-eval_latency": 367.279237,
        "BERT_pytorch-eval_latency": 67.36218,
        "resnet18-eval_latency": 42.107551,
        "hf_T5-eval_latency": 83.166863,
        "alexnet-eval_latency": 170.11994449999997,
        "hf_Reformer-eval_latency": 81.8123215,
        "timm_vision_transformer-eval_latency": 258.6363415,
        "hf_Bert-eval_latency": 118.3291215
    }
}
```
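
Reading the metrics file by hand gets tedious with many models. The following sketch finds the most recent metrics file and lists the models from slowest to fastest. It assumes the `metrics-*.json` file sits somewhere under `.userbenchmark/cpu/` (the search is recursive because the exact nesting is not spelled out here) and that the keys end in `-eval_latency`, as in the output above. The helper names are hypothetical.

```python
import json
from pathlib import Path

SUFFIX = "-eval_latency"


def latest_metrics_file(root=".userbenchmark/cpu"):
    """Return the newest metrics-*.json under root, or None if there is none."""
    # The YYmmddHHMMSS timestamp in the file name sorts chronologically as text.
    files = sorted(Path(root).rglob("metrics-*.json"), key=lambda p: p.name)
    return files[-1] if files else None


def load_latencies(path):
    """Map model name -> latency in milliseconds from one metrics file."""
    with open(path) as handle:
        metrics = json.load(handle)["metrics"]
    return {
        key[: -len(SUFFIX)]: value
        for key, value in metrics.items()
        if key.endswith(SUFFIX)
    }


if __name__ == "__main__":
    path = latest_metrics_file()
    if path is None:
        print("No metrics file found; run the benchmark first.")
    else:
        ranked = sorted(load_latencies(path).items(), key=lambda kv: kv[1], reverse=True)
        for model, latency in ranked:
            print(f"{model:28s} {latency:10.2f} ms")
```

Run the script from the root of the `benchmark` checkout so that the relative path resolves.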
114+
### Measure torch.compile Mode Performance
115+
116+
The `torch.compile` mode in PyTorch uses Inductor as its default backend. For execution on the CPU, the Inductor backend uses C++/OpenMP to generate optimized kernels for your model. Run the following command to collect performance data in `torch.compile` mode for the suite of models you downloaded:
117+
118+
```bash
python3 run_benchmark.py cpu --model alexnet,BERT_pytorch,dlrm,hf_Albert,hf_Bart,hf_Bert,hf_Bert_large,hf_BigBird,hf_DistilBert,hf_GPT2,hf_Longformer,hf_Reformer,hf_T5,mobilenet_v2,mobilenet_v3_large,resnet152,resnet18,resnet50,timm_vision_transformer --test eval --torchdynamo inductor --metrics="latencies"
```
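
The eager and `torch.compile` runs use the same command except for the `--torchdynamo inductor` option. As a sketch of how you might script both runs, the hypothetical helpers below assemble the argument list from a model list. They use only the flags shown in this Learning Path and assume they are run from the root of the `benchmark` checkout; the short model list is a subset chosen for illustration.

```python
import subprocess
import sys

MODELS = ["alexnet", "resnet18", "resnet50", "hf_Bert", "hf_GPT2"]  # subset for illustration


def build_command(models, use_compile=False):
    """Build the run_benchmark.py argument list for eager or torch.compile mode."""
    cmd = [
        sys.executable, "run_benchmark.py", "cpu",
        "--model", ",".join(models),
        "--test", "eval",
    ]
    if use_compile:
        cmd += ["--torchdynamo", "inductor"]
    cmd.append("--metrics=latencies")
    return cmd


def run_both_modes(models=MODELS):
    """Run the eager pass, then the torch.compile pass, stopping on failure."""
    for use_compile in (False, True):
        subprocess.run(build_command(models, use_compile), check=True)
```

Keeping the command construction in one function makes it harder for the two runs to drift apart, for example by benchmarking different model lists.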
121+
122+
The results for all the models are stored in the `.userbenchmark/cpu/` directory. The `cpu` user benchmark creates a folder named `cpu-YYmmddHHMMSS` for the test and aggregates all test results into a JSON file named `metrics-YYmmddHHMMSS.json`, where `YYmmddHHMMSS` is the time you started the test. The metrics file shows the inference latency, in milliseconds, for each model you downloaded and ran. The results in `torch.compile` mode should look like:
123+
124+
```output
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "2236df1770800ffea5697b11b0bb0d910b2e59e1"
    },
    "metrics": {
        "mobilenet_v3_large-eval_latency": 47.909326,
        "mobilenet_v2-eval_latency": 35.976583,
        "resnet152-eval_latency": 596.8526609999999,
        "hf_Albert-eval_latency": 87.863602,
        "hf_Bert_large-eval_latency": 282.57478649999996,
        "hf_Bart-eval_latency": 137.8793465,
        "resnet50-eval_latency": 245.21206,
        "hf_GPT2-eval_latency": 94.8732555,
        "hf_Longformer-eval_latency": 213.98017049999999,
        "hf_DistilBert-eval_latency": 65.187752,
        "dlrm-eval_latency": 18.2130865,
        "hf_BigBird-eval_latency": 281.18494050000004,
        "BERT_pytorch-eval_latency": 71.429891,
        "resnet18-eval_latency": 30.945619,
        "hf_T5-eval_latency": 124.513945,
        "alexnet-eval_latency": 123.83680100000001,
        "hf_Reformer-eval_latency": 58.992528,
        "timm_vision_transformer-eval_latency": 267.533416,
        "hf_Bert-eval_latency": 102.096192
    }
}
```
153+
You will notice that most of these models show lower inference latency when run in `torch.compile` mode with the Inductor backend. In this sample, 16 of the 19 models are faster, while `BERT_pytorch`, `hf_T5`, and `timm_vision_transformer` are slower.
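
To quantify the difference, you can divide each eager latency by the corresponding `torch.compile` latency; a ratio above 1 means `torch.compile` was faster. The sketch below does this for four of the models, with values copied from the two example outputs in this section. The function name `speedups` is hypothetical.

```python
def speedups(eager, compiled):
    """Return eager/compiled latency ratios for models present in both runs."""
    return {
        model: eager[model] / compiled[model]
        for model in eager
        if model in compiled and compiled[model] > 0
    }


# Latencies in milliseconds, copied from the example outputs in this section.
eager_ms = {
    "mobilenet_v2": 99.127155,
    "resnet50": 469.561532,
    "hf_GPT2": 185.68859650000002,
    "hf_T5": 83.166863,
}
compiled_ms = {
    "mobilenet_v2": 35.976583,
    "resnet50": 245.21206,
    "hf_GPT2": 94.8732555,
    "hf_T5": 124.513945,
}

for model, ratio in sorted(speedups(eager_ms, compiled_ms).items(), key=lambda kv: -kv[1]):
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{model:14s} {ratio:5.2f}x ({verdict} with torch.compile)")
```

In this sample, `hf_T5` comes out below 1, a reminder to check each model rather than assume compilation always helps.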
154+
155+
You have successfully run the PyTorch Benchmarks suite on a variety of models. You can experiment with the two execution modes and different optimization settings, check the performance, and choose the right settings for your model and use case.
156+
157+
158+
