Commit 68f6c20

kirklandsign authored and facebook-github-bot committed

ET XNNPACK performance showcase

Summary: Add numbers for performance comparison in docs.

Preview: https://www.internalfb.com/code/fbsource/[D48704724-V10]/fbcode/executorch/examples/backend/README.md

Reviewed By: guangy10

Differential Revision: D48704724

fbshipit-source-id: 2bd3238f514e84c39d5f8393b998bad796fa672e

1 parent 822574b

File tree: 1 file changed, +66 -0 lines

examples/backend/README.md

@@ -31,3 +31,69 @@ Once we have the model binary (pte) file, then let's run it with Executorch runt
```bash
buck2 run examples/backend:xnn_executor_runner -- --model_path ./mv2_xnnpack_q8.pte
```

## XNNPACK performance gain

### Overview

We tested the performance for MobileNet V2 and MobileNet V3 on Linux x86 and Mac (Apple Silicon) platforms.

For each model we export three variants: portable (without any optimization), xnnpack fp32 (exported for XNNPACK delegation without quantization), and xnnpack q8 (exported for XNNPACK delegation with qint8 quantization).

We build a benchmarking binary (to be released in the near future; it is similar to `examples/backend:xnn_executor_runner`). By default, the benchmarking binary runs 10 warmup iterations followed by 50 benchmarking iterations. The numbers reported here are the average measured latency, in ms, across those 50 runs. The first iteration is slower due to warmup, while performance is stable on subsequent iterations, so we also report the first-iteration execution time for reference. The tables below list the model execution time for the first iteration and for subsequent iterations (average after warmup), in milliseconds. All models are run with a single thread. Details about the methodology and repro steps are below the tables.
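The timing protocol described above can be sketched in Python. This is an illustrative sketch only, not the benchmarking binary itself; `run_model` is a hypothetical stand-in for a single inference call:

```python
import math
import time

def benchmark(run_model, warmup=10, iters=50):
    """Sketch of the protocol: time the first iteration separately,
    finish the warmup phase, then average `iters` measured runs."""
    start = time.perf_counter()
    run_model()
    first_ms = (time.perf_counter() - start) * 1000.0  # reported for reference

    for _ in range(warmup - 1):  # remainder of the warmup phase
        run_model()

    total = 0.0
    for _ in range(iters):
        start = time.perf_counter()
        run_model()
        total += time.perf_counter() - start
    # Reported averages are floored, per the methodology below.
    avg_ms = math.floor(total / iters * 1000.0)
    return first_ms, avg_ms
```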

### Methodology

Models are exported with the steps above for XNNPACK delegation, and with `examples/export:export_example` for the portable backend without any optimization. We then run them with `//examples/backend:xnn_executor_runner` with the profiler enabled (command listed below); in the future, the runtime in `//sdk/runners:executor_runner` can be used instead, since it offers more options (such as the number of iterations) once build rules for OSS are added.

```bash
buck run -c executorch.prof_enabled=true -c executorch.prof_buf_size=8096 -c executorch.num_prof_blocks=61 //examples/backend:xnn_executor_runner -- --model_path mv3.pte
```

A rough execution time can be obtained from the log timestamps. The profiler results can be analyzed with `profiler:profiler_results_cli`.

```bash
buck run //profiler:profiler_results_cli -- --prof_results_bin=prof_result.bin
```

Run: we use 60 iterations. The first iteration is usually slower due to warmup, but performance from the second iteration onward is stable and reliable. We note down the execution time of the first iteration; for the average execution time, we drop the first 10 iterations and average the next 50.

Number we use: the "run model" time reported by the profiler_results_cli tool, which represents the time to execute the model for one iteration. The numbers in the report are floored.

### Results

MobileNet V2 - Linux x86

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 25690                | 25480                     |
| xnnpack fp32 | 21                   | 10                        |
| xnnpack q8   | 18                   | 11                        |

MobileNet V2 - Mac

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 17743                | 17852                     |
| xnnpack fp32 | 21                   | 16                        |
| xnnpack q8   | 20                   | 18                        |

MobileNet V3 - Linux x86

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 4938                 | 4975                      |
| xnnpack fp32 | 15                   | 8                         |
| xnnpack q8   | 343                  | 323                       |

Note: MV3 q8 does not have quantized hardsigmoid and hardswish because XNNPACK currently does not support quantized versions of these ops. Our current quantized partitioner only partitions quantized operators, so these floating-point ops are not lowered and instead run on the portable backend. Ops running on portable account for the worse performance of MV3 q8. We will eventually release a mixed-datatype partitioner to fix this.

MobileNet V3 - Mac

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 3427                 | 3394                      |
| xnnpack fp32 | 7                    | 4                         |
| xnnpack q8   | 206                  | 201                       |
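
As a quick sanity check on the tables above, the speedup of the delegated variants over portable can be computed directly from the reported subsequent-iteration averages. This is illustrative arithmetic only, using the numbers as printed:

```python
# Subsequent-iteration averages (ms) copied from the tables above.
subsequent_ms = {
    ("mv2", "linux"): {"portable": 25480, "xnnpack fp32": 10, "xnnpack q8": 11},
    ("mv2", "mac"):   {"portable": 17852, "xnnpack fp32": 16, "xnnpack q8": 18},
    ("mv3", "linux"): {"portable": 4975,  "xnnpack fp32": 8,  "xnnpack q8": 323},
    ("mv3", "mac"):   {"portable": 3394,  "xnnpack fp32": 4,  "xnnpack q8": 201},
}

def speedup(model, platform, backend):
    """Speedup factor of a delegated backend over portable."""
    row = subsequent_ms[(model, platform)]
    return row["portable"] / row[backend]

# e.g. xnnpack fp32 runs MobileNet V2 roughly 2548x faster than
# portable on Linux x86 (25480 ms / 10 ms).
```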
