Commit 68f6c20

kirklandsign authored and facebook-github-bot committed

ET XNNPACK performance showcase

Summary: Add numbers for performance comparison in docs.

Preview: https://www.internalfb.com/code/fbsource/[D48704724-V10]/fbcode/executorch/examples/backend/README.md

Reviewed By: guangy10

Differential Revision: D48704724

fbshipit-source-id: 2bd3238f514e84c39d5f8393b998bad796fa672e

1 parent 822574b

File tree: 1 file changed, +66 -0 lines

examples/backend/README.md

@@ -31,3 +31,69 @@ Once we have the model binary (pte) file, then let's run it with Executorch runt
```bash
buck2 run examples/backend:xnn_executor_runner -- --model_path ./mv2_xnnpack_q8.pte
```

## XNNPACK performance gain

### Overview

We tested the performance for MobileNet V2 and MobileNet V3 on Linux x86 and Mac (Apple Silicon) platforms.

For each model we export three variants: portable (without any optimization), xnnpack fp32 (exported for XNNPACK delegation without quantization), and xnnpack q8 (exported for XNNPACK delegation with qint8 quantization).

We build a benchmarking binary (to be released in the near future; it is similar to `examples/backend:xnn_executor_runner`). By default, the benchmarking binary runs 10 warmup iterations followed by 50 benchmarking iterations. The numbers reported here are the average measured latency, in ms, across those 50 runs. The first iteration is slower due to warmup, while performance is stable on subsequent iterations, so we also report the first-iteration execution time for reference. The tables below list the model execution time for the first iteration and for subsequent iterations (average after warmup), in milliseconds. All models are run with a single thread. Details about the methodology and repro steps are below the tables.
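The timing protocol described above can be sketched in Python. This is an illustrative sketch only, not the benchmarking binary itself; `run_model` is a hypothetical stand-in for a single inference call:

```python
import math
import time

def benchmark(run_model, warmup=10, iters=50):
    """Sketch of the protocol: time the first iteration separately,
    finish the warmup phase, then average `iters` measured runs."""
    start = time.perf_counter()
    run_model()
    first_ms = (time.perf_counter() - start) * 1000.0  # reported for reference

    for _ in range(warmup - 1):  # remainder of the warmup phase
        run_model()

    total = 0.0
    for _ in range(iters):
        start = time.perf_counter()
        run_model()
        total += time.perf_counter() - start
    # Reported averages are floored, per the methodology below.
    avg_ms = math.floor(total / iters * 1000.0)
    return first_ms, avg_ms
```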

### Methodology

Models are exported with the steps above for XNNPACK delegation, and with `examples/export:export_example` for the portable backend without any optimization. We then run them with `//examples/backend:xnn_executor_runner` with the profiler enabled (command listed below); in the future, the runtime in `//sdk/runners:executor_runner` can be used instead, since it offers more options (such as the number of iterations) once build rules for OSS are added.

```bash
buck run -c executorch.prof_enabled=true -c executorch.prof_buf_size=8096 -c executorch.num_prof_blocks=61 //examples/backend:xnn_executor_runner -- --model_path mv3.pte
```

A rough execution time can be obtained from the log timestamps. The profiler results can be analyzed with `profiler:profiler_results_cli`.

```bash
buck run //profiler:profiler_results_cli -- --prof_results_bin=prof_result.bin
```

Run: we use 60 iterations. The first iteration is usually slower due to warmup, but performance from the second iteration onward is stable and reliable. We note down the execution time of the first iteration; for the average execution time, we drop the first 10 iterations and average the next 50.

Number we use: the "run model" time reported by the profiler_results_cli tool, which represents the time to execute the model for one iteration. The numbers in the report are floored.

### Results

MobileNet V2 - Linux x86

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 25690                | 25480                     |
| xnnpack fp32 | 21                   | 10                        |
| xnnpack q8   | 18                   | 11                        |

MobileNet V2 - Mac

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 17743                | 17852                     |
| xnnpack fp32 | 21                   | 16                        |
| xnnpack q8   | 20                   | 18                        |

MobileNet V3 - Linux x86

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 4938                 | 4975                      |
| xnnpack fp32 | 15                   | 8                         |
| xnnpack q8   | 343                  | 323                       |

Note: MV3 q8 does not have quantized hardsigmoid and hardswish because XNNPACK currently does not support quantized versions of these ops. Our current quantized partitioner only partitions quantized operators, so these floating-point ops are not lowered and instead run on the portable backend. Ops running on portable account for the worse performance of MV3 q8. We will eventually release a mixed-datatype partitioner to fix this.

MobileNet V3 - Mac

| backend      | first iteration (ms) | subsequent iteration (ms) |
|--------------|----------------------|---------------------------|
| portable     | 3427                 | 3394                      |
| xnnpack fp32 | 7                    | 4                         |
| xnnpack q8   | 206                  | 201                       |
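
As a quick sanity check on the tables above, the speedup of the delegated variants over portable can be computed directly from the reported subsequent-iteration averages. This is illustrative arithmetic only, using the numbers as printed:

```python
# Subsequent-iteration averages (ms) copied from the tables above.
subsequent_ms = {
    ("mv2", "linux"): {"portable": 25480, "xnnpack fp32": 10, "xnnpack q8": 11},
    ("mv2", "mac"):   {"portable": 17852, "xnnpack fp32": 16, "xnnpack q8": 18},
    ("mv3", "linux"): {"portable": 4975,  "xnnpack fp32": 8,  "xnnpack q8": 323},
    ("mv3", "mac"):   {"portable": 3394,  "xnnpack fp32": 4,  "xnnpack q8": 201},
}

def speedup(model, platform, backend):
    """Speedup factor of a delegated backend over portable."""
    row = subsequent_ms[(model, platform)]
    return row["portable"] / row[backend]

# e.g. xnnpack fp32 runs MobileNet V2 roughly 2548x faster than
# portable on Linux x86 (25480 ms / 10 ms).
```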
