[cherry-pick] update quick_run_demo and best_practices docs (#7210)

chenjiaoAngel · web-flow · commit 9dd6c2ecb96f · 2021-10-13T16:23:19.000+08:00
diff --git a/docs/benchmark/best_practices.md b/docs/benchmark/best_practices.md
@@ -0,0 +1,49 @@
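+# Best Practices
+
+Mobile and embedded devices have limited compute resources, so keeping your application resource-efficient is important. We have compiled a list of best practices and strategies that you can use to improve the performance of Paddle Lite models.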
+
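+## Choose the best model for the task
+
+Depending on the task, you need to trade off model complexity against model size.
+- If your task requires high accuracy, you may need a large and complex model.
+- For tasks with lower accuracy requirements, a smaller model is the better choice: it uses less disk space and memory, and is usually faster and more energy-efficient.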
+
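+## Profile your model
+
+Once you have selected a suitable candidate model, it is a good idea to profile and benchmark it. The Paddle Lite benchmarking tool, the [Profiler tool](./profiler.md), has a built-in profiler that reports per-operator profiling data. This helps you understand performance bottlenecks and see which operators account for most of the compute time.
+
+Using the per-operator data from the Profiler tool, you can optimize model performance along the following three lines:
+- Optimization based on the model and its algorithms
+- Optimization based on hardware characteristics
+- Optimization for a specific scenario or model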
+
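+### Optimization based on the model and its algorithms
+
+First, choose the smallest model that meets your requirements for inference, because smaller models are usually faster and more energy-efficient. Paddle Lite currently supports several optimization techniques, including quantization; see the [quantization documentation](../user_guides/quant_post_static.md) for details.
+
+Second, analyze the model structure to see whether any operators can be fused (for example, `convolution` and `batchnorm` can be fused into a single `convolution`) or whether any branches can be computed in parallel, so as to reduce the model's computation or I/O. Such cases should be rare, because Paddle Lite has already added most fused operators. However, if you find a better fusion opportunity, you can add support for it by following the [Pass documentation](../develop_guides/add_new_pass.md).
+
+Finally, analyze the algorithms behind the operators that account for a large share of the model's run time and check whether there is still room for optimization. Paddle Lite currently provides optimized versions of most operators; if you have a better implementation, you can add it by following the [new OP documentation](../develop_guides/add_operation.md).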
+
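+### Optimization based on hardware characteristics
+
+Based on the architecture of the hardware you use, check whether hot-spot operators (those that account for a large share of the model's run time) still have room for optimization. Paddle Lite already supports optimizations for most hardware; for ARM CPUs, for example, it provides optimized implementations for three classes of processors: A53, A35, and other processors such as A73 and A75. If you find that other hardware can be optimized further, you are welcome to add a new hardware-optimized implementation by following the [new hardware documentation](../develop_guides/add_hardware.md) or the [new OP documentation](../develop_guides/add_operation.md).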
+
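+### Optimization for a specific scenario or model
+
+For your current use case, analyze how much time each part of the application takes, pick the parts with the largest share, and optimize them by other means to improve the performance of the whole application. For example, if the application includes pre-processing and post-processing, you can add hardware-specific optimized implementations of those steps (such as OpenCV image-processing operators implemented in ARM assembly; Paddle Lite already provides optimized implementations of some [image operators](../api_reference/cv.md) that you can call) to further improve overall application performance.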
+
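+## Profile with third-party tools
+
+Third-party tools such as [Android Profiler](https://developer.android.google.cn/studio/profile/android-profiler) and [Instruments](https://help.apple.com/instruments/mac/current/) provide a wealth of profiling information that can be used to debug your application. Sometimes the problem is not in the model but in the application code that interacts with it. Make sure you are familiar with the platform-specific profiling tools and the best practices for your platform.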
+
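+## Optimize with heterogeneous hardware
+
+Paddle Lite has added several ways to accelerate models on faster hardware (such as GPUs, NPUs, and APUs), and also supports heterogeneous acceleration across multiple devices, for example running a detection model across an ARM CPU and an NPU.
+
+>> Please note:
+- Some accelerators work better for certain types of models
+- Some new hardware supports only floating-point models or models optimized in a specific way
+- Be sure to benchmark each hardware type to see whether it suits your application
+
+For example, if you have a very small model, running it on a GPU may not be worthwhile. Conversely, for a large model with high arithmetic intensity, a GPU is a good choice.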
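Reviewer note on the operator-fusion paragraph in `best_practices.md`: the claim that `convolution` and `batchnorm` can be fused into one `convolution` follows from folding the batchnorm scale and shift into the convolution's weights and bias. The sketch below illustrates only that arithmetic. It is not Paddle Lite's fusion pass; the `BatchNormParams` struct, the `FoldBatchNormIntoConv` function, and the flattened `[out_channels][elements_per_channel]` weight layout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for per-output-channel batchnorm parameters.
struct BatchNormParams {
  std::vector<float> gamma;     // scale
  std::vector<float> beta;      // shift
  std::vector<float> mean;      // running mean
  std::vector<float> variance;  // running variance
  float epsilon = 1e-5f;
};

// Folds y = gamma * (conv(x) - mean) / sqrt(variance + epsilon) + beta
// into the convolution's own weights and bias, in place.
// Assumption: `weights` is flattened as [out_channels][elements_per_channel].
inline void FoldBatchNormIntoConv(std::vector<float>& weights,
                                  std::vector<float>& bias,
                                  const BatchNormParams& bn) {
  const std::size_t out_channels = bias.size();
  assert(out_channels > 0 && weights.size() % out_channels == 0);
  assert(bn.gamma.size() == out_channels && bn.beta.size() == out_channels);
  assert(bn.mean.size() == out_channels && bn.variance.size() == out_channels);
  const std::size_t per_channel = weights.size() / out_channels;
  for (std::size_t c = 0; c < out_channels; ++c) {
    // Per-channel multiplier applied by batchnorm.
    const float k = bn.gamma[c] / std::sqrt(bn.variance[c] + bn.epsilon);
    for (std::size_t i = 0; i < per_channel; ++i) {
      weights[c * per_channel + i] *= k;
    }
    bias[c] = (bias[c] - bn.mean[c]) * k + bn.beta[c];
  }
}
```

After folding, the batchnorm operator can be dropped, which removes one pass over the feature map at inference time. For a single channel with weight 2, bias 1, gamma 3, beta 0.5, mean 1, variance 4, and epsilon 0, the folded weight is 3 and the folded bias is 0.5, so an input of 2 yields 6.5 both before and after fusion.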
diff --git a/docs/index.rst b/docs/index.rst
@@ -26,6 +26,7 @@ Welcome to Paddle-Lite's documentation!
   
   benchmark/benchmark
   benchmark/benchmark_tools
+  benchmark/best_practices
 
 .. toctree::
   :maxdepth: 1
@@ -38,6 +39,7 @@ Welcome to Paddle-Lite's documentation!
   quick_start/java_demo
   quick_start/python_demo
   quick_start/quant_post_dynamic_demo
+  quick_start/quick_run_demo
   quick_start/roadmap
 
 .. toctree::
diff --git a/docs/quick_start/cpp_demo.md b/docs/quick_start/cpp_demo.md
@@ -80,7 +80,7 @@ auto output_data=output_tensor->data<float>();
 To compile and run the Android C++ demo, you need:
 
 * A computer that can build Paddle Lite; for the environment setup, see the [documentation](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html). Using Docker is recommended.
-* 一台安卓手机，并在电脑上安装 adb工具 ，以确保电脑和手机可以通过 adb 连接。
+* 一台安卓手机，并在电脑上安装 adb 工具 ，以确保电脑和手机可以通过 adb 连接。
 
 ### 2. Download or build the inference library
 (1) Download the inference library
diff --git a/docs/quick_start/quick_run_demo.md b/docs/quick_start/quick_run_demo.md
@@ -0,0 +1,82 @@
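+# Try Paddle Lite
+## Overview
+This tutorial shows how to quickly run inference with Paddle Lite, assuming the model has already been converted and the inference library has already been built, so that you can obtain the final inference performance and accuracy data.
+It uses an Android CPU as the example and walks through the Mask Detection demo.
+
+## Environment setup
+Environment setup here covers two parts: downloading the inference library and preparing the Android phone.
+
+### Android phone environment
+Prepare an Android phone and install the adb tool on your computer so that the computer and the phone can connect via adb.
+>> Note: Connect the phone to the computer over USB and open `Settings -> Developer mode -> USB debugging -> Allow (authorize) this computer to debug the phone`. Make sure the [adb tool](https://developer.android.com/studio/command-line/adb) is installed on the computer, then run the following command to confirm that the phone has been recognized.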
+
+``` shell
+adb devices
+# If the phone is recognized correctly, output like the following is printed
+List of devices attached
+017QXM19C1000664	device
+```
+
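+### Download the inference library
+On the [Lite prebuilt library download](release_lib) page, choose the version that matches your phone model and runtime requirements.
+
+Taking the **Android ARMv8 architecture** as an example, you can download the following version: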
+
+| Arch  | with_extra | arm_stl | with_cv | Download |
+|:-------:|:-----:|:-----:|:-----:|:-------:|
+| armv8 | OFF | c++_static | OFF |[ 2.9-rc ](https://github.com/PaddlePaddle/Paddle-Lite/releases/download/v2.9/inference_lite_lib.android.armv8.gcc.c++_static.tar.gz)|
+
+**After extraction, the directory structure is as follows:**
+
+```shell
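+inference_lite_lib.android.armv8          Paddle Lite inference library
+├── cxx                                       C++ inference library
+│   ├── include                                   C++ inference library headers
+│   └── lib                                       C++ inference library files
+│       ├── libpaddle_api_light_bundled.a             static inference library
+│       └── libpaddle_light_api_shared.so             shared inference library
+├── demo                                      demos
+│   ├── cxx                                       C++ demos
+│   │   └── mask_detection                            mask_detection demo folder
+│   │       ├── MakeFile                                  MakeFile used to build the executable
+│   │       ├── mask_detection.cc                         inference source file using the C++ API
+│   │       ├── prepare.sh                                script that downloads the models and test image and prepares the runtime environment
+│   │       └── run.sh                                    script that runs the mask_detection executable
+│   └── java                                      Java demos
+└── java                                      Java inference library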
+
+## Run
+Once the environment is ready, follow the steps below to run Mask Detection inference and obtain the model's performance and accuracy data.
+
+```shell
+cd inference_lite_lib.android.armv8/demo/cxx/mask_detection
+
+# Prepare the files for deployment
+bash prepare.sh
+
+# Run inference
+cd mask_demo && bash run.sh
+
+# After a successful run, the console prints the following; open test_img_result.jpg to view the prediction result
+../mask_demo/: 9 files pushed, 0 skipped. 141.6 MB/s (28652282 bytes in 0.193s)
+Load detecion model succeed.
+
+======= benchmark summary =======
+model_dir: pyramidbox_lite_v2_9_1_opt2.nb
+repeats: 100
+*** time info(ms) ***
+1st_duration: 124.481
+max_duration: 123.179
+min_duration: 40.093
+avg_duration: 41.2289
+detection pre_process time: 4.924
+Detecting face succeed.
+
+Load classification model succeed.
+detect face, location: x=237, y=107, width=194, height=255, wear mask: 1, prob: 0.987625
+detect face, location: x=61, y=238, width=166, height=213, wear mask: 1, prob: 0.925679
+detect face, location: x=566, y=176, width=245, height=294, wear mask: 1, prob: 0.550348
+write result to file: test_img_result.jpg, success.
+/data/local/tmp/mask_demo/test_img_result.jpg: 1 file pulled, 0 skipped. 28.0 MB/s (279080 bytes in 0.010s)
+```
diff --git a/lite/demo/cxx/mask_detection/mask_detection.cc b/lite/demo/cxx/mask_detection/mask_detection.cc
@@ -12,6 +12,8 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
+#include <sys/time.h>
+#include <time.h>
 #include <iostream>
 #include <string>
 #include <vector>
@@ -132,6 +134,12 @@ void pre_process(const cv::Mat& img,
   neon_mean_scale(dimg, data, width * height, mean, scale);
 }
 
+inline double GetCurrentUS() {
+  struct timeval time;
+  gettimeofday(&time, NULL);
+  return 1e+6 * time.tv_sec + time.tv_usec;
+}
+
 void RunModel(std::string det_model_file,
               std::string class_model_file,
               std::string img_path) {
@@ -160,10 +168,44 @@ void RunModel(std::string det_model_file,
   // Do PreProcess
   std::vector<float> detect_mean = {104.f, 117.f, 123.f};
   std::vector<float> detect_scale = {0.007843, 0.007843, 0.007843};
+  auto start_pre = GetCurrentUS();
   pre_process(img, s_width, s_height, detect_mean, detect_scale, data, false);
+  auto detec_pre_end = (GetCurrentUS() - start_pre) / 1000.0;
 
   // Detection Model Run
-  predictor->Run();
+  double sum_duration = 0.0;  // all durations below are in milliseconds
+  double max_duration = 1e-5;
+  double min_duration = 1e5;
+  double avg_duration = -1;
+  double first_duration = -1;
+  auto repeats = 100;
+  for (int widx = 0; widx < 10; widx++) {
+    if (widx == 0) {
+      auto start0 = GetCurrentUS();
+      predictor->Run();
+      first_duration = (GetCurrentUS() - start0) / 1000.0;
+    } else {
+      predictor->Run();
+    }
+  }
+  for (int ridx = 0; ridx < repeats; ridx++) {
+    auto start0 = GetCurrentUS();
+    predictor->Run();
+    auto duration = (GetCurrentUS() - start0) / 1000.0;
+    sum_duration += duration;
+    max_duration = duration > max_duration ? duration : max_duration;
+    min_duration = duration < min_duration ? duration : min_duration;
+  }
+  avg_duration = sum_duration / static_cast<float>(repeats);
+  std::cout << "\n======= benchmark summary =======\n"
+            << "model_dir: " << det_model_file << "\n"
+            << "repeats: " << repeats << "\n"
+            << "*** time info(ms) ***\n"
+            << "1st_duration: " << first_duration << "\n"
+            << "max_duration: " << max_duration << "\n"
+            << "min_duration: " << min_duration << "\n"
+            << "avg_duration: " << avg_duration << "\n"
+            << "detection pre_process time: " << detec_pre_end << "\n";
 
   // Get Output Tensor
   std::unique_ptr<const Tensor> output_tensor0(
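Reviewer note on the timing loop added to `mask_detection.cc`: the warm-up and repeat logic could be factored into a reusable helper. The sketch below is one possible shape, not what the commit does. It swaps `gettimeofday` for `std::chrono::steady_clock`, which is monotonic and so is not affected by wall-clock adjustments during a run. The names `BenchSummary`, `TimeOnceMs`, and `Benchmark` are hypothetical; the callable stands in for `predictor->Run()`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Hypothetical summary type; fields mirror the values the demo prints.
struct BenchSummary {
  double first_ms = 0.0;  // latency of the very first (cold) run
  double min_ms = std::numeric_limits<double>::max();
  double max_ms = 0.0;
  double avg_ms = 0.0;
};

// Runs `fn` once and returns the elapsed time in milliseconds,
// measured on a monotonic clock.
inline double TimeOnceMs(const std::function<void()>& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Runs `fn` `warmup` times (timing only the first call), then `repeats`
// timed iterations, and reports min/max/avg over the timed iterations.
inline BenchSummary Benchmark(const std::function<void()>& fn,
                              int warmup,
                              int repeats) {
  assert(warmup >= 1 && repeats >= 1);
  BenchSummary summary;
  summary.first_ms = TimeOnceMs(fn);
  for (int i = 1; i < warmup; ++i) {
    fn();
  }
  double sum_ms = 0.0;
  for (int i = 0; i < repeats; ++i) {
    const double ms = TimeOnceMs(fn);
    sum_ms += ms;
    summary.min_ms = std::min(summary.min_ms, ms);
    summary.max_ms = std::max(summary.max_ms, ms);
  }
  summary.avg_ms = sum_ms / repeats;
  return summary;
}
```

With such a helper, the demo's loop would reduce to something like `Benchmark([&] { predictor->Run(); }, 10, 100)`, using the same warm-up count of 10 and repeat count of 100 that the commit hard-codes. Because the first run happens during warm-up, `first_ms` can exceed `max_ms`, which matches the sample output in `quick_run_demo.md`.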
diff --git a/lite/demo/cxx/mask_detection/prepare.sh b/lite/demo/cxx/mask_detection/prepare.sh
@@ -17,8 +17,29 @@ if [ ! -f "mask_models_img.tar.gz" ];
 then
    wget -c https://paddle-inference-dist.cdn.bcebos.com/PaddleLiteDemo/mask_models_img.tar.gz 
 fi
+
+# v2.9.1 model. If you need a model for another version, such as v2.8, you can use the following command:
+# wget -c https://paddlelite-demo.bj.bcebos.com/models/pyramidbox_lite_fp32_for_cpu_v2_8_0.tar.gz
+# For other versions, just change the string "v2_9_1" to the corresponding version string
+if [ ! -f "pyramidbox_lite_fp32_for_cpu_v2_9_1.tar.gz" ];
+then
+wget -c https://paddlelite-demo.bj.bcebos.com/models/pyramidbox_lite_fp32_for_cpu_v2_9_1.tar.gz
+fi
+
+# v2.9.1 model
+if [ ! -f "mask_detector_fp32_128_128_for_cpu_v2_9_1.tar.gz" ];
+then
+wget -c https://paddlelite-demo.bj.bcebos.com/models/mask_detector_fp32_128_128_for_cpu_v2_9_1.tar.gz
+fi
+
 tar zxf mask_models_img.tar.gz
+tar zxf pyramidbox_lite_fp32_for_cpu_v2_9_1.tar.gz
+mv model.nb pyramidbox_lite_v2_9_1_opt2.nb
+tar zxf mask_detector_fp32_128_128_for_cpu_v2_9_1.tar.gz
+mv model.nb mask_detector_v2_9_1_opt2.nb
 mv mask_models_img ${gf}
+mv pyramidbox_lite_v2_9_1_opt2.nb ${gf}
+mv mask_detector_v2_9_1_opt2.nb ${gf}
 
 # clean
 make clean
diff --git a/lite/demo/cxx/mask_detection/run.sh b/lite/demo/cxx/mask_detection/run.sh
@@ -5,8 +5,8 @@ mask_demo_path="/data/local/tmp/mask_demo"
 adb shell "cd ${mask_demo_path} \
            && export LD_LIBRARY_PATH=${mask_demo_path}:${LD_LIBRARY_PATH} \
            && ./mask_detection \
-                mask_models_img/pyramidbox_lite_opt2.nb \
-                mask_models_img/mask_detector_opt2.nb \
+                pyramidbox_lite_v2_9_1_opt2.nb \
+                mask_detector_v2_9_1_opt2.nb \
                 mask_models_img/test_img.jpg"
 
 adb pull ${mask_demo_path}/test_img_result.jpg .