[M1 cherry-pick] Fix M1-arm64 build_optimize_tool and benchmark tool bugs (#8925)

chenjiaoAngel · web-flow · commit 0a0a2d24973b · 2022-04-25T14:20:29.000+08:00
diff --git a/cmake/configure.cmake b/cmake/configure.cmake
@@ -297,6 +297,10 @@ if (LITE_WITH_ARM82_FP16)
   add_definitions("-DLITE_WITH_ARM82_FP16")
 endif(LITE_WITH_ARM82_FP16)
 
+if (LITE_WITH_M1)
+  add_definitions("-DLITE_WITH_M1")
+endif(LITE_WITH_M1)
+
 if (WITH_CONVERT_TO_SSA STREQUAL ON)
   add_definitions("-DWITH_CONVERT_TO_SSA")
 endif(WITH_CONVERT_TO_SSA)
diff --git a/docs/index.rst b/docs/index.rst
@@ -60,6 +60,7 @@ Welcome to Paddle-Lite's documentation!
   user_guides/quant_aware
   user_guides/model_visualization
   user_guides/profiler
+  user_guides/sparse
 
 .. toctree::
   :maxdepth: 1
diff --git a/docs/user_guides/model_optimize_tool.md b/docs/user_guides/model_optimize_tool.md
@@ -28,13 +28,22 @@ pip install x2paddle
 ./lite/tools/build.sh build_optimize_tool
 ```
 
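-If building the opt tool fails on arm64 macOS, try deleting the third-party directory, running `git checkout third-party` again, and then changing the previous command to
+If building the opt tool fails on arm64 macOS, try one of the following:
+
+- Method 1: delete the third-party directory, run `git checkout third-party` again, and then change the previous command to:
+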
 ```shell
 arch -x86_64 ./lite/tools/build.sh build_optimize_tool
 ```
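-This command builds an x86 opt binary, which does not affect normal use of the tool. After a successful build, the executable opt is generated under ./build.opt/lite/api
+  This command builds an x86 opt binary, which does not affect normal use of the tool. After a successful build, the executable opt is generated under ./build.opt/lite/api
+- Method 2: build with the `build_macos.sh` script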
+
+```shell
+./lite/tools/build_macos.sh build_optimize_tool
+```
+
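+[Using the opt executable](./opt/opt_bin)
 
- [Using the opt executable](./opt/opt_bin)
 ## Exporting a Paddle Lite-supported format with X2Paddle
 
 **Background**: To run models from third-party sources (TensorFlow, Caffe, ONNX, PyTorch) with Paddle Lite, two conversions are generally needed: first use X2Paddle to convert the third-party model to PaddlePaddle format, then use opt to convert the PaddlePaddle model to a format supported by Paddle Lite.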
diff --git a/docs/user_guides/sparse.md b/docs/user_guides/sparse.md
@@ -0,0 +1,83 @@
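+# Unstructured sparsity for models
+
+Common sparsity approaches fall into structured and unstructured sparsity. The former prunes convolutions and matrix multiplications along a specific dimension (feature channels, convolution kernels, and so on) and produces a smaller model structure, so existing convolution and matrix-multiplication kernels can be reused and no special inference operators are needed. The latter sparsifies at the granularity of individual parameters; it does not change the shape of the parameter matrix, which simply becomes a sparse matrix with many zeros, so it depends more heavily on how well the inference library and hardware accelerate sparse matrix operations. For more background, see [this technical article](https://mp.weixin.qq.com/s/l__C5IOu3z7uQdcWKnViOw).
+
+From an inference perspective, this document describes how to use the Paddle Lite toolchain to get better performance from sparse models.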
+
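+## Unstructured sparse training
+
+### 1 Overview
+
+Sparse training uses the full training dataset to sparsify an already-trained dense model. During training, the method optimizes only the important parameters and sets the unimportant ones to zero, which preserves the accuracy of the sparse model.
+
+Prerequisites:
+
+- A pretrained model
+- The full training dataset
+
+Steps:
+
+-  Produce a sparse model: call the sparse training API in PaddleSlim to produce a sparse model
+-  Run sparse-model inference: load the sparse model with Paddle Lite and run inference
+
+Advantages:
+
+-  Reduces computation, lowers runtime memory use, and shrinks FP32 model size
+-  Sparsity has little impact on model accuracy
+
+Disadvantages:
+
+-  Requires the full dataset and a relatively long training time
+
+We recommend first using the [virtual sparsity](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/paddleslim/auto_compression/utils/prune_model.py#L12) API to quickly sparsify the dense inference model (this guarantees only the sparsity ratio, not accuracy), and then running inference with the resulting sparse model. If its performance falls short of or exceeds the requirement, increase or decrease the sparsity ratio accordingly. Finally, start sparse training with the sparsity ratio that fits.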
+
+### 2 Producing a sparse model
+
+Currently, sparse training in PaddleSlim mainly targets 1x1 convolutions, whose operator is conv2d. Paddle Lite can run models produced by PaddleSlim sparse training, which speeds up model execution on mobile devices.
+
+Tip: if you are new to the PaddlePaddle framework, we recommend starting with the [user documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/index_cn.html).
+
+To train a sparse model with the PaddleSlim model compression toolkit, see:
+* Sparse training API: [dynamic graph](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/dygraph/pruners/unstructured_pruner.rst)|[static graph](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/static/prune/unstructured_prune_api.rst)
+* Sparse training demo: [dynamic graph](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/demo/dygraph/unstructured_pruning)| [static graph](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/demo/unstructured_prune)
+
+
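+### 3 Running sparse-model inference with Paddle Lite
+
+First, use the model conversion tool provided by Paddle Lite (model_optimize_tool) to convert the sparse model into a model for mobile inference; then load the converted model for inference and deployment.
+
+#### 3.1 Model conversion
+
+See [Model conversion](../user_guides/model_optimize_tool.md) to prepare the conversion tool; downloading it from the Release page is recommended.
+
+See [Model conversion](../user_guides/model_optimize_tool.md) for how to use the tool, and set the parameters to match your setup. For example, for inference on the ARM CPU of an Android phone, the conversion command is: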
+
+```bash
+./OPT --model_dir=./mobilenet_v1_quant \
+      --optimize_out_type=naive_buffer \
+      --optimize_out=mobilenet_v1_quant_opt \
+      --valid_targets=arm \
+      --sparse_model=true --sparse_threshold=0.5
+```
+
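+Note that the sparse_model and sparse_threshold parameters above control whether sparse optimization is applied to the model:
+
+ - When sparse_model=false, sparse optimization is off and no parameters are sparsified
+ - When sparse_model=true, sparse optimization is on
+    - A parameter matrix whose sparsity is greater than sparse_threshold is converted to sparse form
+    - A parameter matrix whose sparsity is less than sparse_threshold is left dense
+
+#### 3.2 Sparse-model inference
+
+Like FP32 models, the converted sparse model can be loaded for inference in an Android app; see the [C++ Demo](./cpp_demo.md).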
+
+
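+### FAQ
+
+**Question**: After model optimization (*.nb file), why is the sparse FP32 model smaller than the dense FP32 model, while the sparse INT8 model is larger than the dense INT8 model?
+
+**Answer**: This can happen. The sparse format saves storage for some INT8 parameters but introduces INT32 indices, so in theory the INT8 model grows somewhat when sparsity is below 75%.
+
+**Question**: What is the current scope of unstructured sparsity?
+  
+**Answer**: For inference, PaddleLite-2.11 supports unstructured and semi-structured sparsity (with a 2x1 block as the sparsification unit) for 1x1 convolutions; sparsity for fully connected layers is under development. For now, sparse inference is supported only on ARM CPUs (for example, Qualcomm and Rockchip series).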
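Reviewer note on `docs/user_guides/sparse.md`: the `sparse_model` / `sparse_threshold` rule described in section 3.1 can be sketched as a small C++ check. This is only an illustration under stated assumptions, not the opt tool's actual implementation: `Sparsity` and `ShouldConvertToSparse` are hypothetical names, sparsity is assumed to mean the fraction of exactly-zero weights, and the doc does not say what happens when sparsity equals the threshold (the sketch leaves the matrix dense in that case).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: fraction of exactly-zero entries in a flattened
// weight matrix. An empty matrix is treated as having sparsity 0.
double Sparsity(const std::vector<float>& weights) {
  if (weights.empty()) return 0.0;
  std::size_t zeros = 0;
  for (float w : weights) {
    if (w == 0.0f) ++zeros;
  }
  return static_cast<double>(zeros) / static_cast<double>(weights.size());
}

// Hypothetical helper mirroring the documented rule:
//   sparse_model=false -> never convert
//   sparse_model=true  -> convert only if sparsity > sparse_threshold
// The equal-to-threshold case is an assumption (kept dense here).
bool ShouldConvertToSparse(const std::vector<float>& weights,
                           bool sparse_model,
                           double sparse_threshold) {
  return sparse_model && Sparsity(weights) > sparse_threshold;
}
```

With `--sparse_threshold=0.5`, a weight matrix with three zeros out of four entries would be converted under this reading, while one with a single zero out of four would stay dense.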
diff --git a/lite/api/tools/benchmark/benchmark.cc b/lite/api/tools/benchmark/benchmark.cc
@@ -90,23 +90,16 @@ void RunImpl(std::shared_ptr<PaddlePredictor> predictor,
              const int cnt,
              const bool repeat_flag) {
   lite::Timer timer;
-  bool has_validation_set = FLAGS_validation_set.empty();
-  if (!has_validation_set) {
-    timer.Start();
-    task->PreProcess(predictor, config, image_files, cnt);
-    perf_data->set_pre_process_time(timer.Stop());
-  }
-
+  timer.Start();
+  task->PreProcess(predictor, config, image_files, cnt);
+  perf_data->set_pre_process_time(timer.Stop());
   timer.Start();
   predictor->Run();
   perf_data->set_run_time(timer.Stop());
-
-  if (!has_validation_set) {
-    timer.Start();
-    task->PostProcess(
-        predictor, config, image_files, word_labels, cnt, repeat_flag);
-    perf_data->set_post_process_time(timer.Stop());
-  }
+  timer.Start();
+  task->PostProcess(
+      predictor, config, image_files, word_labels, cnt, repeat_flag);
+  perf_data->set_post_process_time(timer.Stop());
 }
 #endif
 
@@ -171,6 +164,7 @@ void Run(const std::string& model_file,
 #endif
   }
 
+  bool has_validation_set = !(FLAGS_validation_set.empty());
   // Warmup
   for (int i = 0; i < FLAGS_warmup; ++i) {
 #ifdef __ANDROID__
@@ -188,21 +182,27 @@ void Run(const std::string& model_file,
     timer.SleepInMs(FLAGS_run_delay);
   }
 
-  // Run
-  for (int i = 0; i < FLAGS_repeats; ++i) {
+  if (has_validation_set) {
+    for (int i = 0; i < FLAGS_repeats; ++i) {
 #ifdef __ANDROID__
-    RunImpl(predictor,
-            &perf_data,
-            task.get(),
-            config,
-            image_files,
-            word_labels,
-            i,
-            true);
+      RunImpl(predictor,
+              &perf_data,
+              task.get(),
+              config,
+              image_files,
+              word_labels,
+              i,
+              true);
 #else
-    RunImpl(predictor, &perf_data);
+      RunImpl(predictor, &perf_data);
 #endif
-    timer.SleepInMs(FLAGS_run_delay);
+      timer.SleepInMs(FLAGS_run_delay);
+    }
+  } else {
+    for (int i = 0; i < FLAGS_repeats; ++i) {
+      RunImpl(predictor, &perf_data);
+      timer.SleepInMs(FLAGS_run_delay);
+    }
   }
 
   // Get output
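Reviewer note on `benchmark.cc`: after this change `RunImpl` always times three stages (pre-process, run, post-process) in sequence. The pattern can be sketched in isolation with `std::chrono`. The names `TimeStageMs`, `StageTimes`, and `RunOnce` are hypothetical and stand in for `lite::Timer` and `perf_data`, whose exact interfaces are not shown in this diff.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: run one stage and return its wall-clock time in ms.
template <typename Stage>
double TimeStageMs(Stage&& stage) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Stage>(stage)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for the three timings stored in perf_data.
struct StageTimes {
  double pre_ms = 0.0;
  double run_ms = 0.0;
  double post_ms = 0.0;
};

// Time pre-process, run, and post-process back to back, as the
// reworked RunImpl does for each repeat.
template <typename Pre, typename Run, typename Post>
StageTimes RunOnce(Pre&& pre, Run&& run, Post&& post) {
  StageTimes t;
  t.pre_ms = TimeStageMs(pre);
  t.run_ms = TimeStageMs(run);
  t.post_ms = TimeStageMs(post);
  return t;
}
```

Keeping each stage behind its own timer call makes it easy to skip pre- and post-processing entirely when no validation set is given, which is what the new `else` branch in `Run` does.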
diff --git a/lite/api/tools/benchmark/benchmark.h b/lite/api/tools/benchmark/benchmark.h
@@ -358,15 +358,6 @@ const std::string OutputOptModel(const std::string& opt_model_file) {
       opt.SetValidPlaces(FLAGS_backend);
     }
   }
-
-  auto npos = opt_model_file.find(".nb");
-  std::string out_name = opt_model_file.substr(0, npos);
-#ifdef __ANDROID__
-  if (out_name.empty()) {
-    out_name = "/data/local/tmp/";
-  }
-#endif
-  opt.SetOptimizeOut(out_name);
   bool is_opt_model =
       (FLAGS_uncombined_model_dir.empty() && FLAGS_model_file.empty() &&
        FLAGS_param_file.empty() && !FLAGS_optimized_model_file.empty());
@@ -380,15 +371,23 @@ const std::string OutputOptModel(const std::string& opt_model_file) {
     return FLAGS_optimized_model_file;
   }
 
+  std::string model_dir = FLAGS_uncombined_model_dir;
   if (!FLAGS_uncombined_model_dir.empty()) {
     opt.SetModelDir(FLAGS_uncombined_model_dir);
   } else {
+    model_dir = FLAGS_model_file.substr(0, FLAGS_model_file.rfind("/"));
     opt.SetModelFile(FLAGS_model_file);
     opt.SetParamFile(FLAGS_param_file);
   }
+  auto npos = opt_model_file.find(".nb");
+  std::string out_name = opt_model_file.substr(0, npos);
+  if (out_name.empty()) {
+    out_name = model_dir + "/opt";
+  }
+  opt.SetOptimizeOut(out_name);
 
   std::string saved_opt_model_file =
-      opt_model_file.empty() ? "/data/local/tmp/.nb" : opt_model_file;
+      opt_model_file.empty() ? out_name + ".nb" : opt_model_file;
   if (paddle::lite::IsFileExists(saved_opt_model_file)) {
     int err = system(
         lite::string_format("rm -rf %s", saved_opt_model_file.c_str()).c_str());
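Reviewer note on `benchmark.h`: the new output-path logic in `OutputOptModel` can be restated as a pure function, which makes the cases easier to check. This is a sketch, not the code in the patch: `OptOutput` and `DeriveOptOutput` are hypothetical names, and the flags are passed in as plain strings. One deliberate difference is labeled in the comments: when the model file path contains no `/`, `rfind` returns `npos` and the patch's `substr(0, npos)` would yield the whole file name as the "directory"; the sketch assumes falling back to `"."` is the intended behavior.

```cpp
#include <string>

// Hypothetical result type: the name passed to SetOptimizeOut and the
// .nb file that is expected to be written.
struct OptOutput {
  std::string out_name;
  std::string saved_file;
};

// Hypothetical restatement of the path derivation in OutputOptModel.
OptOutput DeriveOptOutput(const std::string& opt_model_file,
                          const std::string& uncombined_model_dir,
                          const std::string& model_file) {
  std::string model_dir = uncombined_model_dir;
  if (model_dir.empty()) {
    const std::string::size_type slash = model_file.rfind('/');
    // Assumption: use the current directory when there is no '/'.
    model_dir = (slash == std::string::npos) ? std::string(".")
                                             : model_file.substr(0, slash);
  }
  // Strip everything from the first ".nb" onward, as the patch does.
  std::string out_name = opt_model_file.substr(0, opt_model_file.find(".nb"));
  if (out_name.empty()) {
    out_name = model_dir + "/opt";
  }
  const std::string saved_file =
      opt_model_file.empty() ? out_name + ".nb" : opt_model_file;
  return {out_name, saved_file};
}
```

For example, with no optimized-model path and a model file at `/models/mnv1/model.pdmodel`, the sketch yields `/models/mnv1/opt` and `/models/mnv1/opt.nb`, instead of the previous hard-coded `/data/local/tmp/` location that only made sense on Android.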
diff --git a/lite/core/device_info.cc b/lite/core/device_info.cc
@@ -92,6 +92,10 @@ LITE_THREAD_LOCAL int64_t DeviceInfo::count_ = 0;
 const int DEFAULT_L1_CACHE_SIZE = 64 * 1024;
 const int DEFAULT_L2_CACHE_SIZE = 2048 * 1024;
 const int DEFAULT_L3_CACHE_SIZE = 0;
+#elif defined(LITE_WITH_M1)
+const int DEFAULT_L1_CACHE_SIZE = 128 * 1024;
+const int DEFAULT_L2_CACHE_SIZE = 4096 * 1024;
+const int DEFAULT_L3_CACHE_SIZE = 0;
 #else
 const int DEFAULT_L1_CACHE_SIZE = 32 * 1024;
 const int DEFAULT_L2_CACHE_SIZE = 512 * 1024;
@@ -117,7 +121,7 @@ int get_cpu_num() {
     cpu_num = 1;
   }
   return cpu_num;
-#elif defined(TARGET_IOS)
+#elif defined(TARGET_IOS) || defined(LITE_WITH_M1)
   int cpu_num = 0;
   size_t len = sizeof(cpu_num);
   sysctlbyname("hw.ncpu", &cpu_num, &len, NULL, 0);
@@ -148,7 +152,7 @@ size_t get_mem_size() {
   }
   fclose(fp);
   return memsize;
-#elif defined(TARGET_IOS)
+#elif defined(TARGET_IOS) || defined(LITE_WITH_M1)
   // to be implemented
   printf("not implemented, set to default 4GB\n");
   return 4096 * 1024;
@@ -236,6 +240,10 @@ void get_cpu_arch(std::vector<ARMArch>* archs, const int cpu_num) {
   for (int i = 0; i < cpu_num; ++i) {
     archs->at(i) = kAPPLE;
   }
+#elif defined(LITE_WITH_M1)
+  for (int i = 0; i < cpu_num; ++i) {
+    archs->at(i) = kX1;
+  }
 #endif
 }
 
@@ -1133,6 +1141,10 @@ int DeviceInfo::Setup() {
 #else
 #ifdef TARGET_IOS
   dev_name_ = "Apple";
+#elif defined(LITE_WITH_M1)
+  dev_name_ = "M1";
+  SetDotInfo(1, 1);
+  SetFP16Info(1, 1);
 #else
   dev_name_ = "Unknown";
 #endif
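Reviewer note on `device_info.cc`: for `LITE_WITH_M1`, `get_mem_size()` still returns the fixed 4 GB default. A possible follow-up is to query `hw.memsize` through the same `sysctlbyname` call already used for `hw.ncpu`. The sketch below is an assumption about how that could look, not part of this patch: `QueryCpuCount` and `QueryMemSizeKB` are hypothetical names, the KB unit is inferred from the existing `4096 * 1024` return value, and non-Apple platforms take a fallback path so the snippet compiles anywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper: logical CPU count, never less than 1.
int QueryCpuCount() {
#if defined(__APPLE__)
  int n = 0;
  std::size_t len = sizeof(n);
  if (sysctlbyname("hw.ncpu", &n, &len, nullptr, 0) == 0 && n > 0) {
    return n;
  }
#endif
  const unsigned hc = std::thread::hardware_concurrency();
  return hc > 0 ? static_cast<int>(hc) : 1;
}

// Hypothetical helper: physical memory in KB, or fallback_kb when the
// value cannot be queried (always the case off macOS in this sketch).
std::uint64_t QueryMemSizeKB(std::uint64_t fallback_kb) {
#if defined(__APPLE__)
  std::uint64_t bytes = 0;
  std::size_t len = sizeof(bytes);
  if (sysctlbyname("hw.memsize", &bytes, &len, nullptr, 0) == 0 && bytes > 0) {
    return bytes / 1024;
  }
#endif
  return fallback_kb;
}
```

Calling `QueryMemSizeKB(4096ULL * 1024)` keeps the current 4 GB default as the fallback while reporting the real size where the query succeeds.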
diff --git a/lite/demo/cxx/test_cv/README.md b/lite/demo/cxx/test_cv/README.md
@@ -17,8 +17,17 @@ example:
 wget http://paddle-inference-dist.bj.bcebos.com/mobilenet_v1.tar.gz
 tar zxvf mobilenet_v1.tar.gz
 ./lite/tools/build.sh build_optimize_tool
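-# If building the opt tool fails on arm64 macOS, try changing the previous command to
-#"arch -x86_64 ./lite/tools/build.sh build_optimize_tool"
+# If building the opt tool fails on arm64 macOS:
+# - Method 1: delete the third-party directory, run `git checkout third-party` again, and then change the previous command to:
+# ```shell
+# arch -x86_64 ./lite/tools/build.sh build_optimize_tool
+# ```
+#  This command builds an x86 opt binary, which does not affect normal use of the tool. After a successful build, the executable opt is generated under ./build.opt/lite/api
+# - Method 2: build with the `build_macos.sh` script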
+#```shell
+#./lite/tools/build_macos.sh build_optimize_tool
+#```
+
 ./build.opt/lite/api/opt
 --optimize_out_type=naive_buffer 
 --optimize_out=model_dir 
diff --git a/lite/tools/build_macos.sh b/lite/tools/build_macos.sh
@@ -104,7 +104,6 @@ function prepare_thirdparty {
 function set_benchmark_options {
   BUILD_EXTRA=ON
   WITH_EXCEPTION=ON
-  WITH_OPENCL=ON
   LITE_ON_TINY_PUBLISH=OFF
 
   if [ ${WITH_PROFILE} == "ON" ] || [ ${WITH_PRECISION_PROFILE} == "ON" ]; then
@@ -114,9 +113,30 @@ function set_benchmark_options {
   fi
 }
 
+function build_opt {
+    cd $workspace
+    prepare_thirdparty
+    mkdir -p build.opt
+    cd build.opt
+    opt_arch=$(uname -m)  # machine hardware name, e.g. arm64 or x86_64
+    with_x86=OFF
+    if [ "$opt_arch" == "arm64" ]; then
+       with_x86=OFF
+    else
+       with_x86=ON
+    fi
+    cmake .. -DWITH_LITE=ON \
+      -DLITE_ON_MODEL_OPTIMIZE_TOOL=ON \
+      -DWITH_TESTING=OFF \
+      -DLITE_BUILD_EXTRA=ON \
+      -DLITE_WITH_X86=${with_x86} \
+      -DWITH_MKL=OFF
+    make opt -j$NUM_PROC
+}
+
 function make_armosx {
+    prepare_thirdparty
     if [ "${BUILD_PYTHON}" == "ON" ]; then
-      prepare_thirdparty
       BUILD_EXTRA=ON
       LITE_ON_TINY_PUBLISH=OFF
     fi
@@ -176,8 +196,9 @@ function make_armosx {
             -DLITE_WITH_LIGHT_WEIGHT_FRAMEWORK=ON \
             -DLITE_WITH_PRECISION_PROFILE=${WITH_PRECISION_PROFILE} \
             -DLITE_WITH_OPENMP=OFF \
-            -DWITH_ARM_DOTPROD=OFF \
+            -DWITH_ARM_DOTPROD=ON \
             -DLITE_WITH_X86=OFF \
+            -DLITE_WITH_M1=ON \
             -DLITE_WITH_PYTHON=${BUILD_PYTHON} \
             -DPY_VERSION=$PY_VERSION \
             -DLITE_WITH_LOG=$WITH_LOG \
@@ -292,7 +313,7 @@ function print_usage {
     echo -e "|                                                                                                                                      |"
     echo -e "|  for arm macos:                                                                                                                      |"
     echo -e "|  optional argument:                                                                                                                  |"
-    echo -e "|     --with_metal: (OFF|ON); controls whether to build with Metal, default is OFF                                                    |"
+    echo -e "|     --with_metal: (OFF|ON); controls whether to build with Metal, default is OFF                                                     |"
     echo -e "|     --with_cv: (OFF|ON); controls whether to compile cv functions into lib, default is OFF                                           |"
     echo -e "|     --with_log: (OFF|ON); controls whether to print log information, default is ON                                                   |"
     echo -e "|     --with_exception: (OFF|ON); controls whether to throw the exception when error occurs, default is OFF                            |"
@@ -302,6 +323,8 @@ function print_usage {
     echo -e "|     --with_arm82_fp16: (OFF|ON); controls whether to include FP16 kernels, default is OFF                                            |"
     echo -e "|                                  warning: when --with_arm82_fp16=ON, toolchain will be set as clang, arch will be set as armv8.      |"
     echo -e "|                                                                                                                                      |"
+    echo -e "|  compiling for macos OPT tool:                                                                                                       |"
+    echo -e "|     ./lite/tools/build_macos.sh build_optimize_tool                                                                                  |"
     echo -e "|  arguments of benchmark binary compiling for macos x86:                                                                              |"
     echo -e "|     ./lite/tools/build_macos.sh --with_benchmark=ON x86                                                                              |"
     echo -e "|                                                                                                                                      |"
@@ -431,6 +454,10 @@ function main {
                make_x86
                shift
                ;;
+            build_optimize_tool)
+                build_opt
+                shift
+                ;;
             help)
                 print_usage
                 exit 0