
Commit 0a94ab4

add triton service for ernie-3.0 (#2350)
* add triton service for ernie-3.0
* modify triton code
* Optimize the triton server code
* Optimize the triton server code
1 parent 56783d9 commit 0a94ab4

File tree

9 files changed: +635 −1 lines changed


model_zoo/ernie-3.0/README.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -917,7 +917,8 @@ For Python deployment, see the [Python Deployment Guide](./deploy/python/README.md)

 ### Serving Deployment

-For serving deployment, see the [Serving Deployment Guide](./deploy/serving/README.md)
+- [Triton Inference Server Deployment Guide](./deploy/triton/README.md)
+- [Paddle Serving Deployment Guide](./deploy/serving/README.md)

 <a name="Paddle2ONNX部署"></a>
```
model_zoo/ernie-3.0/deploy/triton/README.md

Lines changed: 128 additions & 0 deletions

# Serving Deployment Based on Triton Inference Server

This document describes how to use [Triton Inference Server](https://github.com/triton-inference-server/server) to deploy online pipeline services for the ERNIE 3.0 news classification and sequence labeling models.

## Table of Contents
- [Environment Setup](#environment-setup)
- [Model Acquisition and Conversion](#model-acquisition-and-conversion)
- [Deploying the Model](#deploying-the-model)

## Environment Setup
You need to [set up the PaddleNLP runtime environment](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst) as well as a Triton Server runtime environment.

### Install Triton Server
Pull the Triton Server image and start a container:
```
# Pull the image
docker pull nvcr.io/nvidia/tritonserver:21.10-py3

# Start the container
docker run -it --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash
```
The Triton version tag `21.10` can be adjusted to your needs. The Driver, CUDA, TensorRT, and ONNX Runtime backend versions corresponding to each Triton release are listed in the [official support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Pay attention to the `NVIDIA Driver` row: if your NVIDIA driver is older than the documented requirement, the server will report an error at startup.

### Enter the Container and Prepare the PaddleNLP Environment
The pre- and post-processing of the whole service depends on PaddleNLP, so the corresponding Python packages must be installed inside the container:
```
# Enter the container
docker exec -it triton_server bash

# Install PaddleNLP
python3 -m pip install paddlenlp
```

### Install the FasterTokenizers Text-Processing Acceleration Library (Optional)
If the deployment environment is Linux, installing faster_tokenizers is recommended for maximum text-processing throughput, further improving service performance. Windows is not supported yet; support will be added in the next release.
```
# Note: install inside the container
python3 -m pip install faster_tokenizers
```

## Model Acquisition and Conversion

When serving with Triton and running on the ONNX Runtime backend, the model must first be converted to ONNX format.

Download the ERNIE 3.0 news classification model (skip this step if you already have a trained model):
```bash
# Download and unzip the news classification model
wget https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/tnews_pruned_infer_model.zip
unzip tnews_pruned_infer_model.zip
```

Use Paddle2ONNX to convert the Paddle static-graph model to ONNX format with the command below. After it finishes successfully, a model.onnx file is generated in the current directory.
```bash
# Fill in the actual model path
# Convert the news classification model
paddle2onnx --model_dir tnews_pruned_infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True

# Move the converted ONNX model into the model repository directory
mv model.onnx /models/ernie_seqcls_model/1
```
The Paddle2ONNX command-line options are documented here: [Paddle2ONNX command-line options](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9)
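Before wiring the exported model into Triton, it can be worth confirming that model.onnx exposes the input and output names the Triton configs in this commit expect (`input_ids`, `token_type_ids`, `linear_113.tmp_1`). The snippet below is a minimal sketch that is not part of the commit and assumes `onnxruntime` has been installed (e.g. `python3 -m pip install onnxruntime`):
```python
# Sanity-check the exported ONNX model (assumes onnxruntime is installed).
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# Expect input_ids and token_type_ids, both INT64 with a dynamic sequence dimension.
print([(i.name, i.shape, i.type) for i in sess.get_inputs()])
# Expect linear_113.tmp_1 with 15 classes for the TNEWS task.
print([(o.name, o.shape, o.type) for o in sess.get_outputs()])
```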
After the model is downloaded and converted, the models directory is structured as follows:
```
models
├── ernie_seqcls
│   ├── 1
│   └── config.pbtxt
├── ernie_seqcls_model
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
├── ernie_seqcls_postprocess
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── ernie_tokenizer
    ├── 1
    │   └── model.py
    └── config.pbtxt
```

## Deploying the Model

The triton directory contains the configuration needed to start the pipeline service and the code for sending prediction requests, including:

```
models                   # Model repository required by Triton, containing the models and service configuration files
seq_cls_grpc_client.py   # Script that sends pipeline prediction requests for the news classification task
```

### Start the Service

Run the following command inside the container to start the service:
```
tritonserver --model-repository=/models
```
The output looks like this:
```
I0601 08:08:27.951220 8697 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5c1c000000' with size 268435456
I0601 08:08:27.953774 8697 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0601 08:08:27.958255 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_postprocess:1
I0601 08:08:28.058467 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_model:1
I0601 08:08:28.062170 8697 python.cc:1875] TRITONBACKEND_ModelInstanceInitialize: ernie_seqcls_postprocess_0 (CPU device 0)
I0601 08:08:28.158848 8697 model_repository_manager.cc:1022] loading: ernie_tokenizer:1
...
I0601 07:15:15.923270 8059 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0601 07:15:15.923604 8059 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0601 07:15:15.964984 8059 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
```
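Once the HTTP endpoint is listening, the server and the ensemble can optionally be checked for readiness before sending real requests. This is a small sketch that is not part of the commit and assumes the `tritonclient[http]` package is installed:
```python
# Optional readiness check against the HTTP endpoint started above.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")  # adjust to the server's address
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("ernie_seqcls ready:", client.is_model_ready("ernie_seqcls"))
```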
*Note:* When the service starts, each Triton Server Python backend process requests `64M` of shared memory by default, so a container started with the default docker settings cannot run multiple Python backend instances. Two workarounds:
- 1. Set the `shm-size` option when starting the container, for example: `docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash`
- 2. Set the Python backend's `shm-default-byte-size` option when starting the service, for example to use 10M as the Python backend's default memory size: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760`

#### Run the Client Test
Note: disable any proxy before sending client requests, and change the IP address in the main function to the machine where the service is running.
```
python seq_cls_grpc_client.py
```
The output looks like this:
```
{'label': array([5, 9]), 'confidence': array([0.6425664 , 0.66534853], dtype=float32)}
{'label': array([4]), 'confidence': array([0.53198355], dtype=float32)}
acc: 0.5731
```
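The committed client script is seq_cls_grpc_client.py; the snippet below is only a minimal sketch of the same request flow, assuming `tritonclient[grpc]` is installed. The model name and tensor names follow the config.pbtxt files in this commit; the example sentences are arbitrary placeholders:
```python
# Minimal gRPC client sketch (not the committed seq_cls_grpc_client.py).
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")  # adjust to the server's address

# Placeholder news texts; the committed script evaluates a labeled dataset (it reports accuracy).
texts = ["未来自动驾驶真的会普及吗", "今年的CBA总决赛谁会夺冠"]
data = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)

infer_input = grpcclient.InferInput("INPUT", list(data.shape), "BYTES")
infer_input.set_data_from_numpy(data)
outputs = [
    grpcclient.InferRequestedOutput("label"),
    grpcclient.InferRequestedOutput("confidence"),
]

result = client.infer("ernie_seqcls", inputs=[infer_input], outputs=outputs)
print({"label": result.as_numpy("label"), "confidence": result.as_numpy("confidence")})
```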
models/ernie_seqcls/config.pbtxt

Lines changed: 75 additions & 0 deletions

name: "ernie_seqcls"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "INPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "label"
    data_type: TYPE_INT64
    dims: [ 1 ]
  },
  {
    name: "confidence"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "ernie_tokenizer"
      model_version: 1
      input_map {
        key: "INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "OUTPUT_0"
        value: "tokenizer_input_ids"
      }
      output_map {
        key: "OUTPUT_1"
        value: "tokenizer_token_type_ids"
      }
    },
    {
      model_name: "ernie_seqcls_model"
      model_version: 1
      input_map {
        key: "input_ids"
        value: "tokenizer_input_ids"
      }
      input_map {
        key: "token_type_ids"
        value: "tokenizer_token_type_ids"
      }
      output_map {
        key: "linear_113.tmp_1"
        value: "OUTPUT_2"
      }
    },
    {
      model_name: "ernie_seqcls_postprocess"
      model_version: 1
      input_map {
        key: "POST_INPUT"
        value: "OUTPUT_2"
      }
      output_map {
        key: "POST_label"
        value: "label"
      }
      output_map {
        key: "POST_confidence"
        value: "confidence"
      }
    }
  ]
}
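The first ensemble step references an `ernie_tokenizer` Python backend whose model.py is not included in this excerpt. For orientation only, here is a rough sketch of what such a tokenizer backend could look like; the checkpoint name `ernie-3.0-medium-zh` and the padding logic are assumptions, not the committed code:
```python
# Hypothetical sketch of models/ernie_tokenizer/1/model.py (not the committed file).
import numpy as np
import triton_python_backend_utils as pb_utils
from paddlenlp.transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed checkpoint; the commit may load a different tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "INPUT_0").as_numpy()
            texts = [t[0].decode("utf-8") for t in raw]

            # Tokenize each text and pad the batch to a common length.
            encoded = [self.tokenizer(t) for t in texts]
            max_len = max(len(e["input_ids"]) for e in encoded)
            pad_id = self.tokenizer.pad_token_id
            input_ids = np.array(
                [e["input_ids"] + [pad_id] * (max_len - len(e["input_ids"])) for e in encoded],
                dtype=np.int64)
            token_type_ids = np.array(
                [e["token_type_ids"] + [0] * (max_len - len(e["token_type_ids"])) for e in encoded],
                dtype=np.int64)

            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT_0", input_ids),
                pb_utils.Tensor("OUTPUT_1", token_type_ids),
            ]))
        return responses
```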
models/ernie_seqcls_model/config.pbtxt

Lines changed: 36 additions & 0 deletions

platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "linear_113.tmp_1"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

optimization {
  graph: {level: -1}
}

parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
models/ernie_seqcls_postprocess/1/model.py

Lines changed: 94 additions & 0 deletions

import json
import paddle
import numpy as np
import time

# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        self.model_config = model_config = json.loads(args['model_config'])
        print("model_config:", self.model_config)

        self.input_names = []
        for input_config in self.model_config["input"]:
            self.input_names.append(input_config["name"])
        print("input:", self.input_names)

        self.output_names = []
        self.output_dtype = []
        for output_config in self.model_config["output"]:
            self.output_names.append(output_config["name"])
            dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
            self.output_dtype.append(dtype)
        print("output:", self.output_names)

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []
        # print("num:", len(requests), flush=True)
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request,
                                                     self.input_names[0])
            data = data.as_numpy()
            # print("post data:", data)
            max_value = np.max(data, axis=1, keepdims=True)
            exp_data = np.exp(data - max_value)
            probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
            probs = probs.max(axis=-1)
            # print("label:", data.argmax(axis=-1))
            # print("probs:", probs)
            out_tensor1 = pb_utils.Tensor(
                self.output_names[0], data.argmax(axis=-1))
            out_tensor2 = pb_utils.Tensor(self.output_names[1], probs)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor1, out_tensor2])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
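The postprocess model turns the 15-dimensional logits into a predicted label id and a confidence score via a numerically stable softmax. A tiny standalone check of that math (not part of the commit):
```python
# Reproduce the postprocess math on dummy logits.
import numpy as np

logits = np.array([[0.2, 1.5, -0.3],
                   [2.0, 0.1, 0.1]], dtype=np.float32)
max_value = logits.max(axis=1, keepdims=True)        # subtract the row max for stability
exp_data = np.exp(logits - max_value)
probs = exp_data / exp_data.sum(axis=1, keepdims=True)
print(logits.argmax(axis=-1), probs.max(axis=-1))    # label ids and their confidences
```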
models/ernie_seqcls_postprocess/config.pbtxt

Lines changed: 31 additions & 0 deletions

name: "ernie_seqcls_postprocess"
backend: "python"
max_batch_size: 64

input [
  {
    name: "POST_INPUT"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]

output [
  {
    name: "POST_label"
    data_type: TYPE_INT64
    dims: [ 1 ]
  },
  {
    name: "POST_confidence"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
