
Commit 0a94ab4

add triton service for ernie-3.0 (#2350)
* add triton service for ernie-3.0
* modify triton code
* Optimize the triton server code
* Optimize the triton server code
1 parent 56783d9 commit 0a94ab4

File tree

9 files changed: +635 −1 lines changed


model_zoo/ernie-3.0/README.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -917,7 +917,8 @@ For Python deployment, see the [Python Deployment Guide](./deploy/python/README.md)

 ### Serving Deployment

-For serving deployment, see the [Serving Deployment Guide](./deploy/serving/README.md)
+- [Triton Inference Server Deployment Guide](./deploy/triton/README.md)
+- [Paddle Serving Deployment Guide](./deploy/serving/README.md)

 <a name="Paddle2ONNX部署"></a>
```
model_zoo/ernie-3.0/deploy/triton/README.md

Lines changed: 128 additions & 0 deletions

# Serving Deployment Based on Triton Inference Server

This document describes how to use [Triton Inference Server](https://github.com/triton-inference-server/server) to deploy online pipeline services for the ERNIE 3.0 news classification and sequence labeling models.

## Table of Contents
- [Environment Setup](#environment-setup)
- [Model Acquisition and Conversion](#model-acquisition-and-conversion)
- [Deploying the Model](#deploying-the-model)

## Environment Setup
You need to [set up the PaddleNLP runtime environment](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst) as well as a Triton Server runtime environment.

### Install Triton Server
Pull the Triton Server image and start a container:
```
# Pull the image
docker pull nvcr.io/nvidia/tritonserver:21.10-py3

# Start the container
docker run -it --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash
```
The Triton version tag `21.10` can be adjusted to your needs. The Driver, CUDA, TensorRT, and ONNX Runtime backend versions corresponding to each Triton release are listed in the [official support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Pay attention to the `NVIDIA Driver` row: if your NVIDIA driver is older than the documented requirement, the server will report an error at startup.

### Enter the Container and Prepare the PaddleNLP Environment
The pre- and post-processing of the whole service depends on PaddleNLP, so the corresponding Python packages must be installed inside the container:
```
# Enter the container
docker exec -it triton_server bash

# Install PaddleNLP
python3 -m pip install paddlenlp
```

### Install the FasterTokenizers Text-Processing Acceleration Library (Optional)
If the deployment environment is Linux, installing faster_tokenizers is recommended for maximum text-processing throughput, further improving service performance. Windows is not supported yet; support will be added in the next release.
```
# Note: install inside the container
python3 -m pip install faster_tokenizers
```

## Model Acquisition and Conversion

When serving with Triton and running on the ONNX Runtime backend, the model must first be converted to ONNX format.

Download the ERNIE 3.0 news classification model (skip this step if you already have a trained model):
```bash
# Download and unzip the news classification model
wget https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/tnews_pruned_infer_model.zip
unzip tnews_pruned_infer_model.zip
```

Use Paddle2ONNX to convert the Paddle static-graph model to ONNX format with the command below. After it finishes successfully, a model.onnx file is generated in the current directory.
```bash
# Fill in the actual model path
# Convert the news classification model
paddle2onnx --model_dir tnews_pruned_infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True

# Move the converted ONNX model into the model repository directory
mv model.onnx /models/ernie_seqcls_model/1
```
The Paddle2ONNX command-line options are documented here: [Paddle2ONNX command-line options](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9)
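Before wiring the exported model into Triton, it can be worth confirming that model.onnx exposes the input and output names the Triton configs in this commit expect (`input_ids`, `token_type_ids`, `linear_113.tmp_1`). The snippet below is a minimal sketch that is not part of the commit and assumes `onnxruntime` has been installed (e.g. `python3 -m pip install onnxruntime`):
```python
# Sanity-check the exported ONNX model (assumes onnxruntime is installed).
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# Expect input_ids and token_type_ids, both INT64 with a dynamic sequence dimension.
print([(i.name, i.shape, i.type) for i in sess.get_inputs()])
# Expect linear_113.tmp_1 with 15 classes for the TNEWS task.
print([(o.name, o.shape, o.type) for o in sess.get_outputs()])
```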
After the model is downloaded and converted, the models directory is structured as follows:
```
models
├── ernie_seqcls
│   ├── 1
│   └── config.pbtxt
├── ernie_seqcls_model
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
├── ernie_seqcls_postprocess
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── ernie_tokenizer
    ├── 1
    │   └── model.py
    └── config.pbtxt
```

## Deploying the Model

The triton directory contains the configuration needed to start the pipeline service and the code for sending prediction requests, including:

```
models                   # Model repository required by Triton, containing the models and service configuration files
seq_cls_grpc_client.py   # Script that sends pipeline prediction requests for the news classification task
```

### Start the Service

Run the following command inside the container to start the service:
```
tritonserver --model-repository=/models
```
The output looks like this:
```
I0601 08:08:27.951220 8697 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5c1c000000' with size 268435456
I0601 08:08:27.953774 8697 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0601 08:08:27.958255 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_postprocess:1
I0601 08:08:28.058467 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_model:1
I0601 08:08:28.062170 8697 python.cc:1875] TRITONBACKEND_ModelInstanceInitialize: ernie_seqcls_postprocess_0 (CPU device 0)
I0601 08:08:28.158848 8697 model_repository_manager.cc:1022] loading: ernie_tokenizer:1
...
I0601 07:15:15.923270 8059 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0601 07:15:15.923604 8059 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0601 07:15:15.964984 8059 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
```
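Once the HTTP endpoint is listening, the server and the ensemble can optionally be checked for readiness before sending real requests. This is a small sketch that is not part of the commit and assumes the `tritonclient[http]` package is installed:
```python
# Optional readiness check against the HTTP endpoint started above.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")  # adjust to the server's address
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("ernie_seqcls ready:", client.is_model_ready("ernie_seqcls"))
```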
*Note:* When the service starts, each Triton Server Python backend process requests `64M` of shared memory by default, so a container started with the default docker settings cannot run multiple Python backend instances. Two workarounds:
- 1. Set the `shm-size` option when starting the container, for example: `docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash`
- 2. Set the Python backend's `shm-default-byte-size` option when starting the service, for example to use 10M as the Python backend's default memory size: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760`

#### Run the Client Test
Note: disable any proxy before sending client requests, and change the IP address in the main function to the machine where the service is running.
```
python seq_cls_grpc_client.py
```
The output looks like this:
```
{'label': array([5, 9]), 'confidence': array([0.6425664 , 0.66534853], dtype=float32)}
{'label': array([4]), 'confidence': array([0.53198355], dtype=float32)}
acc: 0.5731
```
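The committed client script is seq_cls_grpc_client.py; the snippet below is only a minimal sketch of the same request flow, assuming `tritonclient[grpc]` is installed. The model name and tensor names follow the config.pbtxt files in this commit; the example sentences are arbitrary placeholders:
```python
# Minimal gRPC client sketch (not the committed seq_cls_grpc_client.py).
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")  # adjust to the server's address

# Placeholder news texts; the committed script evaluates a labeled dataset (it reports accuracy).
texts = ["未来自动驾驶真的会普及吗", "今年的CBA总决赛谁会夺冠"]
data = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)

infer_input = grpcclient.InferInput("INPUT", list(data.shape), "BYTES")
infer_input.set_data_from_numpy(data)
outputs = [
    grpcclient.InferRequestedOutput("label"),
    grpcclient.InferRequestedOutput("confidence"),
]

result = client.infer("ernie_seqcls", inputs=[infer_input], outputs=outputs)
print({"label": result.as_numpy("label"), "confidence": result.as_numpy("confidence")})
```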
models/ernie_seqcls/config.pbtxt

Lines changed: 75 additions & 0 deletions

name: "ernie_seqcls"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "INPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "label"
    data_type: TYPE_INT64
    dims: [ 1 ]
  },
  {
    name: "confidence"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "ernie_tokenizer"
      model_version: 1
      input_map {
        key: "INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "OUTPUT_0"
        value: "tokenizer_input_ids"
      }
      output_map {
        key: "OUTPUT_1"
        value: "tokenizer_token_type_ids"
      }
    },
    {
      model_name: "ernie_seqcls_model"
      model_version: 1
      input_map {
        key: "input_ids"
        value: "tokenizer_input_ids"
      }
      input_map {
        key: "token_type_ids"
        value: "tokenizer_token_type_ids"
      }
      output_map {
        key: "linear_113.tmp_1"
        value: "OUTPUT_2"
      }
    },
    {
      model_name: "ernie_seqcls_postprocess"
      model_version: 1
      input_map {
        key: "POST_INPUT"
        value: "OUTPUT_2"
      }
      output_map {
        key: "POST_label"
        value: "label"
      }
      output_map {
        key: "POST_confidence"
        value: "confidence"
      }
    }
  ]
}
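The first ensemble step references an `ernie_tokenizer` Python backend whose model.py is not included in this excerpt. For orientation only, here is a rough sketch of what such a tokenizer backend could look like; the checkpoint name `ernie-3.0-medium-zh` and the padding logic are assumptions, not the committed code:
```python
# Hypothetical sketch of models/ernie_tokenizer/1/model.py (not the committed file).
import numpy as np
import triton_python_backend_utils as pb_utils
from paddlenlp.transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed checkpoint; the commit may load a different tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "INPUT_0").as_numpy()
            texts = [t[0].decode("utf-8") for t in raw]

            # Tokenize each text and pad the batch to a common length.
            encoded = [self.tokenizer(t) for t in texts]
            max_len = max(len(e["input_ids"]) for e in encoded)
            pad_id = self.tokenizer.pad_token_id
            input_ids = np.array(
                [e["input_ids"] + [pad_id] * (max_len - len(e["input_ids"])) for e in encoded],
                dtype=np.int64)
            token_type_ids = np.array(
                [e["token_type_ids"] + [0] * (max_len - len(e["token_type_ids"])) for e in encoded],
                dtype=np.int64)

            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT_0", input_ids),
                pb_utils.Tensor("OUTPUT_1", token_type_ids),
            ]))
        return responses
```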
models/ernie_seqcls_model/config.pbtxt

Lines changed: 36 additions & 0 deletions

platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "linear_113.tmp_1"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

optimization {
  graph: {level: -1}
}

parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
models/ernie_seqcls_postprocess/1/model.py

Lines changed: 94 additions & 0 deletions

import json
import paddle
import numpy as np
import time

# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        self.model_config = model_config = json.loads(args['model_config'])
        print("model_config:", self.model_config)

        self.input_names = []
        for input_config in self.model_config["input"]:
            self.input_names.append(input_config["name"])
        print("input:", self.input_names)

        self.output_names = []
        self.output_dtype = []
        for output_config in self.model_config["output"]:
            self.output_names.append(output_config["name"])
            dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
            self.output_dtype.append(dtype)
        print("output:", self.output_names)

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []
        # print("num:", len(requests), flush=True)
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request,
                                                     self.input_names[0])
            data = data.as_numpy()
            # print("post data:", data)
            max_value = np.max(data, axis=1, keepdims=True)
            exp_data = np.exp(data - max_value)
            probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
            probs = probs.max(axis=-1)
            # print("label:", data.argmax(axis=-1))
            # print("probs:", probs)
            out_tensor1 = pb_utils.Tensor(
                self.output_names[0], data.argmax(axis=-1))
            out_tensor2 = pb_utils.Tensor(self.output_names[1], probs)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor1, out_tensor2])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
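The postprocess model turns the 15-dimensional logits into a predicted label id and a confidence score via a numerically stable softmax. A tiny standalone check of that math (not part of the commit):
```python
# Reproduce the postprocess math on dummy logits.
import numpy as np

logits = np.array([[0.2, 1.5, -0.3],
                   [2.0, 0.1, 0.1]], dtype=np.float32)
max_value = logits.max(axis=1, keepdims=True)        # subtract the row max for stability
exp_data = np.exp(logits - max_value)
probs = exp_data / exp_data.sum(axis=1, keepdims=True)
print(logits.argmax(axis=-1), probs.max(axis=-1))    # label ids and their confidences
```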
models/ernie_seqcls_postprocess/config.pbtxt

Lines changed: 31 additions & 0 deletions

name: "ernie_seqcls_postprocess"
backend: "python"
max_batch_size: 64

input [
  {
    name: "POST_INPUT"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]

output [
  {
    name: "POST_label"
    data_type: TYPE_INT64
    dims: [ 1 ]
  },
  {
    name: "POST_confidence"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
