
Commit cdce912

jingxu10 and mzyczyns authored

Jingxu10/serve pr 21100 (#2327)

* Add example of Bert-Base/Large inference serving with Triton Server with IPEX backend.
* Apply requested README.md changes.
* Add additional information to README.

Co-authored-by: Mikolaj Zyczynski <[email protected]>

1 parent ff4f299 commit cdce912

File tree

12 files changed: +717 -0 lines changed
Dockerfile
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Copyright (c) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM nvcr.io/nvidia/tritonserver:23.10-py3

COPY requirements.txt requirements.txt
RUN apt-get update && \
    apt-get install --no-install-recommends -y numactl \
    google-perftools \
    python3.9 && \
    ln -s /usr/bin/python3.9 /usr/bin/python && \
    apt-get clean

RUN python3 -m pip --no-cache-dir install --upgrade pip && \
    python3 -m pip --no-cache-dir install -U -r requirements.txt

# Preload the Intel OpenMP runtime and tcmalloc (from google-perftools),
# and tune OpenMP/oneDNN settings for CPU inference.
ENV LD_PRELOAD="/usr/local/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so":${LD_PRELOAD}
ENV KMP_BLOCKTIME=1
ENV KMP_SETTINGS=1
ENV KMP_AFFINITY=granularity=fine,compact,1,0
ENV DNNL_PRIMITIVE_CACHE_CAPACITY=1024
ENV TOKENIZERS_PARALLELISM=true
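
As a quick sanity check, the image can be built directly and inspected for the tuning variables above. This is a minimal sketch, not a step from the sample's own scripts; it assumes the `ai_inference:v1` name and tag from config.properties later in this commit:

```bash
# Build the server image and confirm the performance-tuning environment
# variables are baked in (name/tag taken from config.properties below).
docker build -t ai_inference:v1 -f ./Dockerfile .
docker run --rm ai_inference:v1 env | grep -E 'LD_PRELOAD|KMP_|DNNL_'
```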
README.md
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Serving BERT models with Triton Server and Intel® Extension for PyTorch optimizations

## Description
This sample provides code that integrates Intel® Extension for PyTorch with the Triton Inference Server framework. It supplies a custom Python backend for Intel® Extension for PyTorch and an additional dynamic batching algorithm to improve performance (a condensed sketch of the batching scheme follows the figure). The code can be used as a performance benchmark for the Bert-Base and Bert-Large models.

![graph](./graph.jpg)
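
The dynamic batching idea is easiest to see in isolation. Below is a condensed restatement of the `compute_batch_set` helper from `model.py` (included in full later in this commit): a combined request batch is greedily decomposed into the batch sizes the model was precompiled for.

```python
# Condensed restatement of compute_batch_set from model.py:
# greedily split a combined batch into precompiled sub-batch sizes.
def compute_batch_set(full_batch, batches):
    batch_set, rest = [], full_batch
    for batch in sorted(batches, reverse=True):
        batch_set += [batch] * (rest // batch)
        rest %= batch
        if rest == 0:
            break
    return batch_set

print(compute_batch_set(7, [4, 2, 1]))  # -> [4, 2, 1]
```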
## Preparation
Make sure that Docker is installed on both the host and the client instance.
When running on two separate instances, edit config.properties and provide the required variables.

## Supported models
Currently, the AI Inference samples support the following BERT models fine-tuned on the SQuAD dataset:
- bert_base - PyTorch+Intel® Extension for PyTorch [Bert Base uncased](https://huggingface.co/csarron/bert-base-uncased-squad-v1 "Bert Base uncased")
- bert_large - PyTorch+Intel® Extension for PyTorch [Bert Large uncased](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad "Bert Large uncased")

## Possible run scenarios
The AI Inference samples let the user run inference on localhost or against a remote Triton Server Host.
By default, config.properties is filled in for the localhost option.

### Execution on localhost
To build and start the Docker containers, run the tests, and then stop and clean up on localhost, execute the scripts in the following order (an end-to-end sketch follows the list):

`$ bash build.sh` - builds the Docker image for the Triton Server Client and Host with the name specified in config.properties

`$ bash start.sh` - runs the Docker containers for the Triton Server Client and Host for the model specified in config.properties

`$ bash run_test.sh` - sends requests to the Triton Server Host for the model specified in config.properties. Values for sequence length, number of iterations, and run mode can be passed as arguments.

`$ sudo bash stop.sh` - stops the Docker containers for the Triton Server Client and Host, and removes temporary files.
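
Concretely, a full localhost pass looks like the sketch below, run from the sample directory. The run_test.sh arguments are optional; their exact flags are defined by the script itself and are not shown on this page:

```bash
# End-to-end localhost run: build the image, start the containers,
# benchmark the model named in config.properties, then clean up.
bash build.sh
bash start.sh
bash run_test.sh
sudo bash stop.sh
```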
### Execution on two separate instances

##### DISCLAIMER: This deployment is designed to be carried out on two distinct machines.
Make sure that the IP address of the Triton Server Host instance is provided in config.properties on the instance running the Triton Server Client.
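
For example, on the client machine the relevant line in config.properties changes from the localhost default to the host's address (the IP below is only a placeholder):

```properties
# config.properties on the Triton Server Client instance
ip_address=203.0.113.10
```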

Scripts to run on the Triton Server Host instance:

`$ bash build.sh host` - builds the Docker image for the Triton Server Host with the name specified in config.properties

`$ bash start.sh host` - runs the Docker container for the Triton Server Host for the model specified in config.properties

`$ bash stop.sh host` - (**run after inference is finished**) stops the Docker container for the Triton Server Host and removes temporary files.

Scripts to run on the Triton Server Client instance:

`$ bash build.sh client` - builds the Docker image for the Triton Server Client with the name specified in config.properties

`$ bash start.sh client` - runs the Docker container for the Triton Server Client for the model specified in config.properties

`$ bash run_test.sh` - sends requests to the remote Triton Server Host for the model specified in config.properties. Values for sequence length, number of iterations, and run mode can be passed as arguments.

`$ bash stop.sh client` - (**run after inference is finished**) stops the Docker container for the Triton Server Client.

## Additional info
Downloading and loading the models takes some time, so please wait before running the run_test.sh script.
Model loading progress can be tracked by following the Triton Server Host Docker container logs.
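
For instance, the logs can be followed as shown below; the container name is assigned by start.sh and is shown here only as a placeholder:

```bash
# Follow the Triton Server Host container logs until the model
# reports READY; the actual container name depends on start.sh.
docker logs -f <triton_host_container>
```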

## License
The AI Inference samples project is licensed under the Apache License, Version 2.0. Refer to the [LICENSE](../LICENSE) file for the full license text and copyright notice.

This distribution includes third-party software governed by separate license terms.

3-clause BSD license:
- [model.py](./model_utils/bert_common/1/model.py) - for the Intel® Extension for PyTorch optimized workload

This third-party software, even if included with the distribution of the Intel software, may be governed by separate license terms, including without limitation, third-party license terms, other Intel software license terms, and open source software license terms. These separate license terms govern your use of the third-party programs as set forth in the [THIRD-PARTY-PROGRAMS](./THIRD-PARTY-PROGRAMS) file.

## Trademark Information
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries.
* Other names and brands may be claimed as the property of others.

&copy; Intel Corporation
THIRD-PARTY-PROGRAMS
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1. model.py (triton/model_utils/bert_common/1/model.py)

Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
 * Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
 * Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
 * Neither the name of NVIDIA CORPORATION nor the names of its
   contributors may be used to endorse or promote products derived
   from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
build.sh
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
#!/bin/bash

# Copyright (c) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

source "$(pwd)"/config.properties

echo ""
echo "Building Triton Server container ..."
echo ""

docker build -t "${image_name}${image_tag}" -f ./Dockerfile .
config.properties
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# Copyright (c) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# This file contains all customizable configuration items for the project

# Container settings
image_name=ai_inference
image_tag=:v1

# This variable will be used only on the Triton Server Host:
## Provide the model name to send requests to. Choose from:
### * bert_base - PyTorch+IPEX bert_base
### * bert_large - PyTorch+IPEX bert_large
model_name=bert_base

# This variable will be used only on the Triton Server Client:
# Provide the IP address of the Triton Server Host
ip_address=localhost
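
To benchmark BERT-Large instead, only the model_name entry needs to change before start.sh is run:

```properties
model_name=bert_large
```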
graph.jpg (binary image, 39.7 KB)
model_utils/bert_common/1/model.py
Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
# Copyright (c) 2023 Intel Corporation
# SPDX-License-Identifier: BSD-3-Clause

# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import json

import torch
from torch.utils import dlpack
from transformers import BertModel, AutoConfig
import triton_python_backend_utils as pb_utils
import intel_extension_for_pytorch as ipex


def make_model(model_name, input_shape, device, bfloat16):
    print(f"{{ origin: '{model_name}', input shape: {input_shape}, enabled bfloat16: {bfloat16}}}")
    # Download the PyTorch model
    config = AutoConfig.from_pretrained(
        model_name, return_dict=False, torchscript=True, num_labels=2)
    model = BertModel.from_pretrained(model_name, config=config)
    model = model.eval()
    vocab_size = model.config.vocab_size
    data = torch.randint(vocab_size, size=input_shape)

    print('Optimizing model in IPEX:')
    try:
        model = ipex.optimize(model, level="O1", auto_kernel_selection=True,
                              conv_bn_folding=False,
                              dtype=torch.bfloat16 if bfloat16 else torch.float32)
        with torch.no_grad(), torch.cpu.amp.autocast(enabled=bfloat16):
            model = torch.jit.trace(model, data, check_trace=False, strict=False)
            model = torch.jit.freeze(model)
    except Exception as e:
        print(e)

    print('Trigger Init Model Execution')
    # Enable the fusion path (forward propagation needs to run twice)
    with torch.no_grad(), torch.cpu.amp.autocast(enabled=bfloat16):
        model(data)
        model(data)
    return model.to(device)


def compute_batch_set(full_batch, batches):
    if batches is None or len(batches) == 0:
        return [full_batch,]

    batches = sorted(batches, reverse=True)
    batch_set = []
    rest_batch = full_batch
    for batch in batches:
        batch_set += [batch] * (rest_batch // batch)
        rest_batch %= batch
        if rest_batch == 0:
            break

    return batch_set


def execute_model(models, inputs, batches, dynamic_shape, bfloat16):
    input_batches = [x.shape[0] for x in inputs]

    # Join all inputs into one torch.Tensor
    all_inputs = torch.concat(inputs, 0)

    # Split the combined inputs into a batch set
    full_batch = all_inputs.shape[0]

    if not dynamic_shape:
        batches = models.keys()

    splits = compute_batch_set(full_batch, batches)
    splitted_inputs = torch.split(all_inputs, splits)

    # Execute the model
    model_outputs = []
    with torch.no_grad(), torch.cpu.amp.autocast(enabled=bfloat16):
        for i in range(len(splits)):
            inp = splitted_inputs[i]
            out = models[0 if dynamic_shape else splits[i]](inp)[1]
            model_outputs.append(out)

    # Re-combine the results
    full_output = torch.concat(model_outputs, 0)
    outputs = torch.split(full_output, input_batches)

    return outputs


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

        # You must parse model_config. The JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        self.device = torch.device('cpu')

        # Get the INPUT0 configuration
        input0_config = pb_utils.get_input_config_by_name(
            self.model_config, "INPUT0")
        seq_length = input0_config['dims'][0]

        # Get the OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            self.model_config, "OUTPUT0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

        self.batches = []
        self.dynamic_shape = True
        self.bfloat16 = False
        parameters = self.model_config['parameters']

        if 'origin' in parameters:
            origin = parameters['origin']['string_value']
        else:
            raise pb_utils.TritonModelException("Origin model name should be defined")

        if 'batches' in parameters:
            self.batches = json.loads(parameters['batches']['string_value'])

        if 'dynamic_shape' in parameters:
            self.dynamic_shape = json.loads(parameters['dynamic_shape']['string_value'])

        if 'bfloat16' in parameters:
            self.bfloat16 = json.loads(parameters['bfloat16']['string_value'])

        self.models_cpu = dict()
        # Dynamic shapes are supported in fp32/bf16 mode for PyTorch+IPEX
        if self.dynamic_shape:
            input_shape = [1, seq_length if seq_length > 0 else 128]
            self.models_cpu[0] = make_model(origin, input_shape, self.device, self.bfloat16)

        else:
            if seq_length <= 0:
                raise pb_utils.TritonModelException("Dynamic shapes switched off but input size is not defined")

            if self.batches is None or len(self.batches) == 0:
                self.batches = [1]

            for batch in self.batches:
                input_shape = [batch, seq_length]
                self.models_cpu[batch] = make_model(origin, input_shape, self.device, self.bfloat16)

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        # Build the list of inputs as torch.Tensors
        inputs = []
        for request in requests:
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").to_dlpack()
            in_0_cpu = dlpack.from_dlpack(in_0).to(self.device)
            inputs.append(in_0_cpu)

        outputs = execute_model(self.models_cpu, inputs, self.batches, self.dynamic_shape, self.bfloat16)

        # Convert the model outputs to Triton responses
        responses = []
        for cur_bert_output in outputs:
            pooler_output = cur_bert_output.cpu().detach().numpy()
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", pooler_output.astype(self.output0_dtype))

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
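
For reference, a request against this backend can be issued with the standard tritonclient package. The sketch below is an assumption, not code from this commit: the HTTP port (8000), the INT64 token dtype, the [1, 128] shape, and the `bert_base` endpoint name depend on the model's config.pbtxt and start.sh, which are not shown on this page.

```python
# Hypothetical client-side sanity check against the served model.
# Assumes Triton's default HTTP port and an INT64 [batch, seq_len] INPUT0.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
token_ids = np.random.randint(0, 30522, size=(1, 128)).astype(np.int64)

inp = httpclient.InferInput("INPUT0", token_ids.shape, "INT64")
inp.set_data_from_numpy(token_ids)

result = client.infer("bert_base", inputs=[inp])
print(result.as_numpy("OUTPUT0").shape)  # pooler output for the batch
```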
