This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit 9930919

Merge pull request #232 from panchul/triton
Adding tutorial on KFServing InferenceService using Triton Nvidia server.
2 parents 00b715b + a695bbc commit 9930919

12 files changed, 1233 additions, 0 deletions

Research/kubeflow-on-azure-stack-lab/04-KFServing/Readme.md

Lines changed: 9 additions & 0 deletions
@@ -97,6 +97,14 @@ Please see a separate sub-page, [KFServing models saved in ONNX format](onnx.md)
 
 ---
 
+### KFServing using Triton
+
+Triton is a high-performance inferencing server from NVIDIA
+
+Please see a separate sub-page, [KFServing using Triton](triton/Readme.md)
+
+---
+
 ### KFServing model SKLearn Iris model
 
 Let us walk through a demo for SKLearn, which is similar for other ML frameworks.
@@ -219,6 +227,7 @@ See https://github.com/kubeflow/kfserving for more details.
 - [https://www.kubeflow.org/docs/components/serving/kfserving/](https://www.kubeflow.org/docs/components/serving/kfserving/)
 - [Kafka Event Source](https://github.com/knative/eventing-contrib/tree/master/kafka/source)
 - [knative client](https://github.com/knative/client)
+- https://developer.nvidia.com/nvidia-triton-inference-server
 
 ---
 
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# KFServing using Triton

[NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) is a high-performance
inference server with support for batched inferencing, and we can deploy it using KFServing.

We will be following the examples from the KFServing GitHub repository, and encourage you to look at
the other samples there if you run into any problems: https://github.com/kubeflow/kfserving/tree/master/docs/samples

We will serve BERT (Bidirectional Encoder Representations from Transformers), a model for Natural Language Processing tasks; in this demo it answers questions about a text passage.

# Pre-requisites

You need Kubeflow version 1.2 or later, which ships with KFServing version 0.4 or later.

Triton was earlier named `TensorRT` Inference Server; deployments using that older name may not be compatible with what we will be demoing.

# Preparing the environment

We have to make a couple of changes to our environment.

First, skip tag resolution for nvcr.io:

$ kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving

Then increase the progress deadline, because the BERT model we will use is large and takes a while to pull:

$ kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving

# (Optional) Extending `kfserving.KFModel`

You can define your own `preprocess`, `postprocess`, and `predict` functions and prepare your own image for the transformer.

For the sake of simplicity we will use the pre-built image `gcr.io/kubeflow-ci/kfserving/bert-transformer:latest`.

If you want to build your own, you can do the following and update the image name in the .yaml:

$ cd triton_bert_tokenizer
$ docker build -t rollingstone/bert_transformer:latest . --rm

(replacing `rollingstone` with your own Docker Hub account name or an ACR of your choosing)

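A transformer of your own is just a subclass of `kfserving.KFModel`. As a rough sketch (the class name and the pass-through method bodies below are placeholders; the real implementation used in this tutorial is `bert_transformer.py`, listed further down):

import kfserving
from typing import Dict


class MyTransformer(kfserving.KFModel):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # Host of the Triton predictor that predict() will call.
        self.predictor_host = predictor_host

    def preprocess(self, inputs: Dict) -> Dict:
        # Turn the raw request, e.g. {"instances": ["a question"]}, into the tensors the model expects.
        return inputs

    def postprocess(self, outputs: Dict) -> Dict:
        # Turn the raw model output (e.g. logits) into a human-readable response.
        return outputs

The image's entry point wraps such a class in a `kfserving.KFServer`, as the `__main__.py` listed below does.
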
# Deploying the inferenceservice for Triton

If you remember, in our environment we created a separate namespace, `kfserving-test`, for inferencing;
we will deploy the Triton InferenceService into it, like so:

$ kubectl create -f triton_bert.yaml -n kfserving-test
inferenceservice.serving.kubeflow.org/bert-large created

You will need to wait until the `inferenceservice` becomes `READY`. For troubleshooting,
you can look at the health of the pods (for example with `kubectl get pods -n kfserving-test`).

$ kubectl get inferenceservice -n kfserving-test
NAME         URL                                             READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
bert-large   http://bert-large.kfserving-test.example.com   True    100                                1m

In a few minutes you should also see the underlying revisions become ready:

$ kubectl get revision -l serving.kubeflow.org/inferenceservice=bert-large -n kfserving-test
NAME                                   CONFIG NAME                      K8S SERVICE NAME                       GENERATION   READY   REASON
bert-large-predictor-default-9jcrq     bert-large-predictor-default     bert-large-predictor-default-9jcrq     1            True
bert-large-transformer-default-hvwjq   bert-large-transformer-default   bert-large-transformer-default-hvwjq   1            True

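If you would rather check readiness from a script than with `kubectl`, here is a minimal sketch using the official `kubernetes` Python client (the client library is an addition of ours, not part of the tutorial; the API group and version match the `triton_bert.yaml` manifest listed below):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Fetch the InferenceService object created from triton_bert.yaml.
isvc = api.get_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="kfserving-test", plural="inferenceservices", name="bert-large")

# Look for a condition of type "Ready" with status "True" before sending traffic.
for cond in isvc.get("status", {}).get("conditions", []):
    print(cond.get("type"), cond.get("status"))
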
# Running inferencing

As we mentioned, BERT handles Natural Language Processing tasks; the input we give it will be a text question.

$ cat triton_input.json
{
    "instances": [
        "What President is credited with the original notion of putting Americans in space?"
    ]
}

As with previous examples, we need to have `INGRESS_HOST` and `INGRESS_PORT` defined. We used the following, but
in your environment it could be different depending on the flavor of your Kubernetes cluster and the i/o layers within it:

$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

Let us also set the following:

$ MODEL_NAME=bert-large
$ INPUT_PATH=@./triton_input.json
$ SERVICE_HOSTNAME=$(kubectl get inferenceservices -n kfserving-test bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)

You should see the service report `Alive`:

$ curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}
Alive

Now we can send web requests to the inferencing service we created:

$ curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
* Trying 12.34.56.78..
* Connected to 12.34.56.78 (12.34.56.78) port 80 (#0)
> POST /v1/models/bert-large:predict HTTP/1.1
> Host: bert-large.kfserving-test.example.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 110
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 110 out of 110 bytes
< HTTP/1.1 200 OK
< content-length: 61
< content-type: application/json; charset=UTF-8
< date: Thu, 19 Nov 2020 20:00:18 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 3814
<
* Connection #0 to host 12.34.56.78 left intact
{"predictions": "John F. Kennedy", "prob": 77.91852121017916}

So, we got JFK with 78% certainty, which is reasonable.

If we instead ask "who put Americans in space?" (updating `triton_input.json` accordingly):

$ curl -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
{"predictions": "Project Mercury", "prob": 71.40910962568026}

We get another reasonable answer. BERT is considered to produce state-of-the-art results on a wide array of NLP tasks.

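You can issue the same request from Python as well. A minimal sketch, assuming the `requests` package is installed and that the `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` values from above have been exported as environment variables:

import os
import requests

# Reuse the values exported in the shell above.
url = "http://{}:{}/v1/models/bert-large:predict".format(
    os.environ["INGRESS_HOST"], os.environ["INGRESS_PORT"])
headers = {"Host": os.environ["SERVICE_HOSTNAME"]}
payload = {"instances": ["What President is credited with the original notion of putting Americans in space?"]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())   # e.g. {"predictions": "John F. Kennedy", "prob": 77.9...}
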
# Links

- https://developer.nvidia.com/nvidia-triton-inference-server
- https://github.com/triton-inference-server/server
- https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=python
- https://gunicorn.org/
- https://github.com/kubeflow/kfserving/tree/master/docs/samples

---

[Back](../Readme.md)
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
#
# See the original for the latest version: https://github.com/kubeflow/kfserving/tree/master/docs/samples
#
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "bert-large"
spec:
  default:
    transformer:
      custom:
        container:
          name: kfserving-container
          image: gcr.io/kubeflow-ci/kfserving/bert-transformer:latest
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 1Gi
          command:
            - "python"
            - "-m"
            - "bert_transformer"
          env:
            - name: STORAGE_URI
              value: "gs://kfserving-samples/models/triton/bert-transformer"
    predictor:
      triton:
        runtimeVersion: 20.03-py3
        resources:
          limits:
            cpu: "1"
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi
        storageUri: "gs://kfserving-examples/models/triton/bert"
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
name: "bert_tf_v2_large_fp16_128_v2"
platform: "tensorflow_savedmodel"
max_batch_size: 1
input [
  {
    name: "unique_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: 128
  }
]
output [
  {
    name: "end_logits"
    data_type: TYPE_FP32
    dims: 128
  },
  {
    name: "start_logits"
    data_type: TYPE_FP32
    dims: 128
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: []
  }
]
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#
# See the original for the latest version: https://github.com/kubeflow/kfserving/tree/master/docs/samples
#
FROM python:3.7-slim
RUN apt-get update \
    && apt-get install -y wget \
    && rm -rf /var/lib/apt/lists/*
COPY bert_transformer bert_transformer/bert_transformer
COPY setup.py bert_transformer/setup.py
RUN pip install kfserving
RUN wget https://github.com/triton-inference-server/server/releases/download/v1.11.0/v1.11.0_ubuntu1604.clients.tar.gz && tar -xvzf v1.11.0_ubuntu1604.clients.tar.gz
RUN pip install python/tensorrtserver-1.11.0-py3-none-linux_x86_64.whl
WORKDIR bert_transformer
RUN pip install -e .
ENTRYPOINT ["python", "-m", "bert_transformer"]

Research/kubeflow-on-azure-stack-lab/04-KFServing/triton/triton_bert_tokenizer/bert_transformer/__init__.py

Whitespace-only changes.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Copyright 2020 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import kfserving
import argparse
from .bert_transformer import BertTransformer

DEFAULT_MODEL_NAME = "model"

parser = argparse.ArgumentParser(parents=[kfserving.kfserver.parser])
parser.add_argument('--model_name', default=DEFAULT_MODEL_NAME,
                    help='The name that the model is served under.')
parser.add_argument('--predictor_host', help='The URL for the model predict function', required=True)

args, _ = parser.parse_known_args()

if __name__ == "__main__":
    transformer = BertTransformer(args.model_name, predictor_host=args.predictor_host)
    kfserver = kfserving.KFServer()
    kfserver.start(models=[transformer])
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Copyright 2020 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import kfserving
from typing import Dict
import numpy as np
from . import tokenization
from . import data_processing
from tensorrtserver.api import *


class BertTransformer(kfserving.KFModel):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.short_paragraph_text = "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."

        self.predictor_host = predictor_host
        self.tokenizer = tokenization.FullTokenizer(vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
        self.model_name = "bert_tf_v2_large_fp16_128_v2"
        self.model_version = -1
        self.protocol = ProtocolType.from_str('http')
        self.infer_ctx = None

    def preprocess(self, inputs: Dict) -> Dict:
        # Tokenize the question in inputs["instances"][0] against the fixed paragraph above
        # and build fixed-length (128-token) input features.
        self.doc_tokens = data_processing.convert_doc_tokens(self.short_paragraph_text)
        self.features = data_processing.convert_examples_to_features(self.doc_tokens, inputs["instances"][0], self.tokenizer, 128, 128, 64)
        return self.features

    def predict(self, features: Dict) -> Dict:
        # Lazily open a connection to the Triton predictor and run the model named in config.pbtxt.
        if not self.infer_ctx:
            self.infer_ctx = InferContext(self.predictor_host, self.protocol, self.model_name, self.model_version, http_headers='', verbose=True)

        batch_size = 1
        unique_ids = np.int32([1])
        segment_ids = features["segment_ids"]
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        result = self.infer_ctx.run({'unique_ids': (unique_ids,),
                                     'segment_ids': (segment_ids,),
                                     'input_ids': (input_ids,),
                                     'input_mask': (input_mask,)},
                                    {'end_logits': InferContext.ResultFormat.RAW,
                                     'start_logits': InferContext.ResultFormat.RAW}, batch_size)
        return result

    def postprocess(self, result: Dict) -> Dict:
        end_logits = result['end_logits'][0]
        start_logits = result['start_logits'][0]
        n_best_size = 20

        # The maximum length of an answer that can be generated. This is needed
        # because the start and end predictions are not conditioned on one another
        max_answer_length = 30

        (prediction, nbest_json, scores_diff_json) = \
            data_processing.get_predictions(self.doc_tokens, self.features, start_logits, end_logits, n_best_size, max_answer_length)
        return {"predictions": prediction, "prob": nbest_json[0]['probability'] * 100.0}
