This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit 9930919

Merge pull request #232 from panchul/triton
Adding tutorial on KFServing InferenceService using Triton Nvidia server.
2 parents 00b715b + a695bbc commit 9930919

12 files changed, 1233 additions, 0 deletions

Research/kubeflow-on-azure-stack-lab/04-KFServing/Readme.md

Lines changed: 9 additions & 0 deletions
@@ -97,6 +97,14 @@ Please see a separate sub-page, [KFServing models saved in ONNX format](onnx.md)
 
 ---
 
+### KFServing using Triton
+
+Triton is a high-performance inferencing server from NVIDIA
+
+Please see a separate sub-page, [KFServing using Triton](triton/Readme.md)
+
+---
+
 ### KFServing model SKLearn Iris model
 
 Let us walk through a demo for SKLearn, which is similar for other ML frameworks.
@@ -219,6 +227,7 @@ See https://github.com/kubeflow/kfserving for more details.
 - [https://www.kubeflow.org/docs/components/serving/kfserving/](https://www.kubeflow.org/docs/components/serving/kfserving/)
 - [Kafka Event Source](https://github.com/knative/eventing-contrib/tree/master/kafka/source)
 - [knative client](https://github.com/knative/client)
+- https://developer.nvidia.com/nvidia-triton-inference-server
 
 ---
 
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# KFServing using Triton

[NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) is a high-performance
inference server with support for batched inferencing, and we can deploy it using KFServing.

We will be following the examples from the KFServing GitHub repository, and encourage you to look at
the other samples there if you run into any problems: https://github.com/kubeflow/kfserving/tree/master/docs/samples

We will serve BERT (Bidirectional Encoder Representations from Transformers), a model for Natural Language Processing tasks; in this demo it answers questions about a text passage.

# Pre-requisites

You need Kubeflow version 1.2 or later, which ships with KFServing version 0.4 or later.

Triton was earlier named `TensorRT` Inference Server; deployments using that older name may not be compatible with what we will be demoing.

# Preparing the environment

We have to make a couple of changes to our environment.

First, skip tag resolution for nvcr.io:

$ kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving

Then increase the progress deadline, because the BERT model we will use is large and takes a while to pull:

$ kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving

# (Optional) Extending `kfserving.KFModel`

You can define your own `preprocess`, `postprocess`, and `predict` functions and prepare your own image for the transformer.

For the sake of simplicity we will use the pre-built image `gcr.io/kubeflow-ci/kfserving/bert-transformer:latest`.

If you want to build your own, you can do the following and update the image name in the .yaml:

$ cd triton_bert_tokenizer
$ docker build -t rollingstone/bert_transformer:latest . --rm

(replacing `rollingstone` with your own Docker Hub account name or an ACR of your choosing)

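A transformer of your own is just a subclass of `kfserving.KFModel`. As a rough sketch (the class name and the pass-through method bodies below are placeholders; the real implementation used in this tutorial is `bert_transformer.py`, listed further down):

import kfserving
from typing import Dict


class MyTransformer(kfserving.KFModel):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # Host of the Triton predictor that predict() will call.
        self.predictor_host = predictor_host

    def preprocess(self, inputs: Dict) -> Dict:
        # Turn the raw request, e.g. {"instances": ["a question"]}, into the tensors the model expects.
        return inputs

    def postprocess(self, outputs: Dict) -> Dict:
        # Turn the raw model output (e.g. logits) into a human-readable response.
        return outputs

The image's entry point wraps such a class in a `kfserving.KFServer`, as the `__main__.py` listed below does.
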
# Deploying the inferenceservice for Triton

If you remember, in our environment we created a separate namespace, `kfserving-test`, for inferencing;
we will deploy the Triton InferenceService into it, like so:

$ kubectl create -f triton_bert.yaml -n kfserving-test
inferenceservice.serving.kubeflow.org/bert-large created

You will need to wait until the `inferenceservice` becomes `READY`. For troubleshooting,
you can look at the health of the pods (for example with `kubectl get pods -n kfserving-test`).

$ kubectl get inferenceservice -n kfserving-test
NAME         URL                                             READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
bert-large   http://bert-large.kfserving-test.example.com   True    100                                1m

In a few minutes you should also see the underlying revisions become ready:

$ kubectl get revision -l serving.kubeflow.org/inferenceservice=bert-large -n kfserving-test
NAME                                   CONFIG NAME                      K8S SERVICE NAME                       GENERATION   READY   REASON
bert-large-predictor-default-9jcrq     bert-large-predictor-default     bert-large-predictor-default-9jcrq     1            True
bert-large-transformer-default-hvwjq   bert-large-transformer-default   bert-large-transformer-default-hvwjq   1            True

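If you would rather check readiness from a script than with `kubectl`, here is a minimal sketch using the official `kubernetes` Python client (the client library is an addition of ours, not part of the tutorial; the API group and version match the `triton_bert.yaml` manifest listed below):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Fetch the InferenceService object created from triton_bert.yaml.
isvc = api.get_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="kfserving-test", plural="inferenceservices", name="bert-large")

# Look for a condition of type "Ready" with status "True" before sending traffic.
for cond in isvc.get("status", {}).get("conditions", []):
    print(cond.get("type"), cond.get("status"))
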
# Running inferencing

As we mentioned, BERT handles Natural Language Processing tasks; the input we give it will be a text question.

$ cat triton_input.json
{
    "instances": [
        "What President is credited with the original notion of putting Americans in space?"
    ]
}

As with previous examples, we need to have `INGRESS_HOST` and `INGRESS_PORT` defined. We used the following, but
in your environment it could be different depending on the flavor of your Kubernetes cluster and the i/o layers within it:

$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

Let us also set the following:

$ MODEL_NAME=bert-large
$ INPUT_PATH=@./triton_input.json
$ SERVICE_HOSTNAME=$(kubectl get inferenceservices -n kfserving-test bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)

You should see the service report `Alive`:

$ curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}
Alive

Now we can send web requests to the inferencing service we created:

$ curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
* Trying 12.34.56.78..
* Connected to 12.34.56.78 (12.34.56.78) port 80 (#0)
> POST /v1/models/bert-large:predict HTTP/1.1
> Host: bert-large.kfserving-test.example.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 110
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 110 out of 110 bytes
< HTTP/1.1 200 OK
< content-length: 61
< content-type: application/json; charset=UTF-8
< date: Thu, 19 Nov 2020 20:00:18 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 3814
<
* Connection #0 to host 12.34.56.78 left intact
{"predictions": "John F. Kennedy", "prob": 77.91852121017916}

So, we got JFK with 78% certainty, which is reasonable.

If we instead ask "who put Americans in space?" (updating `triton_input.json` accordingly):

$ curl -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
{"predictions": "Project Mercury", "prob": 71.40910962568026}

We get another reasonable answer. BERT is considered to produce state-of-the-art results on a wide array of NLP tasks.

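You can issue the same request from Python as well. A minimal sketch, assuming the `requests` package is installed and that the `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` values from above have been exported as environment variables:

import os
import requests

# Reuse the values exported in the shell above.
url = "http://{}:{}/v1/models/bert-large:predict".format(
    os.environ["INGRESS_HOST"], os.environ["INGRESS_PORT"])
headers = {"Host": os.environ["SERVICE_HOSTNAME"]}
payload = {"instances": ["What President is credited with the original notion of putting Americans in space?"]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())   # e.g. {"predictions": "John F. Kennedy", "prob": 77.9...}
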
# Links

- https://developer.nvidia.com/nvidia-triton-inference-server
- https://github.com/triton-inference-server/server
- https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=python
- https://gunicorn.org/
- https://github.com/kubeflow/kfserving/tree/master/docs/samples

---

[Back](../Readme.md)
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
#
# See the original for the latest version: https://github.com/kubeflow/kfserving/tree/master/docs/samples
#
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "bert-large"
spec:
  default:
    transformer:
      custom:
        container:
          name: kfserving-container
          image: gcr.io/kubeflow-ci/kfserving/bert-transformer:latest
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 1Gi
          command:
            - "python"
            - "-m"
            - "bert_transformer"
          env:
            - name: STORAGE_URI
              value: "gs://kfserving-samples/models/triton/bert-transformer"
    predictor:
      triton:
        runtimeVersion: 20.03-py3
        resources:
          limits:
            cpu: "1"
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi
        storageUri: "gs://kfserving-examples/models/triton/bert"
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
name: "bert_tf_v2_large_fp16_128_v2"
platform: "tensorflow_savedmodel"
max_batch_size: 1
input [
  {
    name: "unique_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: 128
  }
]
output [
  {
    name: "end_logits"
    data_type: TYPE_FP32
    dims: 128
  },
  {
    name: "start_logits"
    data_type: TYPE_FP32
    dims: 128
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: []
  }
]
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#
# See the original for the latest version: https://github.com/kubeflow/kfserving/tree/master/docs/samples
#
FROM python:3.7-slim
RUN apt-get update \
    && apt-get install -y wget \
    && rm -rf /var/lib/apt/lists/*
COPY bert_transformer bert_transformer/bert_transformer
COPY setup.py bert_transformer/setup.py
RUN pip install kfserving
RUN wget https://github.com/triton-inference-server/server/releases/download/v1.11.0/v1.11.0_ubuntu1604.clients.tar.gz && tar -xvzf v1.11.0_ubuntu1604.clients.tar.gz
RUN pip install python/tensorrtserver-1.11.0-py3-none-linux_x86_64.whl
WORKDIR bert_transformer
RUN pip install -e .
ENTRYPOINT ["python", "-m", "bert_transformer"]

Research/kubeflow-on-azure-stack-lab/04-KFServing/triton/triton_bert_tokenizer/bert_transformer/__init__.py

Whitespace-only changes.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Copyright 2020 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import kfserving
import argparse
from .bert_transformer import BertTransformer

DEFAULT_MODEL_NAME = "model"

parser = argparse.ArgumentParser(parents=[kfserving.kfserver.parser])
parser.add_argument('--model_name', default=DEFAULT_MODEL_NAME,
                    help='The name that the model is served under.')
parser.add_argument('--predictor_host', help='The URL for the model predict function', required=True)

args, _ = parser.parse_known_args()

if __name__ == "__main__":
    transformer = BertTransformer(args.model_name, predictor_host=args.predictor_host)
    kfserver = kfserving.KFServer()
    kfserver.start(models=[transformer])
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Copyright 2020 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import kfserving
from typing import Dict
import numpy as np
from . import tokenization
from . import data_processing
from tensorrtserver.api import *


class BertTransformer(kfserving.KFModel):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.short_paragraph_text = "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."

        self.predictor_host = predictor_host
        self.tokenizer = tokenization.FullTokenizer(vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
        self.model_name = "bert_tf_v2_large_fp16_128_v2"
        self.model_version = -1
        self.protocol = ProtocolType.from_str('http')
        self.infer_ctx = None

    def preprocess(self, inputs: Dict) -> Dict:
        # Tokenize the question in inputs["instances"][0] against the fixed paragraph above
        # and build fixed-length (128-token) input features.
        self.doc_tokens = data_processing.convert_doc_tokens(self.short_paragraph_text)
        self.features = data_processing.convert_examples_to_features(self.doc_tokens, inputs["instances"][0], self.tokenizer, 128, 128, 64)
        return self.features

    def predict(self, features: Dict) -> Dict:
        # Lazily open a connection to the Triton predictor and run the model named in config.pbtxt.
        if not self.infer_ctx:
            self.infer_ctx = InferContext(self.predictor_host, self.protocol, self.model_name, self.model_version, http_headers='', verbose=True)

        batch_size = 1
        unique_ids = np.int32([1])
        segment_ids = features["segment_ids"]
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        result = self.infer_ctx.run({'unique_ids': (unique_ids,),
                                     'segment_ids': (segment_ids,),
                                     'input_ids': (input_ids,),
                                     'input_mask': (input_mask,)},
                                    {'end_logits': InferContext.ResultFormat.RAW,
                                     'start_logits': InferContext.ResultFormat.RAW}, batch_size)
        return result

    def postprocess(self, result: Dict) -> Dict:
        end_logits = result['end_logits'][0]
        start_logits = result['start_logits'][0]
        n_best_size = 20

        # The maximum length of an answer that can be generated. This is needed
        # because the start and end predictions are not conditioned on one another
        max_answer_length = 30

        (prediction, nbest_json, scores_diff_json) = \
            data_processing.get_predictions(self.doc_tokens, self.features, start_logits, end_logits, n_best_size, max_answer_length)
        return {"predictions": prediction, "prob": nbest_json[0]['probability'] * 100.0}
