# KFServing using Triton

[NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) is a high-performance,
batch-capable inference server that we can deploy using KFServing.

We will be following examples from the KFServing GitHub repository, and encourage you to look at
the other samples if you face any problems: https://github.com/kubeflow/kfserving/tree/master/docs/samples

BERT (Bidirectional Encoder Representations from Transformers) is a model for Natural Language Processing tasks; here we will use it for question answering.

# Pre-requisites

You need to have Kubeflow version 1.2 or later, which should include KFServing version 0.4 or later.

`Triton` was previously named `TensorRT Inference Server`; those older releases may not be compatible with what we will be demoing.

# Preparing the environment

We have to make some changes to our environment.

We need to skip tag resolution for nvcr.io:

    $ kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving

And increase the timeout for image pulling, because the BERT model we will use is large:

    $ kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving

# (Optional) Extending `kfserving.KFModel`

You can define your own `preprocess`, `postprocess`, and `predict` functions and build your own image for the transformer, as sketched below.
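
A minimal sketch of such a transformer, assuming the KFServing 0.4 Python SDK (`kfserving` package); the class name,
the argument handling, and the pass-through `preprocess` below are illustrative rather than the exact contents of the
pre-built image:

    # bert_transformer.py -- an illustrative sketch, not the exact code in the pre-built image.
    import argparse

    import kfserving


    class BertTransformer(kfserving.KFModel):
        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            # KFModel forwards predict() calls to this host, which KFServing
            # points at the Triton predictor of the InferenceService.
            self.predictor_host = predictor_host

        def preprocess(self, request: dict) -> dict:
            # The real transformer tokenizes each question with a BERT tokenizer here;
            # this sketch simply passes the text through unchanged.
            return request

        def postprocess(self, response: dict) -> dict:
            # Map Triton's raw output back into a human-readable answer if needed.
            return response


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--model_name", default="bert-large")
        parser.add_argument("--predictor_host", required=True)
        args, _ = parser.parse_known_args()

        transformer = BertTransformer(args.model_name, predictor_host=args.predictor_host)
        kfserving.KFServer().start(models=[transformer])

The transformer runs as its own container alongside the predictor; KFServing passes `--predictor_host` to it so that
preprocessed requests are forwarded to Triton.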

For the sake of simplicity we will use the pre-built image `gcr.io/kubeflow-ci/kfserving/bert-transformer:latest`.

If you want to build your own, you can do the following and update the image name in the `.yaml`:

    $ cd triton_bert_tokenizer
    $ docker build -t rollingstone/bert_transformer:latest . --rm

(replacing `rollingstone` with your own Docker Hub account name, or use a container registry such as ACR of your choosing)

# Deploying the InferenceService for Triton

If you remember, in our environment we created a separate namespace, `kfserving-test`, for inferencing.
We will deploy the Triton InferenceService into it, like so:

    $ kubectl create -f triton_bert.yaml -n kfserving-test
    inferenceservice.serving.kubeflow.org/bert-large created

You will need to wait until the `inferenceservice` becomes `READY`. For troubleshooting,
you can look at the health of the pods.

    $ kubectl get inferenceservice -n kfserving-test
    NAME         URL                                             READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
    bert-large   http://bert-large.kfserving-test.example.com   True    100                                1m

In a few minutes you should see something like this:

    $ kubectl get revision -l serving.kubeflow.org/inferenceservice=bert-large -n kfserving-test
    NAME                                   CONFIG NAME                      K8S SERVICE NAME                       GENERATION   READY   REASON
    bert-large-predictor-default-9jcrq     bert-large-predictor-default     bert-large-predictor-default-9jcrq     1            True
    bert-large-transformer-default-hvwjq   bert-large-transformer-default   bert-large-transformer-default-hvwjq   1            True

# Running inferencing

If you remember, BERT works on NLP tasks. The input we will give it will be a text question.

    $ cat triton_input.json
    {
      "instances": [
        "What President is credited with the original notion of putting Americans in space?"
      ]
    }

As with previous examples, we need to have the `INGRESS_HOST` and `INGRESS_PORT` defined. We used the following, but
in your environment it could be different depending on the flavor of your Kubernetes cluster and the networking layers within it:

    $ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    $ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

And let us set the following:

    $ MODEL_NAME=bert-large
    $ INPUT_PATH=@./triton_input.json
    $ SERVICE_HOSTNAME=$(kubectl get inferenceservices -n kfserving-test bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)

You should see that the service is `Alive`:

    $ curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}
    Alive

And now we can send web requests to the inferencing service we created:

    $ curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
    *   Trying 12.34.56.78...
    * Connected to 12.34.56.78 (12.34.56.78) port 80 (#0)
    > POST /v1/models/bert-large:predict HTTP/1.1
    > Host: bert-large.kfserving-test.example.com
    > User-Agent: curl/7.47.0
    > Accept: */*
    > Content-Length: 110
    > Content-Type: application/x-www-form-urlencoded
    >
    * upload completely sent off: 110 out of 110 bytes
    < HTTP/1.1 200 OK
    < content-length: 61
    < content-type: application/json; charset=UTF-8
    < date: Thu, 19 Nov 2020 20:00:18 GMT
    < server: istio-envoy
    < x-envoy-upstream-service-time: 3814
    <
    * Connection #0 to host 12.34.56.78 left intact
    {"predictions": "John F. Kennedy", "prob": 77.91852121017916}

So, we got JFK with 78% certainty, which is reasonable.

If we change the question in `triton_input.json` to "Who put Americans in space?" and send the request again:

    $ curl -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
    {"predictions": "Project Mercury", "prob": 71.40910962568026}

We get another reasonable answer. BERT is considered to produce state-of-the-art results on a wide array of NLP tasks.
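
If you would rather call the service from code than from `curl`, here is a minimal Python sketch using the `requests`
library; the ingress address, service hostname, and input file below are the ones we assumed above:

    # query_bert.py -- a sketch of calling the InferenceService from Python,
    # assuming the same ingress address and hostname used in the curl examples.
    import json

    import requests

    INGRESS = "http://12.34.56.78:80"  # ${INGRESS_HOST}:${INGRESS_PORT}
    SERVICE_HOSTNAME = "bert-large.kfserving-test.example.com"
    MODEL_NAME = "bert-large"

    with open("triton_input.json") as f:
        payload = json.load(f)

    # The Host header routes the request through the Istio ingress gateway to our
    # InferenceService, just like the -H "Host: ..." flag in the curl examples.
    response = requests.post(
        f"{INGRESS}/v1/models/{MODEL_NAME}:predict",
        headers={"Host": SERVICE_HOSTNAME},
        json=payload,
    )
    print(response.json())  # e.g. {"predictions": "John F. Kennedy", "prob": 77.9...}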

# Links

- https://developer.nvidia.com/nvidia-triton-inference-server
- https://github.com/triton-inference-server/server
- https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=python
- https://gunicorn.org/
- https://github.com/kubeflow/kfserving/tree/master/docs/samples

---

[Back](../Readme.md)