This document demonstrates running the FasterTransformer HuggingFace BERT example with TorchServe in a Kubernetes setup.
Refer: FasterTransformer_HuggingFace_Bert
Once the cluster and the PVCs are ready, we can generate the MAR file.

- Follow the steps from here to generate the MAR file, then copy it out of the container
```bash
docker cp <container-id>:/workspace/serve/examples/FasterTransformer_HuggingFace_Bert/BERTSeqClassification.mar ./BERTSeqClassification.mar
```

- Create a `config.properties` file with the following contents

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
NUM_WORKERS=1
number_of_gpu=1
install_py_dep_per_model=true
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":2,"maxWorkers":3,"batchSize":1,"maxBatchDelay":100,"responseTimeout":120}}}}
```

- Copy the MAR file and `config.properties` to the persistent volume

```bash
kubectl exec --tty pod/model-store-pod -- mkdir /pv/model-store/
kubectl cp BERTSeqClassification.mar model-store-pod:/pv/model-store/BERTSeqClassification.mar
kubectl exec --tty pod/model-store-pod -- mkdir /pv/config/
kubectl cp config.properties model-store-pod:/pv/config/config.properties
```
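The `model_snapshot` entry in `config.properties` above is embedded JSON, and a malformed snapshot is a common reason for TorchServe failing to start. A quick stdlib-only sanity check (a sketch, not part of the official tooling):

```python
import json

# The model_snapshot value from config.properties above.
snapshot = ('{"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":'
            '{"defaultVersion":true,"marName":"BERTSeqClassification.mar",'
            '"minWorkers":2,"maxWorkers":3,"batchSize":1,"maxBatchDelay":100,'
            '"responseTimeout":120}}}}')

cfg = json.loads(snapshot)                      # raises ValueError if malformed
assert cfg["modelCount"] == len(cfg["models"])  # declared count matches models
bert = cfg["models"]["bert"]["1.0"]
assert bert["minWorkers"] <= bert["maxWorkers"]
print(bert["marName"])                          # → BERTSeqClassification.mar
```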
- Clone the TorchServe repo

```bash
git clone https://github.com/pytorch/serve.git
cd serve/docker
```

- Modify the Python and pip paths in the `Dockerfile` as below

```bash
sed -i 's#/usr/bin/python3#/opt/conda/bin/python3#g' Dockerfile
sed -i 's#/usr/local/bin/pip3#/opt/conda/bin/pip3#g' Dockerfile
```

- Change the GPU check in the `Dockerfile` for the nvcr.io image

```bash
sed -i 's#grep -q "cuda:"#grep -q "nvidia:"#g' Dockerfile
```

- Add `transformers==2.5.1` to the `Dockerfile`

```bash
sed -i 's#pip install --no-cache-dir captum torchtext torchserve torch-model-archiver#& transformers==2.5.1#g' Dockerfile
```
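In the sed command above, `&` in the replacement stands for the whole matched text, so the pin is appended to the existing `pip install` line rather than replacing it. An illustrative Python equivalent of that substitution, where `\g<0>` plays the role of sed's `&`:

```python
import re

line = "pip install --no-cache-dir captum torchtext torchserve torch-model-archiver"
# \g<0> in the replacement expands to the entire match, like sed's `&`.
patched = re.sub(re.escape(line), r"\g<0> transformers==2.5.1", line)
print(patched)
# pip install --no-cache-dir captum torchtext torchserve torch-model-archiver transformers==2.5.1
```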
- Build the image

```bash
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t <image-name> --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:20.12-py3 --build-arg CUDA_VERSION=cu102 .
```

- Push the image

```bash
docker push <image-name>
```

- Navigate to the Kubernetes TorchServe Helm package folder

```bash
cd ../kubernetes/Helm
```
- Modify `values.yaml` with the image and memory settings

```yaml
torchserve_image: <image built in previous step>

namespace: torchserve

torchserve:
  management_port: 8081
  inference_port: 8080
  metrics_port: 8082
  pvd_mount: /home/model-server/shared/
  n_gpu: 1
  n_cpu: 4
  memory_limit: 32Gi
  memory_request: 32Gi

deployment:
  replicas: 1

persitant_volume:
  size: 1Gi
```
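`memory_limit` and `memory_request` above are Kubernetes resource quantities. A small sketch (helper name and suffix table are mine, covering only the binary suffixes) for converting them to bytes, e.g. to verify that the request does not exceed the limit:

```python
# Binary suffixes used by Kubernetes resource quantities (subset).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def quantity_to_bytes(q: str) -> int:
    """Convert a quantity like '32Gi' to bytes; plain integers pass through."""
    for suffix, factor in _SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

limit, request = "32Gi", "32Gi"  # values from values.yaml above
assert quantity_to_bytes(request) <= quantity_to_bytes(limit)
```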
- Install TorchServe

```bash
helm install torchserve .
```
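Once the pods are up, TorchServe also exposes a `GET /ping` health endpoint on the inference port (8080 in `config.properties` above). A stdlib-only sketch for calling it from inside the pod or through a port-forward (function names are mine):

```python
import json
import urllib.request

def ping_url(host="127.0.0.1", port=8080):
    # Inference port as set in config.properties above.
    return f"http://{host}:{port}/ping"

def ping(host="127.0.0.1", port=8080):
    """Return TorchServe's health status string, e.g. 'Healthy'."""
    with urllib.request.urlopen(ping_url(host, port), timeout=5) as resp:
        return json.loads(resp.read())["status"]

# From a shell session in the pod:
# print(ping())
```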
- Check the TorchServe installation

```bash
kubectl get pods -n default
kubectl logs <pod-name> -n default
```

- Start a shell session into the TorchServe pod

```bash
kubectl exec -it <pod-name> -- bash
```

- Create an input file `sample_text_captum_input.txt`

```json
{
    "text": "Bloomberg has decided to publish a new report on the global economy.",
    "target": 1
}
```

- Run inference

```bash
curl -X POST http://127.0.0.1:8080/predictions/bert -T ../Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt
```
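The same request can be issued from Python; a stdlib-only sketch mirroring the curl call above (function names are mine; `bert` is the model name registered via the snapshot config):

```python
import urllib.request

def prediction_url(model="bert", host="127.0.0.1", port=8080):
    # Inference address/port as set in config.properties above.
    return f"http://{host}:{port}/predictions/{model}"

def predict(input_path, model="bert", host="127.0.0.1", port=8080):
    """POST the raw file body, mirroring `curl -X POST ... -T <file>`."""
    with open(input_path, "rb") as f:
        req = urllib.request.Request(prediction_url(model, host, port),
                                     data=f.read(), method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode()

# predict("../Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt")
```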