diff --git a/modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc b/modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
index eb4391e60..7b3de6c74 100644
--- a/modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
+++ b/modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
@@ -4,16 +4,19 @@
 
 = Adding a tested and verified model-serving runtime for the single-model serving platform
 
-In addition to preinstalled and custom model-serving runtimes, you can also use {org-name} tested and verified model-serving runtimes such as the *NVIDIA Triton Inference Server* to support your needs. For more information about {org-name} tested and verified runtimes, see link:https://access.redhat.com/articles/7089743[Tested and verified runtimes for {productname-long}^].
+In addition to preinstalled and custom model-serving runtimes, you can also use {org-name} tested and verified model-serving runtimes to support your needs. For more information about {org-name} tested and verified runtimes, see link:https://access.redhat.com/articles/7089743[Tested and verified runtimes for {productname-long}^].
 
-You can use the {productname-long} dashboard to add and enable the *NVIDIA Triton Inference Server* or the *Seldon MLServer* runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.
+You can use the {productname-long} dashboard to add and enable tested and verified runtimes for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.
 
 [role='_abstract']
 
 .Prerequisites
+
 * You have logged in to {productname-short} as a user with {productname-short} administrator privileges.
+* If you are deploying the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, you have access to the IBM Cloud Container Registry to pull the container image. For more information about obtaining credentials for the IBM Cloud Container Registry, see link:https://github.com/IBM/ibmz-accelerated-for-nvidia-triton-inference-server?tab=readme-ov-file#container[Inference Server container image^].
 
 .Procedure
+
 . From the {productname-short} dashboard, click *Settings* -> *Serving runtimes*.
 +
 The *Serving runtimes* page opens and shows the model-serving runtimes that are already installed and enabled.
@@ -26,8 +29,133 @@ The *Serving runtimes* page opens and shows the model-serving runtimes that are
 
 . Click *Start from scratch*.
 
-. Follow these steps to add the *NVIDIA Triton Inference Server* runtime:
+. Follow these steps to add the *IBM Z Accelerated for NVIDIA Triton Inference Server* runtime:
+.. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
++
+[source]
+----
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: ibmz-triton-rest
+  labels:
+    opendatahub.io/dashboard: "true"
+spec:
+  annotations:
+    prometheus.kserve.io/path: /metrics
+    prometheus.kserve.io/port: "8002"
+  containers:
+    - name: kserve-container
+      command:
+        - /bin/sh
+        - -c
+      args:
+        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
+      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:xxx
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop:
+            - ALL
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
+      resources:
+        limits:
+          cpu: "2"
+          memory: 4Gi
+        requests:
+          cpu: "2"
+          memory: 4Gi
+      ports:
+        - containerPort: 8000
+          protocol: TCP
+  protocolVersions:
+    - v2
+    - grpc-v2
+  supportedModelFormats:
+    - name: python
+      version: "1"
+      autoSelect: true
+    - name: onnx-mlir
+      version: "1"
+      autoSelect: true
+    - name: snapml
+      version: "1"
+      autoSelect: true
+    - name: pytorch
+      version: "1"
+      autoSelect: true
+----
+.. If you selected the *gRPC* API protocol, enter or paste the following YAML code directly in the embedded editor.
++
+[source]
+----
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: ibmz-triton-grpc
+  labels:
+    opendatahub.io/dashboard: "true"
+spec:
+  annotations:
+    prometheus.kserve.io/path: /metrics
+    prometheus.kserve.io/port: "8002"
+  containers:
+    - name: kserve-container
+      command:
+        - /bin/sh
+        - -c
+      args:
+        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
+      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:xxx
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop:
+            - ALL
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
+      resources:
+        limits:
+          cpu: "2"
+          memory: 4Gi
+        requests:
+          cpu: "2"
+          memory: 4Gi
+      ports:
+        - containerPort: 8001
+          name: grpc
+          protocol: TCP
+      volumeMounts:
+        - mountPath: /dev/shm
+          name: shm
+  protocolVersions:
+    - v2
+    - grpc-v2
+  supportedModelFormats:
+    - name: python
+      version: "1"
+      autoSelect: true
+    - name: onnx-mlir
+      version: "1"
+      autoSelect: true
+    - name: snapml
+      version: "1"
+      autoSelect: true
+    - name: pytorch
+      version: "1"
+      autoSelect: true
+  volumes:
+    - name: shm
+      emptyDir:
+        medium: Memory
+        sizeLimit: 2Gi
+----
+
+. Follow these steps to add the *NVIDIA Triton Inference Server* runtime:
 .. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
 +
 [source]
@@ -162,6 +290,7 @@ volumes:
     sizeLimit: 2Gi
     name: shm
 ----
+
 . Follow these steps to add the *Seldon MLServer* runtime:
 .. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
 +
@@ -360,6 +489,7 @@ The *Serving runtimes* page opens and shows the updated list of runtimes that ar
 . Optional: To edit the runtime, click the action menu (⋮) and select *Edit*.
 
 .Verification
+
 * The model-serving runtime that you added is shown in an enabled state on the *Serving runtimes* page.
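
If you prefer to cross-check the new runtime from the command line, the following is a minimal sketch, not part of the documented procedure. It assumes that the dashboard stores the runtime you added as an OpenShift `Template` object in the {productname-short} applications namespace (shown here as `redhat-ods-applications`; adjust this name for your installation), and that `<project>` is a placeholder for a data science project in which a model has already been deployed with the runtime.

[source,terminal]
----
$ oc get templates -l opendatahub.io/dashboard=true -n redhat-ods-applications
$ oc get servingruntimes -n <project>
----

The first command lists the runtime templates that the dashboard manages; the second lists the `ServingRuntime` objects that are created in a project when a model is deployed with the runtime.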
 [role='_additional-resources']
diff --git a/modules/customizable-model-serving-runtime-parameters.adoc b/modules/customizable-model-serving-runtime-parameters.adoc
index d575588e2..c4a6ff29e 100644
--- a/modules/customizable-model-serving-runtime-parameters.adoc
+++ b/modules/customizable-model-serving-runtime-parameters.adoc
@@ -16,6 +16,7 @@ For more information about parameters for each of the supported serving runtimes
 link:https://github.com/opendatahub-io/caikit-nlp?tab=readme-ov-file#configuration[Caikit NLP: Configuration] +
 link:https://github.com/IBM/text-generation-inference?tab=readme-ov-file#model-configuration[TGIS: Model configuration]
 | Caikit Standalone ServingRuntime for KServe | link:https://github.com/opendatahub-io/caikit-nlp?tab=readme-ov-file#configuration[Caikit NLP: Configuration]
+| IBM Z Accelerated for NVIDIA Triton Inference Server | link:https://ibm.github.io/ai-on-z-101/tritonis/[Triton Inference Server for Linux on Z environments]
 | NVIDIA Triton Inference Server | link:https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/docs/model_config.html?#model-configuration[NVIDIA Triton Inference Server: Model Parameters]
 |OpenVINO Model Server | link:https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_docs_dynamic_input.html[OpenVINO Model Server Features: Dynamic Input Parameters]
 | Seldon MLServer | link:https://mlserver.readthedocs.io/en/stable/reference/model-settings.html[MLServer Documentation: Model Settings]
diff --git a/modules/ref-tested-verified-runtimes.adoc b/modules/ref-tested-verified-runtimes.adoc
index ce09f5761..ce9691ee5 100644
--- a/modules/ref-tested-verified-runtimes.adoc
+++ b/modules/ref-tested-verified-runtimes.adoc
@@ -23,6 +23,7 @@ endif::[]
 |===
 | Name | Description | Exported model format
+| IBM Z Accelerated for NVIDIA Triton Inference Server | An open-source AI inference server that standardizes model deployment and execution, delivering streamlined, high-performance inference at scale. | Python, ONNX-MLIR, Snap ML (C++), PyTorch
 | NVIDIA Triton Inference Server | An open-source inference-serving software for fast and scalable AI in applications. | TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more
 | Seldon MLServer | An open-source inference server designed to simplify the deployment of machine learning models. | Scikit-Learn (sklearn), XGBoost, LightGBM, CatBoost, HuggingFace and MLflow
@@ -33,6 +34,7 @@ endif::[]
 |===
 | Name | Default protocol | Additional protocol | Model mesh support | Single node OpenShift support | Deployment mode
+| IBM Z Accelerated for NVIDIA Triton Inference Server | gRPC | REST | No | Yes | Raw
 | NVIDIA Triton Inference Server | gRPC | REST | Yes | Yes | Raw and serverless
 | Seldon MLServer | gRPC | REST | No | Yes | Raw and serverless
@@ -54,3 +56,4 @@
 ifndef::upstream[]
 * link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#inference-endpoints_serving-large-models[Inference endpoints]
 endif::[]
+* link:https://github.com/IBM/ibmz-accelerated-for-nvidia-triton-inference-server?tab=readme-ov-file#rest-apis-[Using the IBM Z Accelerated for NVIDIA Triton Inference Server Container Image]
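
The REST API reference linked in the last addition describes the KServe v2 endpoints that the IBM Z Accelerated for NVIDIA Triton Inference Server exposes. As a rough illustration of how a deployed runtime can be probed over REST, the following sketch sends a readiness request and a model-metadata request; `<inference-endpoint>` and `<model-name>` are placeholders for your deployment's external route and model name, and the `-k` flag is only needed on clusters that use self-signed certificates.

[source,terminal]
----
$ curl -k https://<inference-endpoint>/v2/health/ready
$ curl -k https://<inference-endpoint>/v2/models/<model-name>
----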