RHOAIENG-28660: Updates to T&V runtimes for IBM Z Triton #879

Open · wants to merge 2 commits into main

modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
@@ -4,16 +4,19 @@

= Adding a tested and verified model-serving runtime for the single-model serving platform

In addition to preinstalled and custom model-serving runtimes, you can also use {org-name} tested and verified model-serving runtimes such as the *NVIDIA Triton Inference Server* to support your needs. For more information about {org-name} tested and verified runtimes, see link:https://access.redhat.com/articles/7089743[Tested and verified runtimes for {productname-long}^].
In addition to preinstalled and custom model-serving runtimes, you can also use {org-name} tested and verified model-serving runtimes to support your needs. For more information about {org-name} tested and verified runtimes, see link:https://access.redhat.com/articles/7089743[Tested and verified runtimes for {productname-long}^].

You can use the {productname-long} dashboard to add and enable the *NVIDIA Triton Inference Server* or the *Seldon MLServer* runtime for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.
You can use the {productname-long} dashboard to add and enable tested and verified runtimes for the single-model serving platform. You can then choose the runtime when you deploy a model on the single-model serving platform.

[role='_abstract']

.Prerequisites

* You have logged in to {productname-short} as a user with {productname-short} administrator privileges.
* If you are deploying the IBM Z Accelerated for NVIDIA Triton Inference Server runtime, you have access to the IBM Cloud Container Registry to pull the container image, for example by creating a pull secret as sketched below. For more information about obtaining credentials for the IBM Cloud Container Registry, see link:https://github.com/IBM/ibmz-accelerated-for-nvidia-triton-inference-server?tab=readme-ov-file#container[Inference Server container image^].
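+
The following pull secret is not part of the original procedure; it is one possible way to store the registry credentials in your project. The secret name, namespace, and API key placeholder are assumptions for illustration only; confirm the login method against the IBM documentation linked above.
+
[source]
----
apiVersion: v1
kind: Secret
metadata:
  name: ibm-cr-pull-secret        # assumption: choose any name
  namespace: my-project           # assumption: your data science project
type: kubernetes.io/dockerconfigjson
stringData:
  # "iamapikey" with an IBM Cloud API key is one common login method; verify before use.
  .dockerconfigjson: |
    {"auths":{"icr.io":{"username":"iamapikey","password":"<your-IBM-Cloud-API-key>"}}}
----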

.Procedure

. From the {productname-short} dashboard, click *Settings* -> *Serving runtimes*.
+
The *Serving runtimes* page opens and shows the model-serving runtimes that are already installed and enabled.
@@ -26,8 +29,133 @@ The *Serving runtimes* page opens and shows the model-serving runtimes that are

. Click *Start from scratch*.

. Follow these steps to add the *NVIDIA Triton Inference Server* runtime:
. Follow these steps to add the *IBM Z Accelerated for NVIDIA Triton Inference Server* runtime:
.. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
+
[source]
----
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-rest
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:xxx
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "5"
          memory: 4Gi
Comment on lines +64 to +69

⚠️ Potential issue

requests.cpu greater than limits.cpu – scheduling will fail.

Kubernetes requires requests <= limits.
Swap the values (or raise limits) to avoid spec.containers[].resources.limits.cpu: Invalid value.

-  limits:
-    cpu: "2"
+  limits:
+    cpu: "5"
   requests:
-    cpu: "5"
+    cpu: "2"

Apply to both REST (lines 65-70) and gRPC (124-129) variants.

Also applies to: 124-129

🤖 Prompt for AI Agents
In
modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
at lines 65-70 and 124-129, the cpu resource requests value is greater than the
limits value, which causes Kubernetes scheduling to fail. Fix this by ensuring
that the cpu requests value is less than or equal to the cpu limits value,
either by swapping the values or increasing the limits to be at least as large
as the requests.

      ports:
        - containerPort: 8000
          protocol: TCP
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: python
      version: "1"
      autoSelect: true
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
----
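+
The following InferenceService sketch is not part of the original procedure; it only illustrates how a deployed model selects this runtime. The name, namespace, model format, and storage URI are assumptions. When you deploy a model from the {productname-short} dashboard with this runtime selected, an equivalent resource is typically generated for you.
+
[source]
----
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-onnx-mlir-model          # assumption: any model name
  namespace: my-project             # assumption: your data science project
spec:
  predictor:
    model:
      modelFormat:
        name: onnx-mlir             # must match a supportedModelFormats entry above
      runtime: ibmz-triton-rest     # the ServingRuntime defined in this step
      storageUri: s3://my-bucket/models/my-onnx-mlir-model   # assumption: your model location
----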

.. If you selected the *gRPC* API protocol, enter or paste the following YAML code directly in the embedded editor.
+
[source]
----
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ibmz-triton-grpc
  labels:
    opendatahub.io/dashboard: "true"
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - name: kserve-container
      command:
        - /bin/sh
        - -c
      args:
        - /opt/tritonserver/bin/tritonserver --model-repository=/mnt/models --grpc-port=8001 --http-port=8000 --metrics-port=8002
      image: icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:xxx
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "5"
          memory: 4Gi
      ports:
        - containerPort: 8001
          name: grpc
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: python
      version: "1"
      autoSelect: true
    - name: onnx-mlir
      version: "1"
      autoSelect: true
    - name: snapml
      version: "1"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
  volumes:
    - emptyDir: null
      medium: Memory
      sizeLimit: 2Gi
      name: shm
Comment on lines +152 to +155

🛠️ Refactor suggestion

Volume definition malformed – medium and sizeLimit must nest under emptyDir.

Current structure:

- emptyDir: null
  medium: Memory
  sizeLimit: 2Gi
  name: shm

Valid structure:

- emptyDir: null
- medium: Memory
- sizeLimit: 2Gi
+emptyDir:
+  medium: Memory
+  sizeLimit: 2Gi

Without this fix the manifest fails to parse.

🤖 Prompt for AI Agents
In
modules/adding-a-tested-and-verified-runtime-for-the-single-model-serving-platform.adoc
around lines 154 to 157, the volume definition is malformed because the keys
'medium' and 'sizeLimit' are not nested under 'emptyDir'. To fix this,
restructure the YAML so that 'emptyDir' is an object containing 'medium' and
'sizeLimit' keys, instead of setting 'emptyDir' to null and placing those keys
at the same level. This will ensure the manifest parses correctly.

----
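+
Because the icr.io image requires credentials, the runtime pods must be able to pull it. One possible approach, not part of the original procedure, is to reference the pull secret from the prerequisites in the ServingRuntime spec. Recent KServe versions accept an imagePullSecrets list under spec; verify the field against the KServe version in your {productname-short} release.
+
[source]
----
# Hypothetical fragment: merge under spec: in either ServingRuntime above.
spec:
  imagePullSecrets:
    - name: ibm-cr-pull-secret   # assumption: the pull secret created in the prerequisites
----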

. Follow these steps to add the *NVIDIA Triton Inference Server* runtime:
.. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
+
[source]
@@ -162,6 +290,7 @@ volumes:
sizeLimit: 2Gi
name: shm
----

. Follow these steps to add the *Seldon MLServer* runtime:
.. If you selected the *REST* API protocol, enter or paste the following YAML code directly in the embedded editor.
+
@@ -360,6 +489,7 @@ The *Serving runtimes* page opens and shows the updated list of runtimes that ar
. Optional: To edit the runtime, click the action menu (&#8942;) and select *Edit*.

.Verification

* The model-serving runtime that you added is shown in an enabled state on the *Serving runtimes* page.

[role='_additional-resources']
1 change: 1 addition & 0 deletions modules/customizable-model-serving-runtime-parameters.adoc
@@ -16,6 +16,7 @@ For more information about parameters for each of the supported serving runtimes
link:https://github.com/opendatahub-io/caikit-nlp?tab=readme-ov-file#configuration[Caikit NLP: Configuration] +
link:https://github.com/IBM/text-generation-inference?tab=readme-ov-file#model-configuration[TGIS: Model configuration]
| Caikit Standalone ServingRuntime for KServe | link:https://github.com/opendatahub-io/caikit-nlp?tab=readme-ov-file#configuration[Caikit NLP: Configuration]
| IBM Z Accelerated for NVIDIA Triton Inference Server | link:https://ibm.github.io/ai-on-z-101/tritonis/[Triton Inference Server for Linux on Z environments]
| NVIDIA Triton Inference Server | link:https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/docs/model_config.html?#model-configuration[NVIDIA Triton Inference Server: Model Parameters]
|OpenVINO Model Server | link:https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_docs_dynamic_input.html[OpenVINO Model Server Features: Dynamic Input Parameters]
| Seldon MLServer | link:https://mlserver.readthedocs.io/en/stable/reference/model-settings.html[MLServer Documentation: Model Settings]
3 changes: 3 additions & 0 deletions modules/ref-tested-verified-runtimes.adoc
@@ -23,6 +23,7 @@ endif::[]
|===
| Name | Description | Exported model format

| IBM Z Accelerated for NVIDIA Triton Inference Server | An open-source AI inference server that standardizes model deployment and execution, delivering streamlined, high-performance inference at scale. | Python, ONNX-MLIR, Snap ML (C++), PyTorch
| NVIDIA Triton Inference Server | An open-source inference-serving software for fast and scalable AI in applications. | TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more
| Seldon MLServer | An open-source inference server designed to simplify the deployment of machine learning models. | Scikit-Learn (sklearn), XGBoost, LightGBM, CatBoost, HuggingFace and MLflow

@@ -33,6 +34,7 @@ endif::[]
|===
| Name | Default protocol | Additional protocol | Model mesh support | Single node OpenShift support | Deployment mode

| IBM Z Accelerated for NVIDIA Triton Inference Server | gRPC | REST | No | Yes | Raw
| NVIDIA Triton Inference Server | gRPC | REST | Yes | Yes | Raw and serverless
| Seldon MLServer | gRPC | REST | No | Yes | Raw and serverless

@@ -54,3 +56,4 @@
* link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#inference-endpoints_serving-large-models[Inference endpoints]
endif::[]

* link:https://github.com/IBM/ibmz-accelerated-for-nvidia-triton-inference-server?tab=readme-ov-file#rest-apis-[Using the IBM Z Accelerated for NVIDIA Triton Inference Server Container Image]