Installing RHOAI 2.25 or greater on OpenShift and adding LM‐Eval
Refer to the Red Hat docs here for more detail:
Table of Contents
- Prerequisites
- Install the Red Hat OpenShift AI Operator
- Monitor DSCI
- Install the Red Hat OpenShift AI components via DSC
- Check that everything is running
- TBD - Kueue setup
- Online LM-Eval Job
- Offline testing with unitxt
- Cleanup
- Dashboard
0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.16.17.)
0.2 Logged into the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.
0.3 Also logged in on the terminal with oc login, for example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
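A quick sanity check that the GPU prerequisites are in place (assuming the NVIDIA GPU Operator is installed in its default nvidia-gpu-operator namespace; <your-gpu-node> is a placeholder):
oc get pods -n nvidia-gpu-operator
oc describe node <your-gpu-node> | grep nvidia.com/gpu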
1.1 Create a namespace and OperatorGroup:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.2 Install the Service Mesh operator
Note: if you are installing in production, you probably want installPlanApproval: Manual so that you're not surprised by operator updates before you've had a chance to verify them on a dev/stage server first.
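If you do go with Manual, a pending InstallPlan can be approved later, roughly like this (the install plan name is a placeholder you'd get from the first command):
oc get installplan -n openshift-operators
oc patch installplan <install-xxxxx> -n openshift-operators --type merge -p '{"spec":{"approved":true}}'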
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
and make sure it works:
watch oc get pods,csv -n openshift-operators
and it should look something like this:
NAME READY STATUS RESTARTS AGE
istio-operator-6c99f6bf7b-rrh2j 1/1 Running 0 13m
1.3 Install the serverless operator
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And then check it with:
watch oc get pods,csv -n openshift-serverless
Note: there might be at least two more prerequisites for RHOAI 2.19+.
Authorino:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And kiali-ossm
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kiali-ossm.openshift-operators: ""
  name: kiali-ossm
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kiali-ossm
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
1.4 Create a subscription (Recommend changing installPlanApproval to Manual in production)
Note: If you want the Stable RHOAI 2.16.x version instead, change the channel below from fast to stable
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And watch that it starts:
watch oc get pods,csv -n redhat-ods-operator
Watch the DSCI until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME AGE PHASE CREATED AT
default-dsci 16m Ready 2024-07-02T19:56:18Z
Then create the DataScienceCluster (DSC) to install the RHOAI components:
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      eval:
        lmeval:
          permitCodeExecution: allow
          permitOnline: allow
      managementState: Managed
EOF
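You can also watch the DSC itself reach the Ready phase, the same way you watched the DSCI above:
watch oc get dsc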
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME READY STATUS RESTARTS AGE
rhods-operator-7c54d9d6b5-j97mv 1/1 Running 0 22h
4.2 Check that the service mesh operator is running:
oc get pods -n openshift-operators
Will return:
NAME READY STATUS RESTARTS AGE
istio-cni-node-v2-5-9qkw7 1/1 Running 0 84s
istio-cni-node-v2-5-dbtz5 1/1 Running 0 84s
istio-cni-node-v2-5-drc9l 1/1 Running 0 84s
istio-cni-node-v2-5-k4x4t 1/1 Running 0 84s
istio-cni-node-v2-5-pbltn 1/1 Running 0 84s
istio-cni-node-v2-5-xbmz5 1/1 Running 0 84s
istio-operator-6c99f6bf7b-4ckdx 1/1 Running 1 (2m39s ago) 2m56s
4.3 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-7784c9878b-4fkv9 1/1 Running 0 51s
kubeflow-training-operator-cb487d469-s78ch 1/1 Running 0 2m11s
kueue-controller-manager-5fb585c7c4-zpdcj 1/1 Running 0 4m21s
odh-model-controller-7b57f4b9d8-ztrgx 1/1 Running 0 5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z 1/1 Running 0 2m16s
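(The Kueue setup section from the table of contents is still TBD. For reference only, the Kueue objects that the cleanup section below deletes, such as cpu-flavor, cq-small, and lq-trainer, would look roughly like this sketch; the quotas are made-up placeholders, not a tested configuration.)
cat << EOF | oc apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: cpu-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-small
spec:
  namespaceSelector: {}   # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: cpu-flavor
          resources:
            - name: cpu
              nominalQuota: 8
            - name: memory
              nominalQuota: 32Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: test
spec:
  clusterQueue: cq-small
EOF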
See this Getting started with LM-Eval article for the latest info
6.1 Create a test namespace. For now, all of these jobs need to run in a namespace other than default.
oc create namespace test
oc project test
6.2. Submit a sample LM-Eval job
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        #template: "templates.classification.multi_class.relation.default"
        template:
          name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
And once it pulls the image and runs for about 5 minutes it should look like this:
oc get pods,lmevaljobs -n test
NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                    STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue    Running
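Once the STATE shows Complete, the results are stored on the LMEvalJob itself; one way to pull them out (assuming jq is installed):
oc get lmevaljob online-lmeval-glue -n test -o template --template={{.status.results}} | jq '.results'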
To clean it up, do this:
oc delete lmevaljob online-lmeval-glue -n test
Another test you can try is "online-unitxt":
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
To clean it up, do this:
oc delete lmevaljob online-unitxt -n test
7.0.1 Create a test namespace
oc create namespace test
oc project test
7.0.2 Create a PVC to hold the offline models and datasets:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
7.0.3 Check the pvc
oc get pvc -n test
And it should look something like this:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
lmeval-data Bound pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae 20Gi RWO portworx-watson-assistant-sc 29s
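If you can't pull that file from GitHub, a roughly equivalent PVC looks like the sketch below (the referenced pvc.yaml is the authoritative version, and the storage class in the output above is specific to my cluster):
cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lmeval-data
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF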
7.1 ARC Easy testing
7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml
7.1.2 Check for when it's complete
watch oc get pods -n test
7.1.3 Delete the lmeval-downloader pod once it's complete:
oc delete pod lmeval-downloader -n test
7.1.4 Apply the YAML for ARC Easy
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
It should start up a pod
watch oc get pods -n test
and it'll look like this:
offline-lmeval-arceasy-test 0/1 Completed 0 14m
7.1.5 Even though the pod has completed, it doesn't release the PVC, so the next test might fail if this first pod still exists and the next downloader pod lands on a different worker node (the PVC is ReadWriteOnce). Therefore, delete the first offline job once you're sure it has completed.
oc delete lmevaljob offline-lmeval-arceasy-test -n test
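If you want to confirm the volume has actually detached before starting the next downloader, one way (a sketch) is to check the VolumeAttachment for the PVC's underlying PV:
PV=$(oc get pvc lmeval-data -n test -o jsonpath='{.spec.volumeName}')
oc get volumeattachment | grep "$PV"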
7.2 Testing UNITXT (this takes a while)
7.2.1 Using the same pvc as above, apply the unitxt loader
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml
7.2.2 Check for when it's complete
watch oc get pods -n test
7.2.3 Delete the lmeval-downloader pod
oc delete pod lmeval-downloader -n test
7.2.4 Apply the unitxt YAML
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
And the pod should start up:
watch oc get pods -n test
And it'll look like this eventually when it's done:
offline-lmeval-unitxt-test 0/1 Completed 0 14m
8.1 Cleanup of your lmevaljob(s), for example
oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob offline-lmeval-arceasy-test offline-lmeval-unitxt-test -n test
8.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
8.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
8.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
8.5 Cleanup of the Operators (if you want that)
oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete sub authorino-operator -n openshift-operators
oc delete sub kiali-ossm -n openshift-operators
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv servicemeshoperator.v2.6.11 -n openshift-operators
oc delete csv serverless-operator.v1.36.1 -n openshift-serverless
oc delete csv authorino-operator.v1.2.3 kiali-operator.v2.11.4 -n redhat-ods-applications-auth-provider
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete csv rhods-operator.2.25.0 -n redhat-ods-operator
8.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator
Note, I'm getting info about this from: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/monitoring_data_science_models/evaluating-large-language-models_monitor#performing-model-evaluations-in-the-dashboard_monitor
- oc edit dsc and change the dashboard to Managed like this:
oc edit dsc default-dsc
and change the dashboard to Managed:
    dashboard:
      managementState: Managed
Also change these components to Managed:
    codeflare:
    datasciencepipelines:
    ray:
    trainingoperator:
    workbenches:
    kueue:
    modelmeshserving:
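Equivalently, you can patch the DSC instead of editing it interactively; for example, for the dashboard:
oc patch dsc default-dsc --type merge -p '{"spec":{"components":{"dashboard":{"managementState":"Managed"}}}}'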
- edit the dashboard cr
oc edit odhdashboardconfig odh-dashboard-config -n redhat-ods-applications
and add this line in the dashboardConfig spec:
disableLMEval: false
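If you prefer a one-liner, and assuming the field sits under spec.dashboardConfig as described above, something like this should be equivalent:
oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{"spec":{"dashboardConfig":{"disableLMEval":false}}}'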
- Note: you might have to restart the dashboard pods after this; I'm not sure.
oc get pods -n redhat-ods-applications
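If a restart does turn out to be needed, one way is to restart the dashboard deployment (assuming it's named rhods-dashboard, matching the route below):
oc rollout restart deployment/rhods-dashboard -n redhat-ods-applications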
- Get the route for the dashboard:
oc get route -n redhat-ods-applications
and then put this host URL into the browser, for example:
http://rhods-dashboard-redhat-ods-applications.apps.jim416.cp.fyre.ibm.com
- Click on Models --> Model evaluation runs
- Install trustyaiservice in your project. I'm going to do it in "test".
- I've done all the steps to get RHOAI and the dashboard up.
- I'm watching this video: Guide to Deploying AI Models on Red Hat OpenShift AI
https://github.com/IsaiahStapleton/rhoai-model-deployment-guide
2.1 oc login to the server on the CLI
2.2 On the server
mkdir JIM ; cd JIM
git clone https://github.com/IsaiahStapleton/rhoai-model-deployment-guide.git
cd rhoai-model-deployment-guide/
oc new-project test ; oc project test
2.3 Apply minio
oc apply -f minio-setup.yaml
and check it:
oc get pods,routes
- Download a model
3.1 Go to Hugging Face and find a good Granite model, for example:
https://huggingface.co/ibm-granite/granite-4.0-h-micro/tree/main
3.2 First you need to log in to Hugging Face and generate an access token:
1. Navigate to settings -> access tokens
2. Select create new token
3. For token type, select Read and then give it a name
4. Copy the token
- Download the model on your laptop
cd /Users/jamesbusche/projects/LM-EVAL
git clone https://jbusche:[email protected]/ibm-granite/granite-4.0-h-micro
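If you'd rather not embed the token in the clone URL, the Hugging Face CLI is an alternative (a sketch, assuming you install huggingface_hub on your laptop and paste the read token when prompted):
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download ibm-granite/granite-4.0-h-micro --local-dir granite-4.0-h-micro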
- Upload the model to the S3 Minio storage
5.1 Log in to the MinIO UI. Get the route with:
oc get route -n test |grep minio-ui
minio-ui minio-ui-test.apps.jim4162.cp.fyre.ibm.com minio-service ui edge/Redirect None
Put the output in the browser, for example: https://minio-ui-test.apps.jim4162.cp.fyre.ibm.com/ (user = minio, password = minio123)
5.2 Create a bucket. The video suggests naming it "models".
5.3 Click on Upload Folder and pick the folder on your laptop where the Hugging Face model was downloaded. In my case the folder is: /Users/jamesbusche/projects/LM-EVAL/granite-4.0-h-micro
- Deploying the model on RHOAI
6.1 Log in to your RHOAI dashboard. Get the route with:
oc get route -n redhat-ods-applications
https://rhods-dashboard-redhat-ods-applications.apps.jim4162.cp.fyre.ibm.com
and log in with your cluster credentials, for example kubeadmin / PTUiG-k6chu-GCFrS-mS8XL
6.2 Within your data science project (test), navigate to Connections and select Create connection
6.3 Fill in the following values:
get the endpoint from:
oc get route -n test |grep minio-api
minio-api minio-api-test.apps.jim4162.cp.fyre.ibm.com minio-service api edge/Redirect None
Connection type: S3 compatible object storage - v1
Connection name: My Data Connection
Access key: minio
Secret key: minio123
Endpoint: https://minio-api-test.apps.jim4162.cp.fyre.ibm.com
Region: (leaving blank)
Bucket: models
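For reference, the dashboard stores a data connection as a Secret in the project. A roughly equivalent object created from the CLI might look like this sketch; the label and annotation names here are my assumption of what the dashboard expects, so prefer the UI if in doubt:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: my-data-connection
  namespace: test
  labels:
    opendatahub.io/dashboard: "true"
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: My Data Connection
stringData:
  AWS_ACCESS_KEY_ID: minio
  AWS_SECRET_ACCESS_KEY: minio123
  AWS_S3_ENDPOINT: https://minio-api-test.apps.jim4162.cp.fyre.ibm.com
  AWS_DEFAULT_REGION: ""
  AWS_S3_BUCKET: models
EOF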
7. Deploy your model
7.1 Navigate to models under your project to deploy a model
• Model deployment name: demo-granite
• Serving runtime: vLLM ServingRuntime for KServe
• Model server size: You can select whatever size you wish; for this guide I will keep the Small size
• Accelerator: Select NVIDIA GPU
• Model route: Select the check box for "Make deployed models available through an external route". This enables us to send requests to the model endpoint from outside the cluster.
• Token authentication: Select the check box for "Require token authentication". This makes sending requests to the model endpoint require a token, which is important for security. You can leave the service account name as default-name.
• Source model location: Select the data connection that you set up above (My Data Connection). Then give it the path to your model (granite-4.0-h-micro).
7.2 Hit Deploy
8. I can see that it is starting…
oc get pods
NAME READY STATUS RESTARTS AGE
demo-granite-predictor-668b5b7c47-bkx76 0/2 Init:Error 2 (31s ago) 74s
minio-6fbc45498-cj8th 1/1 Running 0 20h
I'm getting this:
oc logs -f demo-granite-predictor-668b5b7c47-bkx76 -c storage-initializer
botocore.exceptions.SSLError: SSL validation failed for https://minio-api-test.apps.jim4162.cp.fyre.ibm.com/models?prefix=granite-4.0-h-micro&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)
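This looks like the KServe storage initializer rejecting MinIO's self-signed route certificate. One workaround I've seen (not verified here) is to disable TLS verification for the S3 storage via the serving.kserve.io annotations on the data connection secret and then redeploy the model; the secret name below is a placeholder for whatever the dashboard created for your connection:
oc annotate secret <your-data-connection-secret> -n test serving.kserve.io/s3-verifyssl="0" --overwrite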