
Installing RHOAI 2.25 or greater on OpenShift and adding LM-Eval


Refer to the Red Hat docs here for more detail:

https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.25/pdf/monitoring_data_science_models/Red_Hat_OpenShift_AI_Self-Managed-2.25-Monitoring_data_science_models-en-US.pdf


0. Prerequisites

0.1 OpenShift Cluster up and running. (I've been using OpenShift 4.16.17)

0.2 Logged onto the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.

0.3 Also logged into the terminal with oc login, for example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443

Note: If you have a GPU cluster:

0.4 Also need GPU prereqs from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
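Once the Node Feature Discovery and NVIDIA GPU Operator are installed per those docs, a quick sanity check (this assumes the default nvidia-gpu-operator namespace and the standard GPU feature-discovery node labels):

oc get pods -n nvidia-gpu-operator
oc get nodes -l nvidia.com/gpu.present=true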

1. Install the Red Hat OpenShift AI Operator

1.1 Create a namespace and OperatorGroup:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator 
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF

1.2 Install the Service Mesh operator

Note: if you are installing in production, you probably want installPlanApproval: Manual so that you're not surprised by operator updates before you've had a chance to verify them on a dev/stage server first.
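If you do use Manual, each new version waits for approval. A minimal sketch of reviewing and then approving a pending InstallPlan (the InstallPlan name below is a placeholder; substitute the one oc get shows):

oc get installplan -n openshift-operators
oc patch installplan install-xxxxx -n openshift-operators --type merge -p '{"spec":{"approved":true}}'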

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

and make sure it works:

watch oc get pods,csv -n openshift-operators

and it should look something like this:

NAME                              READY   STATUS    RESTARTS   AGE
istio-operator-6c99f6bf7b-rrh2j   1/1     Running   0          13m

1.3 Install the serverless operator

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable 
  name: serverless-operator 
  source: redhat-operators 
  sourceNamespace: openshift-marketplace 
EOF

And then check it with:

watch oc get pods,csv -n openshift-serverless

Note: there might be at least two more prerequisites for RHOAI 2.19+.

Authorino:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And kiali-ossm

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kiali-ossm.openshift-operators: ""
  name: kiali-ossm
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kiali-ossm
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

1.4 Create a subscription (Recommend changing installPlanApproval to Manual in production)

Note: If you want the stable RHOAI release instead, change the channel below from fast to stable
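Not sure which channels your catalog offers? You can list them from the PackageManifest (this assumes the default redhat-operators catalog):

oc get packagemanifest rhods-operator -n openshift-marketplace -o jsonpath='{.status.channels[*].name}'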

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator 
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic 
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And watch that it starts:

watch oc get pods,csv -n redhat-ods-operator

2. Monitor DSCI

Watch the dsci until it's complete:

watch oc get dsci

and it'll finish up like this:

NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z

3. Install the Red Hat OpenShift AI components via DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving 
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      eval:
        lmeval:
          permitCodeExecution: allow
          permitOnline: allow
      managementState: Managed
EOF
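Before checking the individual pods, you can also watch the DSC itself until its PHASE shows Ready (same pattern as the DSCI check above):

watch oc get dsc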

4. Check that everything is running

4.1 Check that your operators are running:

oc get pods -n redhat-ods-operator

Will return:

NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h

4.2 Check that the service mesh operator is running:

oc get pods -n openshift-operators 

Will return:

NAME                              READY   STATUS    RESTARTS        AGE
istio-cni-node-v2-5-9qkw7         1/1     Running   0               84s
istio-cni-node-v2-5-dbtz5         1/1     Running   0               84s
istio-cni-node-v2-5-drc9l         1/1     Running   0               84s
istio-cni-node-v2-5-k4x4t         1/1     Running   0               84s
istio-cni-node-v2-5-pbltn         1/1     Running   0               84s
istio-cni-node-v2-5-xbmz5         1/1     Running   0               84s
istio-operator-6c99f6bf7b-4ckdx   1/1     Running   1 (2m39s ago)   2m56s

4.3 Check that the DSC components are running:

watch oc get pods -n redhat-ods-applications

Will return:

NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7784c9878b-4fkv9                      1/1     Running   0          51s
kubeflow-training-operator-cb487d469-s78ch                      1/1     Running   0          2m11s
kueue-controller-manager-5fb585c7c4-zpdcj                       1/1     Running   0          4m21s
odh-model-controller-7b57f4b9d8-ztrgx                           1/1     Running   0          5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z   1/1     Running   0          2m16s

6. Online lm-eval job

See this Getting started with LM-Eval article for the latest info

6.1 Create a test namespace. For all the jobs, you need to run them in a namespace other than default for the moment.

oc create namespace test
oc project test

6.2. Submit a sample LM-Eval job

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      #template: "templates.classification.multi_class.relation.default"
      template:
        name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n test

And once it pulls the image and runs for about 5 minutes it should look like this:

oc get pods,lmevaljobs -n test

NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                    STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue    Running
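Once the STATE shows Complete, the scores end up on the LMEvalJob itself. A quick way to pull them out (this assumes the operator's status.results field and that you have jq installed):

oc get lmevaljob online-lmeval-glue -n test -o jsonpath='{.status.results}' | jq .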

To clean it up, do this:

oc delete lmevaljob online-lmeval-glue -n test

Another test you can try is "online-unitxt":

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n test

To clean it up, do this:

oc delete lmevaljob online-unitxt -n test

7. Offline Testing with Unitxt

7.0.1 Create a test namespace

oc create namespace test
oc project test

7.0.2 Create a pvc to contain the offline models/etc.

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml

7.0.3 Check the pvc

oc get pvc -n test

And it should look something like this:

NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
lmeval-data   Bound    pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae   20Gi       RWO            portworx-watson-assistant-sc   29s

7.1 ARC-Easy Testing

7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml

7.1.2 Check for when it's complete

watch oc get pods -n test

7.1.3 Delete the lmeval-downloader pod once it's complete:

oc delete pod lmeval-downloader -n test

7.1.4 Apply the yaml for the ARCEasy

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

It should start up a pod

watch oc get pods -n test

and it'll look like this:

offline-lmeval-arceasy-test         0/1     Completed   0          14m
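If you want to see the actual scores, the pod logs (the pod shares the job's name) or the job's status.results field both work, for example:

oc logs pod/offline-lmeval-arceasy-test -n test
oc get lmevaljob offline-lmeval-arceasy-test -n test -o jsonpath='{.status.results}'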

7.1.5 Even though the pod is complete, it doesn't release the PVC, so the next test might fail if the first pod still exists and the next loader job lands on a different worker node. Therefore, delete the first offline job once you're sure it has completed:

oc delete lmevaljob offline-lmeval-arceasy-test -n test

7.2 Testing UNITXT (this takes a while)

7.2.1 Using the same pvc as above, apply the unitxt loader

oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml

7.2.2 Check for when it's complete

watch oc get pods -n test

7.2.3 Delete the lmeval-downloader pod

oc delete pod lmeval-downloader -n test

7.2.4 Apply the unitxt YAML

cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        #template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF

And the pod should start up:

watch oc get pods -n test

And it'll look like this eventually when it's done:

offline-lmeval-unitxt-test         0/1     Completed   0          14m

8. Cleanup

8.1 Cleanup of your lmevaljob(s), for example

oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob offline-lmeval-arceasy-test offline-lmeval-unitxt-test -n test

8.2 Cleanup of your Kueue resources, if you want that:

oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2

8.3 Cleanup of dsc items (if you want that)

oc delete dsc default-dsc

8.4 Cleanup of DSCI (if you want that)

oc delete dsci default-dsci

8.5 Cleanup of the Operators (if you want that)

oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete sub authorino-operator -n openshift-operators
oc delete sub kiali-ossm -n openshift-operators
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv servicemeshoperator.v2.6.11 -n openshift-operators
oc delete csv serverless-operator.v1.36.1 -n openshift-serverless
oc delete csv authorino-operator.v1.2.3 kiali-operator.v2.11.4  -n redhat-ods-applications-auth-provider
oc delete crd servicemeshcontrolplanes.maistra.io  servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io  servicemeshpolicies.authentication.maistra.io  servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete csv rhods-operator.2.25.0 -n redhat-ods-operator

8.6 Cleanup of the operatorgroup

oc delete OperatorGroup rhods-operator -n redhat-ods-operator

Getting the dashboard up and running

Note, I'm getting info about this from: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/monitoring_data_science_models/evaluating-large-language-models_monitor#performing-model-evaluations-in-the-dashboard_monitor

  1. Edit the DSC:

oc edit dsc default-dsc

and change the dashboard to Managed:

    dashboard:
      managementState: Managed

Also change these components to Managed: codeflare, datasciencepipelines, ray, trainingoperator, workbenches, kueue, and modelmeshserving.
  2. Edit the dashboard CR:

oc edit odhdashboardconfig odh-dashboard-config -n redhat-ods-applications

and add this line in the dashboardConfig spec:

    disableLMEval: false

  3. Note: you might have to restart the dashboard pods after this, I'm not sure:

oc get pods -n redhat-ods-applications

  4. Get the route for the dashboard:

oc get route -n redhat-ods-applications

and then put this host URL into the browser, for example:

http://rhods-dashboard-redhat-ods-applications.apps.jim416.cp.fyre.ibm.com

  5. Click on Models --> Model evaluation runs

  6. Install a TrustyAIService in your project. I'm going to do it in "test" (a sample CR is sketched below).
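For reference, a minimal TrustyAIService CR, based on the upstream TrustyAI example (adjust the storage and data values for your setup):

cat << EOF | oc apply -n test -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
  name: trustyai-service
spec:
  storage:
    format: "PVC"
    folder: "/inputs"
    size: "1Gi"
  data:
    filename: "data.csv"
    format: "CSV"
  metrics:
    schedule: "5s"
EOF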

Deploying a Model

  1. I've done all the steps to get RHOAI and the dashboard up.

  2. I'm watching this video Guide to Deploying AI Models on Red Hat OpenShift AI

2.1 oc login to the cluster on the CLI:

https://github.com/IsaiahStapleton/rhoai-model-deployment-guide

2.2 On the server

mkdir JIM ; cd JIM
git clone https://github.com/IsaiahStapleton/rhoai-model-deployment-guide.git
cd rhoai-model-deployment-guide/
oc create project test ; oc project test

2.3 Apply minio

oc apply -f minio-setup.yaml

and check it:

oc get pods,routes
  3. Download a model

3.1 Go to Hugging Face and find a good Granite model:

    https://huggingface.co/ibm-granite/granite-4.0-h-micro/tree/main

3.2 First you need to log in to Hugging Face and generate an access token:

        1. Navigate to settings -> access tokens
        2. Select create new token
        3. For token type, select Read and then give it a name
        4. Copy the token
  4. Download the model on your laptop
cd /Users/jamesbusche/projects/LM-EVAL
git clone https://jbusche:<HF_TOKEN>@huggingface.co/ibm-granite/granite-4.0-h-micro
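Note: the model weights in that repo are stored with Git LFS, so make sure git-lfs is installed before cloning, otherwise you'll only get small pointer files:

git lfs install
git lfs pull   # run this inside the clone if you already cloned without LFS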
  5. Upload the model to the S3 MinIO storage

5.1 Log in to the MinIO UI from oc get routes -n test

oc get route -n test |grep minio-ui

minio-ui minio-ui-test.apps.jim4162.cp.fyre.ibm.com minio-service ui edge/Redirect None

Put the output into the browser, for example: https://minio-ui-test.apps.jim4162.cp.fyre.ibm.com/ (user = minio, password = minio123)

5.2 Create a bucket. He suggests "models".

5.3 Click on Upload Folder. Pick the laptop folder where hugging face was downloaded.

folder would be: /Users/jamesbusche/projects/LM-EVAL/granite-4.0-h-micro
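If you'd rather upload from the command line, here's a sketch using the MinIO client (mc). This assumes mc is installed on your laptop and uses --insecure because the route has a self-signed certificate; the alias name "myminio" is just an example:

mc --insecure alias set myminio https://minio-api-test.apps.jim4162.cp.fyre.ibm.com minio minio123
mc --insecure mb myminio/models
mc --insecure cp --recursive /Users/jamesbusche/projects/LM-EVAL/granite-4.0-h-micro myminio/models/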

  6. Deploying the model on RHOAI

     6.1 Login to your RHOAI dashboard
    
        oc get route -n redhat-ods-applications
    https://rhods-dashboard-redhat-ods-applications.apps.jim4162.cp.fyre.ibm.com

    kubeadmin/PTUiG-k6chu-GCFrS-mS8XL

    6.2 Within your Data Science project (test), navigate to Connections and select Create a connection

    6.3 Fill in the following values:
    
    get the endpoint from: 
        oc get route -n test |grep minio-api
    minio-api   minio-api-test.apps.jim4162.cp.fyre.ibm.com          minio-service   api    edge/Redirect   None
        Connection type: S3 compatible object storage - v1
        Connection name: My Data Connection
        Access key: minio
        secret key: minio123
        Endpoint: https://minio-api-test.apps.jim4162.cp.fyre.ibm.com
        Region: (leaving blank)
        Bucket: models
    7. Deploy your model

    7.1 Navigate to models under your project to deploy a model
        • Model deployment name: demo-granite
        • Serving runtime: vLLM ServingRuntime for KServe
        • Model server size: You can select whatever size you wish, for this guide I will keep the small size
        • Accelerator: Select NVIDIA GPU
        • Model route: Select check box for "Make deployed models available through an external route" this will enable us to send requests to the model endpoint from outside the cluster
        • Token authentication: Select check box for "Require token authentication" this makes it so that sending requests to the model endpoint requires a token, which is important for security. You can leave the service account name as default-name
        • Source model location: Select the data connection that you set up in step 6.2 (My Data Connection). Then provide it the path to your model. (granite-4.0-h-micro)
    7.2 Hit Deploy
    
    8. I can see that it is starting…
        oc get pods
        NAME                                      READY   STATUS       RESTARTS      AGE
        demo-granite-predictor-668b5b7c47-bkx76   0/2     Init:Error   2 (31s ago)   74s
        minio-6fbc45498-cj8th                     1/1     Running      0             20h
    I'm getting this:
        oc logs -f demo-granite-predictor-668b5b7c47-bkx76 -c storage-initializer
    botocore.exceptions.SSLError: SSL validation failed for https://minio-api-test.apps.jim4162.cp.fyre.ibm.com/models?prefix=granite-4.0-h-micro&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)
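One possible workaround (untested here) is to point the data connection at the in-cluster MinIO API service instead of the external route, so the storage-initializer never sees the self-signed route certificate. The service name comes from the route output above; the port is whatever minio-setup.yaml exposes:

oc get svc minio-service -n test
# then use something like http://minio-service.test.svc.cluster.local:9000 as the connection Endpoint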