Installing RHOAI 3.0.0 on OpenShift 4.20.x and adding LM‐Eval
Refer to the Red Hat docs for more detail.
Table of Contents
- Prerequisites
- Monitor DSCI
- Install the Red Hat OpenShift AI components via DSC
- Check that everything is running
- TBD - Kueue setup
- Online LM-Eval Job
- Offline testing with unitxt
- Cleanup
- Dashboard
Prerequisites
0.1 OpenShift cluster up and running. (I've been using OpenShift 4.20.5; I haven't tested whether RHOAI 3.0.0 works with older OpenShift versions.)
0.2 Logged into the OpenShift UI. Note: I install everything from the command line, but I need the UI for the "Copy login command" option to get the oc login token.
0.3 Also logged into the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
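A quick sanity check that the GPU stack is healthy (assuming you installed the NVIDIA GPU Operator into its default nvidia-gpu-operator namespace):
oc get pods -n nvidia-gpu-operator
oc get clusterpolicy
The ClusterPolicy should report a ready state before you try to schedule GPU workloads.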
If you want to run LM-Eval offline, you need a PVC. This is how I did it on my 4.20.5 cluster:
0.5 Install Portworx with these steps:
0.5.1 Log in to the OpenShift console and click on:
Ecosystem --> Software catalog
then search for "portworx" and click Install, accepting all the defaults.
0.5.2 Approve the install
0.5.3 Click on "Create StorageCluster" and accept the defaults, click Create at the bottom.
0.5.4 Watch the status of the portworx storage cluster in the console, or you can issue a command line query like this:
watch oc get storagecluster,pods -n openshift-operators
Which will return:
NAME       CLUSTER UUID   STATUS         VERSION   AGE
portworx                  Initializing   3.5.0     102s
0.5.5 Configure one of the existing storage classes as default. Let's try this:
oc patch sc px-csi-db -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
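To double-check that the default annotation took effect:
oc get storageclass
The px-csi-db class should now show "(default)" after its name.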
0.5.6 Create the test namespace and switch to it:
oc create namespace test
oc project test
0.5.7 Now create a PVC and see if it works:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
0.5.8 Check the pvc
oc get pvc -n test
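If the PVC sits in Pending, describing it usually shows why (for example, no default storage class, or Portworx still initializing):
oc describe pvc lmeval-data -n test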
1. Install the RHOAI operator
1.1 Create a namespace and OperatorGroup:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.2 Create a subscription (Recommend changing installPlanApproval to Manual in production)
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/rhods-operator.redhat-ods-operator: ""
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: fast-3.x
  installPlanApproval: Automatic
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: rhods-operator.3.0.0
EOF
And watch that it starts:
watch oc get pods,csv -n redhat-ods-operator
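If you'd rather check just the install status, the CSV phase should eventually read Succeeded:
oc get csv rhods-operator.3.0.0 -n redhat-ods-operator -o jsonpath='{.status.phase}'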
Monitor DSCI
Watch the DSCI until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
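If you'd rather block until it's ready instead of watching, oc wait can poll the same phase field:
oc wait --for=jsonpath='{.status.phase}'=Ready dsci/default-dsci --timeout=10m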
Install the Red Hat OpenShift AI components via DSC
Create the DataScienceCluster:
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v2
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/name: datasciencecluster
  name: default-dsc
spec:
  components:
    aipipelines:
      argoWorkflowsControllers:
        managementState: Managed
      managementState: Managed
    dashboard:
      managementState: Managed
    feastoperator:
      managementState: Removed
    kserve:
      managementState: Managed
      nim:
        managementState: Managed
      rawDeploymentServiceConfig: Headless
    kueue:
      defaultClusterQueueName: default
      defaultLocalQueueName: default
      managementState: Removed
    llamastackoperator:
      managementState: Removed
    modelregistry:
      managementState: Managed
      registriesNamespace: rhoai-model-registries
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    trustyai:
      eval:
        lmeval:
          permitCodeExecution: allow
          permitOnline: allow
      managementState: Managed
    workbenches:
      managementState: Managed
      workbenchNamespace: rhods-notebooks
EOF
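Assuming the DSC reports a status.phase the same way the DSCI does, you can wait on it too rather than polling by hand:
oc wait --for=jsonpath='{.status.phase}'=Ready dsc/default-dsc --timeout=15m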
Check that everything is running
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h
4.2 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME                                                              READY   STATUS    RESTARTS   AGE
data-science-pipelines-operator-controller-manager-575f788ht6h4   1/1     Running   0          5m
kserve-controller-manager-64b497ccdd-2b276                        1/1     Running   0          4m40s
kubeflow-training-operator-64664c57b-6qb9p                        1/1     Running   0          4m51s
kuberay-operator-66c9dc86f6-hlcwd                                 1/1     Running   0          4m44s
model-registry-operator-controller-manager-55cdf79bc6-5vxq5       1/1     Running   0          4m59s
notebook-controller-deployment-6697968bbf-9clnc                   1/1     Running   0          4m56s
odh-model-controller-58dfc575d7-6cvhh                             1/1     Running   0          5m
odh-notebook-controller-manager-6c65f4d46f-ntl9m                  1/1     Running   0          4m58s
rhods-dashboard-6c8744b89-f9ggh                                   4/4     Running   0          4m55s
rhods-dashboard-6c8744b89-t7f8j                                   4/4     Running   0          4m55s
trustyai-service-operator-controller-manager-58d9bc5459-4p8g9     1/1     Running   0          4m41s
Online LM-Eval Job
See this Getting started with LM-Eval article for the latest info.
6.1 Create a test namespace. For the moment, all the jobs need to run in a namespace other than default.
oc create namespace test
oc project test
6.2 Submit a sample LM-Eval job:
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        # template: "templates.classification.multi_class.relation.default"
        template:
          name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
And once it pulls the image and runs for about 5 minutes, it should look like this:
oc get pods,lmevaljobs -n test
NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                    STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue    Running
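While it's running (or after), you can follow the evaluation output straight from the pod logs; the pod shares the job's name:
oc logs -f pod/online-lmeval-glue -n test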
To clean it up, do this:
oc delete lmevaljob online-lmeval-glue -n test
Another test you can try is an online unitxt job:
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        # template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
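Once the job reaches the Complete state, my understanding is that the scores land in the CR's status as a JSON string, so you can pull them out with jsonpath (piping to jq is optional, assuming you have it installed):
oc get lmevaljob online-unitxt -n test -o jsonpath='{.status.results}' | jq .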
To clean it up, do this:
oc delete lmevaljob online-unitxt -n test
Offline testing with unitxt
7.0.1 Create a test namespace (if you haven't already):
oc create namespace test
oc project test
7.0.2 Create a PVC to hold the offline models, datasets, etc.:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
7.0.3 Check the pvc
oc get pvc -n test
And it should look something like this:
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
lmeval-data   Bound    pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae   20Gi       RWO            portworx-watson-assistant-sc   29s
7.1 ARC-Easy testing
7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml
7.1.2 Check for when it's complete
watch oc get pods -n test
7.1.3 Delete the lmeval-downloader pod once it's complete:
oc delete pod lmeval-downloader -n test
7.1.4 Apply the YAML for the ARC-Easy job:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
It should start up a pod
watch oc get pods -n test
and it'll look like this:
offline-lmeval-arceasy-test 0/1 Completed 0 14m
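You can also query the job's state directly rather than inferring it from the pod:
oc get lmevaljob offline-lmeval-arceasy-test -n test -o jsonpath='{.status.state}'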
7.1.5 Even though the pod is done, it doesn't release the PVC, so the next test might fail if this first pod still exists and the downloader pod lands on a different worker node. So delete the first offline job once you're sure it's complete:
oc delete lmevaljob offline-lmeval-arceasy-test -n test
7.2 Testing UNITXT (this takes a while)
7.2.1 Using the same pvc as above, apply the unitxt loader
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml
7.2.2 Check for when it's complete
watch oc get pods -n test
7.2.3 Delete the lmeval-downloader pod
oc delete pod lmeval-downloader -n test
7.2.4 Apply the unitxt YAML:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        # template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
And the pod should start up:
watch oc get pods -n test
And it'll look like this eventually when it's done:
offline-lmeval-unitxt-test 0/1 Completed 0 14m
Cleanup
8.1 Clean up your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob offline-lmeval-arceasy-test offline-lmeval-unitxt-test -n test
8.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
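If those short names aren't registered on your cluster, the full Kueue resource names should work as well:
oc delete resourceflavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete clusterqueue cq-small
oc delete localqueue lq-trainer
oc delete workloadpriorityclass p1 p2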
8.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
8.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
8.5 Cleanup of the Operators (if you want that)
oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete sub authorino-operator -n openshift-operators
oc delete sub kiali-ossm -n openshift-operators
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv servicemeshoperator.v2.6.11 -n openshift-operators
oc delete csv serverless-operator.v1.36.1 -n openshift-serverless
oc delete csv authorino-operator.v1.2.3 kiali-operator.v2.11.4 -n redhat-ods-applications-auth-provider
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete csv rhods-operator.3.0.0 -n redhat-ods-operator
8.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator
Dashboard
Note, I'm getting info about this from: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/monitoring_data_science_models/evaluating-large-language-models_monitor#performing-model-evaluations-in-the-dashboard_monitor
- Edit the DSC and change the dashboard to Managed:
oc edit dsc default-dsc
dashboard:
  managementState: Managed
Also set these components to Managed: codeflare, datasciencepipelines, ray, trainingoperator, workbenches, kueue, and modelmeshserving.
- Edit the dashboard CR:
oc edit odhdashboardconfig odh-dashboard-config -n redhat-ods-applications
and add this line in the dashboardConfig spec:
disableLMEval: false
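If you'd rather not hand-edit the CR, a merge patch should do the same thing (assuming the flag lives under spec.dashboardConfig, which is where the other disable* toggles sit):
oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{"spec":{"dashboardConfig":{"disableLMEval":false}}}'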
- Note: you might have to restart the dashboard pods after this (I'm not sure):
oc get pods -n redhat-ods-applications
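If a restart does turn out to be needed, a rollout restart of the deployment (named rhods-dashboard, judging by the pod names above) is cleaner than deleting pods:
oc rollout restart deployment/rhods-dashboard -n redhat-ods-applications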
- Get the route for the dashboard:
oc get route -n redhat-ods-applications
and then put this host URL into the browser, for example:
http://rhods-dashboard-redhat-ods-applications.apps.jim416.cp.fyre.ibm.com
- Click on Models --> Model evaluation runs
- Install trustyaiservice in your project. I'm going to install it in "test".
- I've done all the steps to get RHOAI and the dashboard up.
- I'm watching this video: Guide to Deploying AI Models on Red Hat OpenShift AI, along with this repo: https://github.com/IsaiahStapleton/rhoai-model-deployment-guide
2.1 oc login to the server on the CLI
2.2 On the server
mkdir JIM ; cd JIM
git clone https://github.com/IsaiahStapleton/rhoai-model-deployment-guide.git
cd rhoai-model-deployment-guide/
oc create project test ; oc project test
2.3 Apply the MinIO setup:
oc apply -f minio-setup.yaml
and check it:
oc get pods,routes
- Download a model
3.1 Go to Hugging Face and find a good Granite model:
https://huggingface.co/ibm-granite/granite-4.0-h-micro/tree/main
3.2 First you need to log in to Hugging Face and generate an access token:
1. Navigate to settings -> access tokens
2. Select create new token
3. For token type, select Read and then give it a name
4. Copy the token
- Download the model on your laptop
cd /Users/jamesbusche/projects/LM-EVAL
git clone https://jbusche:hf_<your-HF-token>@huggingface.co/ibm-granite/granite-4.0-h-micro
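Note that Hugging Face stores the model weights in Git LFS, so make sure git-lfs is installed before cloning; otherwise you'll end up with tiny pointer files instead of the real safetensors:
git lfs install
git lfs pull    # run inside the repo if you already cloned without LFS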
- Upload the model to the S3 MinIO storage
5.1 Log in to the MinIO UI using the route from oc get route -n test:
oc get route -n test |grep minio-ui
minio-ui minio-ui-test.apps.jim4162.cp.fyre.ibm.com minio-service ui edge/Redirect None
Put the output host into the browser, for example: https://minio-ui-test.apps.jim4162.cp.fyre.ibm.com/ with user = minio, password = minio123
5.2 Create a bucket; the video suggests naming it "models".
5.3 Click on Upload Folder and pick the laptop folder where the Hugging Face model was downloaded. In my case the folder is: /Users/jamesbusche/projects/LM-EVAL/granite-4.0-h-micro
- Deploying the model on RHOAI
6.1 Login to your RHOAI dashboard
oc get route -n redhat-ods-applications
https://rhods-dashboard-redhat-ods-applications.apps.jim4162.cp.fyre.ibm.com
kubeadmin / <your kubeadmin password>
6.2 Within your Data Science project (test), navigate to Connections and select "Create connection".
6.3 Fill in the following values. Get the endpoint from:
oc get route -n test |grep minio-api
minio-api   minio-api-test.apps.jim4162.cp.fyre.ibm.com   minio-service   api   edge/Redirect   None
Connection type: S3 compatible object storage - v1
Connection name: My Data Connection
Access key: minio
Secret key: minio123
Endpoint: https://minio-api-test.apps.jim4162.cp.fyre.ibm.com
Region: (leaving blank)
Bucket: models
7. Deploy your model
7.1 Navigate to models under your project to deploy a model
• Model deployment name: demo-granite
• Serving runtime: vLLM ServingRuntime for KServe
• Model server size: You can select whatever size you wish, for this guide I will keep the small size
• Accelerator: Select NVIDIA GPU
• Model route: Check the box for "Make deployed models available through an external route". This lets us send requests to the model endpoint from outside the cluster.
• Token authentication: Check the box for "Require token authentication" so that requests to the model endpoint require a token, which is important for security. You can leave the service account name as default-name.
• Source model location: Select the connection you created earlier (My Data Connection), then provide the path to your model (granite-4.0-h-micro).
7.2 Hit Deploy
8. I can see that it is starting…
oc get pods
NAME                                      READY   STATUS       RESTARTS      AGE
demo-granite-predictor-668b5b7c47-bkx76   0/2     Init:Error   2 (31s ago)   74s
minio-6fbc45498-cj8th                     1/1     Running      0             20h
I'm getting this:
oc logs -f demo-granite-predictor-668b5b7c47-bkx76 -c storage-initializer
botocore.exceptions.SSLError: SSL validation failed for https://minio-api-test.apps.jim4162.cp.fyre.ibm.com/models?prefix=granite-4.0-h-micro&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)
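I haven't verified the fix yet, but the usual workaround for this self-signed-certificate error is to point the connection's Endpoint at the in-cluster MinIO service instead of the external route, so the storage-initializer never hits the router's cert. Assuming the standard setup from minio-setup.yaml (a minio-service listening on the usual MinIO port 9000), that would look like:
Endpoint: http://minio-service.test.svc.cluster.local:9000
Alternatively, you can get the router's CA trusted by the cluster so boto3 can validate the route's certificate.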