Installing RHOAI 3.0.0 on OpenShift 4.20.x and adding LM‐Eval
Refer to the Red Hat docs for more detail.
Table of Contents
- Prerequisites
- Monitor DSCI
- Install the Red Hat OpenShift AI components via DSC
- Check that everything is running
- TBD - Kueue setup
- Online LM-Eval Job
- Offline testing with unitxt
- Cleanup
- Dashboard
Prerequisites
0.1 OpenShift cluster up and running. (I've been using OpenShift 4.20.5; I haven't tested whether RHOAI 3.0.0 works with older OpenShift versions.)
0.2 Logged into the OpenShift UI. Note: I install everything from the command line, but I need the UI for the "Copy login command" option to get the oc login token.
0.3 Also logged into the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
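A quick sanity check that the GPU stack is healthy (assuming you installed the NVIDIA GPU Operator into its default nvidia-gpu-operator namespace):
oc get pods -n nvidia-gpu-operator
oc get clusterpolicy
The ClusterPolicy should report a ready state before you try to schedule GPU workloads.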
If you want to run LM-Eval offline, you need a PVC. This is how I did it on my 4.20.5 cluster:
0.5 Install Portworx with these steps:
0.5.1 Log in to the OpenShift console and click on:
Ecosystem --> Software catalog
then search for "portworx" and click Install, accepting all the defaults.
0.5.2 Approve the install
0.5.3 Click on "Create StorageCluster" and accept the defaults, click Create at the bottom.
0.5.4 Watch the status of the portworx storage cluster in the console, or you can issue a command line query like this:
watch oc get storagecluster,pods -n openshift-operators
Which will return:
NAME       CLUSTER UUID   STATUS         VERSION   AGE
portworx                  Initializing   3.5.0     102s
0.5.5 Configure one of the existing storage classes as default. Let's try this:
oc patch sc px-csi-db -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
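To double-check that the default annotation took effect:
oc get storageclass
The px-csi-db class should now show "(default)" after its name.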
0.5.6 Create the test namespace and switch to it:
oc create namespace test
oc project test
0.5.7 Now create a PVC and see if it works:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
0.5.8 Check the pvc
oc get pvc -n test
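If the PVC sits in Pending, describing it usually shows why (for example, no default storage class, or Portworx still initializing):
oc describe pvc lmeval-data -n test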
1. Install the RHOAI operator
1.1 Create a namespace and OperatorGroup:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.2 Create a subscription (Recommend changing installPlanApproval to Manual in production)
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/rhods-operator.redhat-ods-operator: ""
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: fast-3.x
  installPlanApproval: Automatic
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: rhods-operator.3.0.0
EOF
And watch that it starts:
watch oc get pods,csv -n redhat-ods-operator
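If you'd rather check just the install status, the CSV phase should eventually read Succeeded:
oc get csv rhods-operator.3.0.0 -n redhat-ods-operator -o jsonpath='{.status.phase}'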
Monitor DSCI
Watch the DSCI until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
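If you'd rather block until it's ready instead of watching, oc wait can poll the same phase field:
oc wait --for=jsonpath='{.status.phase}'=Ready dsci/default-dsci --timeout=10m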
Install the Red Hat OpenShift AI components via DSC
Create the DataScienceCluster:
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v2
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/name: datasciencecluster
  name: default-dsc
spec:
  components:
    aipipelines:
      argoWorkflowsControllers:
        managementState: Managed
      managementState: Managed
    dashboard:
      managementState: Managed
    feastoperator:
      managementState: Removed
    kserve:
      managementState: Managed
      nim:
        managementState: Managed
      rawDeploymentServiceConfig: Headless
    kueue:
      defaultClusterQueueName: default
      defaultLocalQueueName: default
      managementState: Removed
    llamastackoperator:
      managementState: Removed
    modelregistry:
      managementState: Managed
      registriesNamespace: rhoai-model-registries
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    trustyai:
      eval:
        lmeval:
          permitCodeExecution: allow
          permitOnline: allow
      managementState: Managed
    workbenches:
      managementState: Managed
      workbenchNamespace: rhods-notebooks
EOF
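Assuming the DSC reports a status.phase the same way the DSCI does, you can wait on it too rather than polling by hand:
oc wait --for=jsonpath='{.status.phase}'=Ready dsc/default-dsc --timeout=15m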
Check that everything is running
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h
4.2 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME                                                              READY   STATUS    RESTARTS   AGE
data-science-pipelines-operator-controller-manager-575f788ht6h4   1/1     Running   0          5m
kserve-controller-manager-64b497ccdd-2b276                        1/1     Running   0          4m40s
kubeflow-training-operator-64664c57b-6qb9p                        1/1     Running   0          4m51s
kuberay-operator-66c9dc86f6-hlcwd                                 1/1     Running   0          4m44s
model-registry-operator-controller-manager-55cdf79bc6-5vxq5       1/1     Running   0          4m59s
notebook-controller-deployment-6697968bbf-9clnc                   1/1     Running   0          4m56s
odh-model-controller-58dfc575d7-6cvhh                             1/1     Running   0          5m
odh-notebook-controller-manager-6c65f4d46f-ntl9m                  1/1     Running   0          4m58s
rhods-dashboard-6c8744b89-f9ggh                                   4/4     Running   0          4m55s
rhods-dashboard-6c8744b89-t7f8j                                   4/4     Running   0          4m55s
trustyai-service-operator-controller-manager-58d9bc5459-4p8g9     1/1     Running   0          4m41s
Online LM-Eval Job
See this Getting started with LM-Eval article for the latest info.
6.1 Create a test namespace. For the moment, all the jobs need to run in a namespace other than default.
oc create namespace test
oc project test
6.2 Submit a sample LM-Eval job:
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-lmeval-glue"
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        # template: "templates.classification.multi_class.relation.default"
        template:
          name: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
And once it pulls the image and runs for about 5 minutes, it should look like this:
oc get pods,lmevaljobs -n test
NAME                     READY   STATUS    RESTARTS   AGE
pod/online-lmeval-glue   1/1     Running   0          25s

NAME                                                    STATE
lmevaljob.trustyai.opendatahub.io/online-lmeval-glue    Running
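While it's running (or after), you can follow the evaluation output straight from the pod logs; the pod shares the job's name:
oc logs -f pod/online-lmeval-glue -n test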
To clean it up, do this:
oc delete lmevaljob online-lmeval-glue -n test
Another test you can try is an online unitxt job:
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "online-unitxt"
  namespace: test
spec:
  allowOnline: true
  model: hf
  modelArgs:
    - name: pretrained
      value: "google/flan-t5-base"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        # template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n test
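Once the job reaches the Complete state, my understanding is that the scores land in the CR's status as a JSON string, so you can pull them out with jsonpath (piping to jq is optional, assuming you have it installed):
oc get lmevaljob online-unitxt -n test -o jsonpath='{.status.results}' | jq .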
To clean it up, do this:
oc delete lmevaljob online-unitxt -n test
Offline testing with unitxt
7.0.1 Create a test namespace (if you haven't already):
oc create namespace test
oc project test
7.0.2 Create a PVC to hold the offline models, datasets, etc.:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/pvc.yaml
7.0.3 Check the pvc
oc get pvc -n test
And it should look something like this:
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
lmeval-data   Bound    pvc-6ff1abbf-a995-459a-8e9d-f98e5bf1c2ae   20Gi       RWO            portworx-watson-assistant-sc   29s
7.1 ARC-Easy testing
7.1.1 Deploy a Pod that will copy the models and datasets to the PVC:
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/disconnected-flan-arceasy.yaml
7.1.2 Check for when it's complete
watch oc get pods -n test
7.1.3 Delete the lmeval-downloader pod once it's complete:
oc delete pod lmeval-downloader -n test
7.1.4 Apply the YAML for the ARC-Easy job:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-arceasy-test"
  labels:
    opendatahub.io/dashboard: "true"
    lmevaltests: "vllm"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskNames:
      - "arc_easy"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
It should start up a pod
watch oc get pods -n test
and it'll look like this:
offline-lmeval-arceasy-test 0/1 Completed 0 14m
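You can also query the job's state directly rather than inferring it from the pod:
oc get lmevaljob offline-lmeval-arceasy-test -n test -o jsonpath='{.status.state}'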
7.1.5 Even though the pod is done, it doesn't release the PVC, so the next test might fail if this first pod still exists and the downloader pod lands on a different worker node. So delete the first offline job once you're sure it's complete:
oc delete lmevaljob offline-lmeval-arceasy-test -n test
7.2 Testing UNITXT (this takes a while)
7.2.1 Using the same pvc as above, apply the unitxt loader
oc apply -f https://raw.githubusercontent.com/trustyai-explainability/reference/refs/heads/main/lm-eval/resources/downloader-flan-20newsgroups.yaml
7.2.2 Check for when it's complete
watch oc get pods -n test
7.2.3 Delete the lmeval-downloader pod
oc delete pod lmeval-downloader -n test
7.2.4 Apply the unitxt YAML:
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: "offline-lmeval-unitxt-test"
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: "/opt/app-root/src/hf_home/flan"
  taskList:
    taskRecipes:
      - card:
          name: "cards.20_newsgroups_short"
        # template: "templates.classification.multi_class.title"
        template:
          name: "templates.classification.multi_class.title"
  logSamples: true
  offline:
    storage:
      pvcName: "lmeval-data"
  pod:
    container:
      env:
        - name: HF_HUB_VERBOSITY
          value: "debug"
        - name: UNITXT_DEFAULT_VERBOSITY
          value: "debug"
EOF
And the pod should start up:
watch oc get pods -n test
And it'll look like this eventually when it's done:
offline-lmeval-unitxt-test 0/1 Completed 0 14m
Cleanup
8.1 Clean up your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n test
oc delete lmevaljob offline-lmeval-arceasy-test offline-lmeval-unitxt-test -n test
8.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
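If those short names aren't registered on your cluster, the full Kueue resource names should work as well:
oc delete resourceflavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete clusterqueue cq-small
oc delete localqueue lq-trainer
oc delete workloadpriorityclass p1 p2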
8.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
8.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
8.5 Cleanup of the Operators (if you want that)
oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete sub authorino-operator -n openshift-operators
oc delete sub kiali-ossm -n openshift-operators
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv servicemeshoperator.v2.6.11 -n openshift-operators
oc delete csv serverless-operator.v1.36.1 -n openshift-serverless
oc delete csv authorino-operator.v1.2.3 kiali-operator.v2.11.4 -n redhat-ods-applications-auth-provider
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete csv rhods-operator.3.0.0 -n redhat-ods-operator
8.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator
Dashboard
Note, I'm getting info about this from: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/monitoring_data_science_models/evaluating-large-language-models_monitor#performing-model-evaluations-in-the-dashboard_monitor
- Edit the DSC and change the dashboard to Managed:
oc edit dsc default-dsc
dashboard:
  managementState: Managed
Also set these components to Managed: codeflare, datasciencepipelines, ray, trainingoperator, workbenches, kueue, and modelmeshserving.
- Edit the dashboard CR:
oc edit odhdashboardconfig odh-dashboard-config -n redhat-ods-applications
and add this line in the dashboardConfig spec:
disableLMEval: false
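If you'd rather not hand-edit the CR, a merge patch should do the same thing (assuming the flag lives under spec.dashboardConfig, which is where the other disable* toggles sit):
oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{"spec":{"dashboardConfig":{"disableLMEval":false}}}'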
- Note: you might have to restart the dashboard pods after this (I'm not sure):
oc get pods -n redhat-ods-applications
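If a restart does turn out to be needed, a rollout restart of the deployment (named rhods-dashboard, judging by the pod names above) is cleaner than deleting pods:
oc rollout restart deployment/rhods-dashboard -n redhat-ods-applications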
- Get the route for the dashboard:
oc get route -n redhat-ods-applications
and then put this host URL into the browser, for example:
http://rhods-dashboard-redhat-ods-applications.apps.jim416.cp.fyre.ibm.com
- Click on Models --> Model evaluation runs
- Install trustyaiservice in your project. I'm going to install it in "test".
- I've done all the steps to get RHOAI and the dashboard up.
- I'm watching this video: Guide to Deploying AI Models on Red Hat OpenShift AI, along with this repo: https://github.com/IsaiahStapleton/rhoai-model-deployment-guide
2.1 oc login to the server on the CLI
2.2 On the server
mkdir JIM ; cd JIM
git clone https://github.com/IsaiahStapleton/rhoai-model-deployment-guide.git
cd rhoai-model-deployment-guide/
oc create project test ; oc project test
2.3 Apply the MinIO setup:
oc apply -f minio-setup.yaml
and check it:
oc get pods,routes
- Download a model
3.1 Go to Hugging Face and find a good Granite model:
https://huggingface.co/ibm-granite/granite-4.0-h-micro/tree/main
3.2 First you need to log in to Hugging Face and generate an access token:
1. Navigate to settings -> access tokens
2. Select create new token
3. For token type, select Read and then give it a name
4. Copy the token
- Download the model on your laptop
cd /Users/jamesbusche/projects/LM-EVAL
git clone https://jbusche:hf_<your-HF-token>@huggingface.co/ibm-granite/granite-4.0-h-micro
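Note that Hugging Face stores the model weights in Git LFS, so make sure git-lfs is installed before cloning; otherwise you'll end up with tiny pointer files instead of the real safetensors:
git lfs install
git lfs pull    # run inside the repo if you already cloned without LFS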
- Upload the model to the S3 MinIO storage
5.1 Log in to the MinIO UI using the route from oc get route -n test:
oc get route -n test |grep minio-ui
minio-ui minio-ui-test.apps.jim4162.cp.fyre.ibm.com minio-service ui edge/Redirect None
Put the output host into the browser, for example: https://minio-ui-test.apps.jim4162.cp.fyre.ibm.com/ with user = minio, password = minio123
5.2 Create a bucket; the video suggests naming it "models".
5.3 Click on Upload Folder and pick the laptop folder where the Hugging Face model was downloaded. In my case the folder is: /Users/jamesbusche/projects/LM-EVAL/granite-4.0-h-micro
- Deploying the model on RHOAI
6.1 Login to your RHOAI dashboard
oc get route -n redhat-ods-applications
https://rhods-dashboard-redhat-ods-applications.apps.jim4162.cp.fyre.ibm.com
kubeadmin / <your kubeadmin password>
6.2 Within your Data Science project (test), navigate to Connections and select "Create connection".
6.3 Fill in the following values. Get the endpoint from:
oc get route -n test |grep minio-api
minio-api   minio-api-test.apps.jim4162.cp.fyre.ibm.com   minio-service   api   edge/Redirect   None
Connection type: S3 compatible object storage - v1
Connection name: My Data Connection
Access key: minio
Secret key: minio123
Endpoint: https://minio-api-test.apps.jim4162.cp.fyre.ibm.com
Region: (leaving blank)
Bucket: models
7. Deploy your model
7.1 Navigate to models under your project to deploy a model
• Model deployment name: demo-granite
• Serving runtime: vLLM ServingRuntime for KServe
• Model server size: You can select whatever size you wish, for this guide I will keep the small size
• Accelerator: Select NVIDIA GPU
• Model route: Check the box for "Make deployed models available through an external route". This lets us send requests to the model endpoint from outside the cluster.
• Token authentication: Check the box for "Require token authentication" so that requests to the model endpoint require a token, which is important for security. You can leave the service account name as default-name.
• Source model location: Select the connection you created earlier (My Data Connection), then provide the path to your model (granite-4.0-h-micro).
7.2 Hit Deploy
8. I can see that it is starting…
oc get pods
NAME                                      READY   STATUS       RESTARTS      AGE
demo-granite-predictor-668b5b7c47-bkx76   0/2     Init:Error   2 (31s ago)   74s
minio-6fbc45498-cj8th                     1/1     Running      0             20h
I'm getting this:
oc logs -f demo-granite-predictor-668b5b7c47-bkx76 -c storage-initializer
botocore.exceptions.SSLError: SSL validation failed for https://minio-api-test.apps.jim4162.cp.fyre.ibm.com/models?prefix=granite-4.0-h-micro&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)
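I haven't verified the fix yet, but the usual workaround for this self-signed-certificate error is to point the connection's Endpoint at the in-cluster MinIO service instead of the external route, so the storage-initializer never hits the router's cert. Assuming the standard setup from minio-setup.yaml (a minio-service listening on the usual MinIO port 9000), that would look like:
Endpoint: http://minio-service.test.svc.cluster.local:9000
Alternatively, you can get the router's CA trusted by the cluster so boto3 can validate the route's certificate.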