Commit 1e7dc77

setup instructions for RHOAI 2.14 (#89)
1 parent f87b4f7 commit 1e7dc77

17 files changed: 827 additions and 2 deletions

SETUP.md

Lines changed: 5 additions & 0 deletions
@@ -42,6 +42,11 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** re
4242
+ [RHOAI 2.10 Uninstall](./setup.RHOAI-v2.10/UNINSTALL.md)
4343

4444
Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
45+
+ Red Hat OpenShift AI 2.14
46+
+ [RHOAI 2.14 Cluster Setup](./setup.RHOAI-v2.14/CLUSTER-SETUP.md)
47+
+ [RHOAI 2.14 Team Setup](./setup.RHOAI-v2.14/TEAM-SETUP.md)
48+
+ [UPGRADING from RHOAI 2.13](./setup.RHOAI-v2.14/UPGRADE.md)
49+
+ [RHOAI 2.14 Uninstall](./setup.RHOAI-v2.14/UNINSTALL.md)
4550
+ Red Hat OpenShift AI 2.11
4651
+ [RHOAI 2.11 Cluster Setup](./setup.RHOAI-v2.11/CLUSTER-SETUP.md)
4752
+ [RHOAI 2.11 Team Setup](./setup.RHOAI-v2.11/TEAM-SETUP.md)

setup.RHOAI-v2.10/mlbatch-subscription.yaml

Lines changed: 1 addition & 1 deletion
@@ -245,7 +245,7 @@ metadata:
   name: rhods-operator
   namespace: redhat-ods-operator
 spec:
-  channel: stable-2.10
+  channel: stable
   installPlanApproval: Manual
   name: rhods-operator
   source: redhat-operators

setup.RHOAI-v2.13/mlbatch-subscription.yaml

Lines changed: 1 addition & 1 deletion
@@ -260,7 +260,7 @@ metadata:
   name: rhods-operator
   namespace: redhat-ods-operator
 spec:
-  channel: fast
+  channel: stable
   installPlanApproval: Manual
   name: rhods-operator
   source: redhat-operators

setup.RHOAI-v2.14/CLUSTER-SETUP.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
1+
# Cluster Setup
2+
3+
The cluster setup installs Red Hat OpenShift AI and Coscheduler, and configures Kueue,
cluster roles, and priority classes.
5+
6+
If MLBatch is deployed on a cluster that previously ran ODH,
[MCAD](https://github.com/project-codeflare/mcad), Red Hat OpenShift AI, or Coscheduler,
remove all traces of those installations first. In particular, delete the
following custom resource definitions (CRDs) if they are present on the
cluster, deleting all of their instances before deleting the CRDs themselves:
11+
```sh
12+
# Delete old appwrappers and crd
13+
oc delete appwrappers --all -A
14+
oc delete crd appwrappers.workload.codeflare.dev
15+
16+
# Delete old noderesourcetopologies and crd
17+
oc delete noderesourcetopologies --all -A
18+
oc delete crd noderesourcetopologies.topology.node.k8s.io
19+
```
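The cleanup above can also be guarded so that it skips CRDs that are not installed. The following is a minimal sketch, not part of MLBatch: the function name `delete_crd_if_present` is hypothetical, and it assumes the instances can be addressed by the full `resource.group` name of the CRD.

```sh
# Hypothetical helper (not part of MLBatch): delete every instance of a custom
# resource and then its CRD, but only if the CRD exists on the cluster.
delete_crd_if_present() {
  crd="$1"   # full CRD name, e.g. appwrappers.workload.codeflare.dev
  if oc get crd "$crd" >/dev/null 2>&1; then
    # Instances first, then the definition, as described above.
    oc delete "$crd" --all -A
    oc delete crd "$crd"
  else
    echo "CRD $crd not found, skipping"
  fi
}
```

For example, `delete_crd_if_present appwrappers.workload.codeflare.dev` followed by `delete_crd_if_present noderesourcetopologies.topology.node.k8s.io`.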
20+
21+
## Priorities
22+
23+
Create `default-priority`, `high-priority`, and `low-priority` priority classes:
24+
```sh
25+
oc apply -f setup.RHOAI-v2.14/mlbatch-priorities.yaml
26+
```
27+
28+
## Coscheduler
29+
30+
Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
31+
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
36+
Patch Coscheduler pod priorities:
37+
```sh
38+
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.14/coscheduler-priority-patch.yaml scheduler-plugins-controller
39+
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.14/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
40+
```
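To confirm that the patch took effect, the priority class of both deployments can be read back. This is a sketch only: the function name `show_coscheduler_priorities` is hypothetical, and the expected value `system-node-critical` is the one set in `coscheduler-priority-patch.yaml`.

```sh
# Hypothetical check (not part of MLBatch): print the priority class
# configured on each Coscheduler deployment.
show_coscheduler_priorities() {
  for deploy in scheduler-plugins-controller scheduler-plugins-scheduler; do
    pc="$(oc get deployment -n scheduler-plugins "$deploy" \
      -o jsonpath='{.spec.template.spec.priorityClassName}')"
    # An empty value means the patch has not been applied.
    echo "$deploy: ${pc:-<unset>}"
  done
}
```

Calling `show_coscheduler_priorities` after the two `oc patch` commands should list `system-node-critical` for both deployments.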
41+
42+
## Red Hat OpenShift AI
43+
44+
Create the Red Hat OpenShift AI subscription:
45+
```sh
46+
oc apply -f setup.RHOAI-v2.14/mlbatch-subscription.yaml
47+
```
48+
Identify install plan:
49+
```sh
50+
oc get ip -n redhat-ods-operator
51+
```
52+
```
53+
NAME            CSV                     APPROVAL   APPROVED
install-kmh8w   rhods-operator.2.14.0   Manual     false
55+
```
56+
Approve the install plan, replacing the generated plan name below with the actual
value:
58+
```sh
59+
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
60+
```
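The two steps above (identify, then approve) can be combined so that the plan name does not need to be copied by hand. This is a minimal sketch under assumptions: the function name `approve_pending_install_plan` is hypothetical, and it refuses to act unless exactly one unapproved install plan exists in the namespace.

```sh
# Hypothetical helper (not part of MLBatch): approve the single unapproved
# install plan in a namespace, or fail if there is none or more than one.
approve_pending_install_plan() {
  ns="${1:-redhat-ods-operator}"
  # jsonpath prints matching names separated by spaces.
  plan="$(oc get ip -n "$ns" \
    -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')"
  case "$plan" in
    "")    echo "no unapproved install plan in $ns" >&2; return 1 ;;
    *" "*) echo "multiple unapproved install plans: $plan" >&2; return 1 ;;
  esac
  oc patch ip -n "$ns" --type merge --patch '{"spec":{"approved":true}}' "$plan"
}
```

Run `approve_pending_install_plan` with no arguments to target the `redhat-ods-operator` namespace used in this guide.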
61+
Create DSC Initialization:
62+
```sh
63+
oc apply -f setup.RHOAI-v2.14/mlbatch-dsci.yaml
64+
```
65+
Create Data Science Cluster:
66+
```sh
67+
oc apply -f setup.RHOAI-v2.14/mlbatch-dsc.yaml
68+
```
69+
The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
70+
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
71+
remaining components such as `dashboard` can be optionally enabled.
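
As an illustration of enabling an optional component, the `managementState` of a component in the `mlbatch-dsc` resource can be switched from `Removed` to `Managed` with a merge patch. The helper name `set_dsc_component_state` is hypothetical; the resource name and field layout are taken from `mlbatch-dsc.yaml`.

```sh
# Hypothetical helper (not part of MLBatch): set the managementState of one
# component in the mlbatch-dsc DataScienceCluster.
set_dsc_component_state() {
  component="$1"   # e.g. dashboard
  state="$2"       # Managed or Removed
  oc patch dsc mlbatch-dsc --type merge \
    --patch "{\"spec\":{\"components\":{\"$component\":{\"managementState\":\"$state\"}}}}"
}
```

For example, `set_dsc_component_state dashboard Managed` would enable the dashboard.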
72+
73+
The configuration of the managed components differs from the default Red Hat OpenShift
74+
AI configuration as follows:
75+
- Kubeflow Training Operator:
76+
- `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
77+
- Kueue:
78+
- `manageJobsWithoutQueueName` is enabled,
79+
- `batch/job` integration is disabled,
80+
- `waitForPodsReady` is disabled,
81+
- `LendingLimit` feature gate is enabled,
82+
- `enableClusterQueueResources` metrics are enabled,
83+
- Codeflare operator:
84+
- the AppWrapper controller is enabled and configured as follows:
85+
- `userRBACAdmissionCheck` is disabled,
86+
- `schedulerName` is set to `scheduler-plugins-scheduler`,
87+
- `queueName` is set to `default-queue`,
88+
- pod priorities, resource requests and limits have been adjusted.
89+
90+
91+
92+
## Kueue Configuration
93+
94+
Create Kueue's default flavor:
95+
```sh
96+
oc apply -f setup.RHOAI-v2.14/default-flavor.yaml
97+
```
98+
99+
## Cluster Role
100+
101+
Create `mlbatch-edit` role:
102+
```sh
103+
oc apply -f setup.RHOAI-v2.14/mlbatch-edit-role.yaml
104+
```
105+
106+
## Slack Cluster Queue
107+
108+
Create the designated slack `ClusterQueue` which will be used to automate
109+
minor adjustments to cluster capacity caused by node failures and
110+
scheduler maintenance.
111+
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
142+
Edit the above quantities to adjust the quota to the desired
143+
values. Pod counts are optional and can be omitted from the list of
144+
covered resources. The `lendingLimit` for each resource will be
145+
dynamically adjusted by the MLBatch system to reflect reduced cluster
146+
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
147+
detailed discussion of the role of the slack `ClusterQueue`.
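
To observe the quota and the dynamically adjusted `lendingLimit` of a queue, its resources can be listed one per line. This is a sketch only: the function name `show_queue_limits` is hypothetical, and the `lendingLimit` column is empty for any resource where no limit is currently set.

```sh
# Hypothetical helper (not part of MLBatch): print name, nominalQuota, and
# lendingLimit for every resource of a ClusterQueue, tab-separated.
show_queue_limits() {
  queue="${1:-slack-cluster-queue}"
  oc get clusterqueue "$queue" \
    -o jsonpath='{range .spec.resourceGroups[*].flavors[*].resources[*]}{.name}{"\t"}{.nominalQuota}{"\t"}{.lendingLimit}{"\n"}{end}'
}
```

Run `show_queue_limits` for the slack queue, or pass another queue name such as `team1-cluster-queue`.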

setup.RHOAI-v2.14/TEAM-SETUP.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
# Team Setup
2+
3+
A *team* in MLBatch is a group of users that share a resource quota.
4+
5+
Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
6+
for a discussion of our recommended best practices.
7+
8+
9+
Setting up a new team requires the cluster admin to create a project,
10+
a user group, a quota, a queue, and the required role bindings as described below.
11+
12+
Create project:
13+
```sh
14+
oc new-project team1
15+
```
16+
Create user group:
17+
```sh
18+
oc adm groups new team1-edit-group
19+
```
20+
Add users to the group, for example:
21+
```sh
22+
oc adm groups add-users team1-edit-group user1
23+
```
24+
Bind cluster role to group in namespace:
25+
```sh
26+
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
27+
```
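When several teams need to be created, the four commands above can be wrapped in a function that takes the team name and its users as parameters. This is a minimal sketch: the function name `setup_team_access` is hypothetical, and it assumes the `<team>-edit-group` naming convention used in the example above.

```sh
# Hypothetical helper (not part of MLBatch): create the project, the user
# group, and the role binding for one team.
setup_team_access() {
  team="$1"; shift          # remaining arguments are user names
  group="${team}-edit-group"
  oc new-project "$team"
  oc adm groups new "$group"
  for user in "$@"; do
    oc adm groups add-users "$group" "$user"
  done
  oc adm policy add-role-to-group mlbatch-edit "$group" --role-namespace="" --namespace "$team"
}
```

For example, `setup_team_access team1 user1` reproduces the steps shown above. Note that `oc new-project` also switches the current project to the new one.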
28+
29+
Specify the intended quota for the namespace by creating a `ClusterQueue`:
30+
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
71+
Edit the above quantities to adjust the quota to the desired values. Pod counts
72+
are optional and can be omitted from the list of covered resources.
73+
74+
Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
75+
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
76+
namespaces from borrowing quota from this namespace.
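
If the `ClusterQueue` manifest is kept in a file, the uncommenting can be scripted rather than done by hand. The sketch below is an illustration only: the function name `uncomment_limit` and the file name `team1-cluster-queue.yaml` are hypothetical, and the filter assumes the commented lines look exactly like `# borrowingLimit: 0` as in the manifest above.

```sh
# Hypothetical filter (not part of MLBatch): read a manifest on stdin and
# uncomment every line for the given key, preserving indentation.
uncomment_limit() {
  key="$1"   # borrowingLimit or lendingLimit
  sed -E "s/^([[:space:]]*)#[[:space:]]*(${key}:)/\1\2/"
}
```

For example, `uncomment_limit borrowingLimit < team1-cluster-queue.yaml | oc apply -f-` would apply the manifest with borrowing disabled while leaving the `lendingLimit` lines commented out.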
77+
78+
Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
79+
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
89+
We recommend naming the local queue `default-queue` as `AppWrappers` will
90+
default to this queue name.
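
To verify the binding, the `clusterQueue` field of the local queue can be read back and compared with the expected name. This is a sketch: the function name `check_local_queue` is hypothetical, and it assumes the local queue is named `default-queue` as recommended above.

```sh
# Hypothetical check (not part of MLBatch): confirm that default-queue in a
# namespace points at the expected ClusterQueue.
check_local_queue() {
  ns="$1"
  expected="$2"
  actual="$(oc get localqueue -n "$ns" default-queue -o jsonpath='{.spec.clusterQueue}')"
  if [ "$actual" = "$expected" ]; then
    echo "default-queue in $ns points to $expected"
  else
    echo "default-queue in $ns points to '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

For example: `check_local_queue team1 team1-cluster-queue`.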
91+

setup.RHOAI-v2.14/UNINSTALL.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
# Uninstall
2+
3+
***First, remove all team projects and corresponding cluster queues.***
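
As a sketch of this first step, a team created with the Team Setup instructions could be removed as follows. The function name `remove_team` is hypothetical, and it assumes the `<team>-cluster-queue` naming convention from the Team Setup example; adjust the names if your queues are named differently.

```sh
# Hypothetical helper (not part of MLBatch): delete a team's project and its
# corresponding ClusterQueue.
remove_team() {
  team="$1"
  # Deleting the project also removes the namespaced LocalQueue.
  oc delete project "$team"
  oc delete clusterqueue "${team}-cluster-queue"
}
```

For example: `remove_team team1`.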
4+
5+
Then to uninstall the MLBatch controllers and reclaim the corresponding
6+
namespaces, run:
7+
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
  dscinitializations.dscinitialization.opendatahub.io \
  datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```

setup.RHOAI-v2.14/UPGRADE.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
# Upgrading from RHOAI 2.13
2+
3+
These instructions assume you installed and configured RHOAI 2.13 following
4+
the MLBatch [install instructions for RHOAI-v2.13](../setup.RHOAI-v2.13/CLUSTER-SETUP.md)
5+
and are subscribed to the fast channel.
6+
7+
Your subscription will have automatically created an unapproved
8+
install plan to upgrade to RHOAI 2.14.
9+
10+
Before beginning, verify that the expected install plan exists:
11+
```sh
12+
oc get ip -n redhat-ods-operator
13+
```
14+
Typical output would be:
15+
```sh
16+
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.14.0   Manual     false
install-nqrbp   rhods-operator.2.13.0   Manual     true
19+
```
20+
21+
Assuming the install plan exists, you can begin the upgrade process.
22+
23+
There are no MLBatch modifications to the default RHOAI configuration maps
24+
beyond those already made in previous installs. Therefore, you can simply
approve the install plan, replacing the example plan name below with the actual
value on your cluster:
27+
```sh
28+
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
29+
```
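After approval, progress can be followed by listing the cluster service versions and their phases. This is a sketch under assumptions: the function name `show_csv_phases` is hypothetical, and a completed upgrade is expected to show `rhods-operator.2.14.0` in the `Succeeded` phase, which is the usual OLM value for a finished install.

```sh
# Hypothetical check (not part of MLBatch): print each CSV in the operator
# namespace with its current phase, tab-separated.
show_csv_phases() {
  oc get csv -n redhat-ods-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}
```

Run `show_csv_phases` repeatedly until the new version reports its final phase.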

setup.RHOAI-v2.14/coscheduler-priority-patch.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical

setup.RHOAI-v2.14/default-flavor.yaml

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor

setup.RHOAI-v2.14/mlbatch-dsc.yaml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: mlbatch-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Removed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed
