project-codeflare
diff --git a/SETUP.md b/SETUP.md
Lines changed: 5 additions & 0 deletions
diff --git a/setup.RHOAI-v2.10/mlbatch-subscription.yaml b/setup.RHOAI-v2.10/mlbatch-subscription.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/setup.RHOAI-v2.13/mlbatch-subscription.yaml b/setup.RHOAI-v2.13/mlbatch-subscription.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/setup.RHOAI-v2.14/CLUSTER-SETUP.md b/setup.RHOAI-v2.14/CLUSTER-SETUP.md
Lines changed: 147 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/TEAM-SETUP.md b/setup.RHOAI-v2.14/TEAM-SETUP.md
Lines changed: 91 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/UNINSTALL.md b/setup.RHOAI-v2.14/UNINSTALL.md
Lines changed: 23 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/UPGRADE.md b/setup.RHOAI-v2.14/UPGRADE.md
Lines changed: 29 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/coscheduler-priority-patch.yaml b/setup.RHOAI-v2.14/coscheduler-priority-patch.yaml
Lines changed: 3 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/default-flavor.yaml b/setup.RHOAI-v2.14/default-flavor.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/setup.RHOAI-v2.14/mlbatch-dsc.yaml b/setup.RHOAI-v2.14/mlbatch-dsc.yaml
Lines changed: 32 additions & 0 deletions
@@ -42,6 +42,11 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** re
    + [RHOAI 2.10 Uninstall](./setup.RHOAI-v2.10/UNINSTALL.md)
 
 Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
++ Red Hat OpenShift AI 2.14
+   + [RHOAI 2.14 Cluster Setup](./setup.RHOAI-v2.14/CLUSTER-SETUP.md)
+   + [RHOAI 2.14 Team Setup](./setup.RHOAI-v2.14/TEAM-SETUP.md)
+   + [UPGRADING from RHOAI 2.13](./setup.RHOAI-v2.14/UPGRADE.md)
+   + [RHOAI 2.14 Uninstall](./setup.RHOAI-v2.14/UNINSTALL.md)
 + Red Hat OpenShift AI 2.11
    + [RHOAI 2.11 Cluster Setup](./setup.RHOAI-v2.11/CLUSTER-SETUP.md)
    + [RHOAI 2.11 Team Setup](./setup.RHOAI-v2.11/TEAM-SETUP.md)
 
@@ -245,7 +245,7 @@ metadata:
   name: rhods-operator
   namespace: redhat-ods-operator
 spec:
-  channel: stable-2.10
+  channel: stable
   installPlanApproval: Manual
   name: rhods-operator
   source: redhat-operators
 
@@ -260,7 +260,7 @@ metadata:
   name: rhods-operator
   namespace: redhat-ods-operator
 spec:
-  channel: fast
+  channel: stable
   installPlanApproval: Manual
   name: rhods-operator
   source: redhat-operators
 
@@ -0,0 +1,147 @@
+# Cluster Setup
+
+The cluster setup installs Red Hat OpenShift AI and Coscheduler, and configures
+Kueue, cluster roles, and priority classes.
+
+If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
+[MCAD](https://github.com/project-codeflare/mcad), Red Hat OpenShift AI, or Coscheduler,
+make sure to scrub traces of these installations. In particular, delete the
+following custom resource definitions (CRDs) if they are present on the
+cluster, deleting all instances of each before deleting the CRD itself:
+```sh
+# Delete old appwrappers and crd
+oc delete appwrappers --all -A
+oc delete crd appwrappers.workload.codeflare.dev
+
+# Delete old noderesourcetopologies and crd
+oc delete noderesourcetopologies --all -A
+oc delete crd noderesourcetopologies.topology.node.k8s.io
+```
+
+## Priorities
+
+Create `default-priority`, `high-priority`, and `low-priority` priority classes:
+```sh
+oc apply -f setup.RHOAI-v2.14/mlbatch-priorities.yaml
+```
+
+## Coscheduler
+
+Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
+```sh
+helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
+  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
+  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
+```
+Patch Coscheduler pod priorities:
+```sh
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.14/coscheduler-priority-patch.yaml scheduler-plugins-controller
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.14/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
+```
+
+## Red Hat OpenShift AI
+
+Create the Red Hat OpenShift AI subscription:
+```sh
+oc apply -f setup.RHOAI-v2.14/mlbatch-subscription.yaml
+```
+Identify install plan:
+```sh
+oc get ip -n redhat-ods-operator
+```
+```
+NAME            CSV                     APPROVAL   APPROVED
+install-kmh8w   rhods-operator.2.14.0   Manual     false
+```
+Approve the install plan, replacing the generated plan name below with the
+actual value:
+```sh
+oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
+```
+Create DSC Initialization:
+```sh
+oc apply -f setup.RHOAI-v2.14/mlbatch-dsci.yaml
+```
+Create Data Science Cluster:
+```sh
+oc apply -f setup.RHOAI-v2.14/mlbatch-dsc.yaml
+```
+The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
+AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
+remaining components such as `dashboard` can be optionally enabled.
+
+The configuration of the managed components differs from the default Red Hat OpenShift
+AI configuration as follows:
+- Kubeflow Training Operator:
+  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
+- Kueue:
+  - `manageJobsWithoutQueueName` is enabled,
+  - `batch/job` integration is disabled,
+  - `waitForPodsReady` is disabled,
+  - `LendingLimit` feature gate is enabled,
+  - `enableClusterQueueResources` metrics are enabled,
+- Codeflare operator:
+  - the AppWrapper controller is enabled and configured as follows:
+    - `userRBACAdmissionCheck` is disabled,
+    - `schedulerName` is set to `scheduler-plugins-scheduler`,
+    - `queueName` is set to `default-queue`,
+- pod priorities, resource requests and limits have been adjusted.
+
+
+
+## Kueue Configuration
+
+Create Kueue's default flavor:
+```sh
+oc apply -f setup.RHOAI-v2.14/default-flavor.yaml
+```
+
+## Cluster Role
+
+Create `mlbatch-edit` role:
+```sh
+oc apply -f setup.RHOAI-v2.14/mlbatch-edit-role.yaml
+```
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue`, which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance:
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources. The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.
@@ -0,0 +1,91 @@
+# Team Setup
+
+A *team* in MLBatch is a group of users that share a resource quota.
+
+Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
+for a discussion of our recommended best practices.
+
+
+Setting up a new team requires the cluster admin to create a project,
+a user group, a quota, a queue, and the required role bindings as described below.
+
+Create project:
+```sh
+oc new-project team1
+```
+Create user group:
+```sh
+oc adm groups new team1-edit-group
+```
+Add users to the group, for example:
+```sh
+oc adm groups add-users team1-edit-group user1
+```
+Bind cluster role to group in namespace:
+```sh
+oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
+```
+
+Specify the intended quota for the namespace by creating a `ClusterQueue`:
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: team1-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "memory"
+        nominalQuota: 128Gi
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/gpu"
+        nominalQuota: 16
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 4
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "pods"
+        nominalQuota: 100
+        # borrowingLimit: 0
+        # lendingLimit: 0
+EOF
+```
+Edit the above quantities to adjust the quota to the desired values. Pod counts
+are optional and can be omitted from the list of covered resources.
+
+Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
+quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
+namespaces from borrowing quota from this namespace.
+
+Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
+```sh
+oc apply -n team1 -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  name: default-queue
+spec:
+  clusterQueue: team1-cluster-queue
+EOF
+```
+We recommend naming the local queue `default-queue`, because `AppWrappers`
+default to this queue name.
+
@@ -0,0 +1,23 @@
+# Uninstall
+
+***First, remove all team projects and corresponding cluster queues.***
+
+Then, to uninstall the MLBatch controllers and reclaim the corresponding
+namespaces, run:
+```sh
+# OpenShift AI uninstall
+oc delete dsc mlbatch-dsc
+oc delete dsci mlbatch-dsci
+oc delete subscription -n redhat-ods-operator rhods-operator
+oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
+oc delete crd featuretrackers.features.opendatahub.io \
+  dscinitializations.dscinitialization.opendatahub.io \
+  datascienceclusters.datasciencecluster.opendatahub.io
+oc delete operators rhods-operator.redhat-ods-operator
+oc delete operatorgroup -n redhat-ods-operator rhods-operator
+oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator
+
+# Coscheduler uninstall
+helm uninstall -n scheduler-plugins scheduler-plugins
+oc delete namespace scheduler-plugins
+```
@@ -0,0 +1,29 @@
+# Upgrading from RHOAI 2.13
+
+These instructions assume you installed and configured RHOAI 2.13 following
+the MLBatch [install instructions for RHOAI-v2.13](../setup.RHOAI-v2.13/CLUSTER-SETUP.md)
+and are subscribed to the fast channel.
+
+Your subscription will have automatically created an unapproved
+install plan to upgrade to RHOAI 2.14.
+
+Before beginning, verify that the expected install plan exists:
+```sh
+oc get ip -n redhat-ods-operator
+```
+Typical output would be:
+```
+NAME            CSV                     APPROVAL   APPROVED
+install-kpzzl   rhods-operator.2.14.0   Manual     false
+install-nqrbp   rhods-operator.2.13.0   Manual     true
+```
+
+Assuming the install plan exists, you can begin the upgrade process.
+
+There are no MLBatch modifications to the default RHOAI configuration maps
+beyond those already made in previous installs. Therefore, you can simply
+approve the install plan, replacing the example plan name below with the actual
+value on your cluster:
+```sh
+oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
+```
@@ -0,0 +1,3 @@
+- op: add
+  path: /spec/template/spec/priorityClassName
+  value: system-node-critical
@@ -0,0 +1,4 @@
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: default-flavor
@@ -0,0 +1,32 @@
+apiVersion: datasciencecluster.opendatahub.io/v1
+kind: DataScienceCluster
+metadata:
+  name: mlbatch-dsc
+spec:
+  components:
+    codeflare:
+      managementState: Managed
+    dashboard:
+      managementState: Removed
+    datasciencepipelines:
+      managementState: Removed
+    kserve:
+      managementState: Removed
+      serving:
+        ingressGateway:
+          certificate:
+            type: SelfSigned
+        managementState: Removed
+        name: knative-serving
+    kueue:
+      managementState: Managed
+    modelmeshserving:
+      managementState: Removed
+    ray:
+      managementState: Managed
+    trainingoperator:
+      managementState: Managed
+    trustyai:
+      managementState: Removed
+    workbenches:
+      managementState: Removed
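The `ClusterQueue` manifest in TEAM-SETUP.md is written for `team1`. A minimal sketch of how it could be parameterized for additional teams follows. The function name `team_cluster_queue`, its two arguments, and the example team `team2` are hypothetical and not part of MLBatch, and the resource list is trimmed to CPU, memory, and GPUs for brevity.

```sh
# Emit a ClusterQueue manifest for one team (hypothetical helper, not part of MLBatch).
# Usage: team_cluster_queue TEAM GPUS   (both arguments are required)
team_cluster_queue() {
  team=${1:?team name required}
  gpus=${2:?GPU quota required}
  cat << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ${team}-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: ${gpus}
EOF
}

# Print the manifest for a hypothetical team2 with a quota of 8 GPUs.
team_cluster_queue team2 8
```

Once the printed manifest has been reviewed, it could be piped to `oc apply -f-` in the same way as the heredoc in the team setup instructions.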
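Both CLUSTER-SETUP.md and UPGRADE.md ask the reader to find the unapproved install plan by eye and copy its name into the `oc patch ip` command. As a rough sketch, the lookup could be scripted as shown here. The helper name `unapproved_plans` is hypothetical, and the sketch assumes the table layout shown in the sample output (a header row containing `NAME` and `APPROVED` columns).

```sh
# Print the names of install plans that are still unapproved.
# Reads the table printed by `oc get ip` on stdin and locates the NAME and
# APPROVED columns from the header row, so it works with or without a
# leading NAMESPACE column. (Hypothetical helper, not part of MLBatch.)
unapproved_plans() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "APPROVED") approved_col = i
      }
      next
    }
    name_col && approved_col && $approved_col == "false" { print $name_col }
  '
}

# Demo on the sample output from the upgrade instructions (no cluster needed).
printf '%s\n' \
  'NAME            CSV                     APPROVAL   APPROVED' \
  'install-kpzzl   rhods-operator.2.14.0   Manual     false' \
  'install-nqrbp   rhods-operator.2.13.0   Manual     true' |
  unapproved_plans
```

On a live cluster the input would come from `oc get ip -n redhat-ods-operator | unapproved_plans`. Checking the printed name before passing it to `oc patch ip` keeps the manual approval step meaningful.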