# MLBatch Setup

The MLBatch setup consists of a [cluster setup](#cluster-setup) to be done once
and a [project setup](#project-setup) to be repeated for each team. This
document also discusses [quota maintenance](#quota-maintenance).

Batch users should only be permitted to create AppWrappers or workloads whose
types are natively supported by Kueue. The provided `mlbatch-edit` role permits
the creation of `PyTorchJobs`, `RayJobs`, `RayClusters`, and `AppWrappers`.
Kueue at this time has no mechanism for granular quota enforcement for `Jobs`,
i.e., no mechanism to enforce quotas only on user-submitted `Jobs` without
impacting OpenShift-internal `Jobs`. As a consequence, MLBatch disables queuing
and quota management for `Jobs`, and the `mlbatch-edit` role does not give
permission to create `Jobs`. While `Jobs`, `Pods`, and `Deployments` cannot be
created by MLBatch users directly, `AppWrappers` can easily wrap and bundle
resources of these types. See [USAGE.md](USAGE.md) for examples.

This setup has been developed on OpenShift 4.14 and is intended to support
OpenShift 4.12 and up.

To start with, recursively clone and enter this repository:
```sh
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
```

## Cluster Setup

The cluster setup installs OpenShift AI and Coscheduler, configures Kueue,
cluster roles, and priority classes.

If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
[MCAD](https://github.com/project-codeflare/mcad), OpenShift AI, or Coscheduler,
make sure to scrub traces of these installations. In particular, delete the
following custom resource definitions (CRDs) if present on the cluster, making
sure to delete all instances prior to deleting the CRDs:
```sh
# Delete old appwrappers and crd
oc delete appwrappers --all -A
oc delete crd appwrappers.workload.codeflare.dev

# Delete old noderesourcetopologies and crd
oc delete noderesourcetopologies --all -A
oc delete crd noderesourcetopologies.topology.node.k8s.io
```

### Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup/mlbatch-priorities.yaml
```

### Coscheduler

Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
Patch Coscheduler pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
```
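The contents of `setup/coscheduler-priority-patch.yaml` are authoritative; as a
rough illustration only, a patch of this kind typically raises the priority
class of the scheduler pods, along these lines (the concrete path and value
below are assumptions, not the shipped file):
```yaml
# Illustrative JSON patch (YAML form); the file in setup/ is authoritative
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical
```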

### OpenShift AI

Create OpenShift AI 2.10 subscription:
```sh
oc apply -f setup/mlbatch-subscription.yaml
```
Identify install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.10.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the
actual value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
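If you prefer not to copy the plan name by hand, the lookup and approval can be
scripted; a minimal sketch, assuming a single install plan in the namespace:
```sh
# Look up the install plan name and approve it
IP_NAME=$(oc get ip -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}')
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' "$IP_NAME"
```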
Create DSC Initialization:
```sh
oc apply -f setup/mlbatch-dsci.yaml
```
Create Data Science Cluster:
```sh
oc apply -f setup/mlbatch-dsc.yaml
```
The provided configuration differs from the default OpenShift AI configuration
as follows (an illustrative sketch of the Kueue settings follows the list):
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
- pod priorities, resource requests and limits have been adjusted.
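
For orientation, the sketch below shows roughly what the Kueue portion of these
settings amounts to. The actual manifest is generated and managed by the
OpenShift AI operator from `setup/mlbatch-dsc.yaml`; the field layout and the
exact integration list here are illustrative assumptions, not something to
apply by hand:
```yaml
# Illustrative sketch only; the operator manages the real Kueue configuration
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
manageJobsWithoutQueueName: true  # queue and quota even for unlabeled workloads
integrations:
  frameworks:                     # batch/job is intentionally not listed
  - kubeflow.org/pytorchjob
  - ray.io/rayjob
  - ray.io/raycluster
```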

To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
in OpenShift AI 2.10 installation), do a rolling restart of the Kueue manager.
```sh
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
```

After doing the restart, verify that you see the following lines in the
kueue-controller-manager's log:
```sh
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}
```
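One way to check, for example:
```sh
# Search the Kueue manager log for the PyTorchJob registration lines
oc logs deployment/kueue-controller-manager -n redhat-ods-applications | grep pytorchjob
```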

### Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup/default-flavor.yaml
```

### Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup/mlbatch-edit-role.yaml
```

## Project Setup

The project setup creates a project, a user group, a quota, a queue, and the
required role bindings.

Create project:
```sh
oc new-project team1
```
Create user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind cluster role to group in namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
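To verify the binding, you can impersonate one of the users added above (this
requires impersonation rights), for example:
```sh
# Should answer "yes" for members of team1-edit-group
oc auth can-i create appwrappers.workload.codeflare.dev -n team1 --as user1
```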
Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
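
For instance, with both limits uncommented, the `cpu` entry reads:
```yaml
      - name: "cpu"
        nominalQuota: 8000m
        borrowingLimit: 0
        lendingLimit: 0
```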

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue` as `AppWrappers` will
default to this queue name.
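
If the local queue is named differently, each workload must reference it
explicitly via Kueue's queue-name label; for example (with a hypothetical queue
named `team1-queue`):
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: team1-queue
```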

## Quota Maintenance

Kubernetes built-in `ResourceQuotas` should not be combined with Kueue quotas.

Kueue quotas can be adjusted after creation. Workloads that have already been
admitted are not impacted.
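
For example, to adjust a quota in place:
```sh
# ClusterQueues are cluster-scoped; edit the nominalQuota values as needed
oc edit clusterqueue team1-cluster-queue
```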

For Kueue quotas to be effective, the sum of all quotas for each managed
resource (`cpu`, `memory`, `nvidia.com/gpu`, `pods`) must be maintained to
remain less than or equal to the available cluster capacity for this resource.
Concretely, for a cluster with 256 NVIDIA GPUs dedicated to MLBatch users, the
cumulative `nominalQuota` for the `nvidia.com/gpu` resource should be 256 or
less. Quotas should be reduced when the available capacity is reduced, whether
because of failures or because resources are allocated to non-batch workloads.
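
As a sanity check, the cluster's GPU capacity can be totaled and compared with
the sum of the nominal quotas; a sketch, assuming `jq` is available and GPU
quotas are expressed as plain integers:
```sh
# Total allocatable NVIDIA GPUs across all nodes
oc get nodes -o json | jq '[.items[].status.allocatable["nvidia.com/gpu"] // "0" | tonumber] | add'
# Sum of nvidia.com/gpu nominal quotas across all cluster queues
oc get clusterqueues -o json | jq '[.items[].spec.resourceGroups[]?.flavors[].resources[] | select(.name == "nvidia.com/gpu") | .nominalQuota | tonumber] | add'
```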

To facilitate the necessary quota adjustments, one option is to set up a
dedicated cluster queue for slack capacity that other cluster queues can borrow
from. This queue should not be associated with any team, project, namespace, or
local queue. Its quota should be adjusted dynamically to reflect changes in
cluster capacity. If sized appropriately, this queue makes adjustments to the
other cluster queues unnecessary for small changes in cluster capacity.
Concretely, two teams could each be granted 45% of the cluster capacity, with
the remaining 10% set aside for this slack cluster queue. Any change in cluster
capacity of less than 10% can then be absorbed by adjusting the slack queue alone.
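
A minimal sketch of such a slack queue, assuming the 10% figure above
corresponds to, say, 24 GPUs (the name and quantities are placeholders to adjust
to your cluster):
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 24
      - name: "pods"
        nominalQuota: 100
EOF
```
No `LocalQueue` is created for this queue, so no workload can be submitted to
it; its quota exists only to be borrowed by the other queues in the cohort.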

Every resource name occurring in the resource requests or limits of a workload
must be covered by a cluster queue intended to admit the workload, even if the
requested resource count is zero. For example, a cluster queue must cover
`nvidia.com/roce_gdr`, possibly with an empty quota, to admit a `PyTorchJob`
requesting:
```yaml
    resources:
      requests:
        cpu: 1
        memory: 256Mi
        nvidia.com/roce_gdr: 0
      limits:
        cpu: 1
        memory: 256Mi
        nvidia.com/roce_gdr: 0
```
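Covering such a resource with an empty quota amounts to a one-line addition to
the cluster queue's resources list, for example:
```yaml
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 0
```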

## Cleanup

To uninstall the MLBatch controllers and reclaim the corresponding namespaces,
run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
  dscinitializations.dscinitialization.opendatahub.io \
  datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```