Commit ca8e541

Initial import

19 files changed: +1291 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
```
.vscode
```

.gitmodules

Lines changed: 4 additions & 0 deletions
```
[submodule "scheduler-plugins"]
    path = scheduler-plugins
    url = https://github.com/kubernetes-sigs/scheduler-plugins.git
    branch = release-1.28
```

README.md

Lines changed: 34 additions & 0 deletions
# MLBatch

This repository describes the [setup](SETUP.md) and [use](USAGE.md) of the
MLBatch queuing and quota management system on OpenShift clusters. MLBatch
leverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training
Operator](https://www.kubeflow.org/docs/components/training/),
[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the
[Codeflare Operator](https://github.com/project-codeflare/codeflare-operator)
from [Red Hat OpenShift
AI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).
MLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)
and adds
[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).
MLBatch includes a number of configuration steps to help these components work
in harmony and support large workloads on large clusters.

MLBatch handles the queuing and dispatching of batch workloads on OpenShift
clusters. It enforces team quotas at the namespace level. It automates the
borrowing and reclamation of unused quotas across teams. Teams can use
priorities within their namespaces without impacting other teams. Using
AppWrappers to submit workloads activates a number of fault detection and
recovery capabilities, including automatically detecting failed pods and
automatically retrying failed workloads. Coscheduler supports gang scheduling
and minimizes fragmentation by preferentially packing jobs that require less
than a full node's worth of GPUs together.

## Cluster Setup

To learn how to set up MLBatch on a cluster and onboard teams, see
[SETUP.md](SETUP.md).

## Quick Start

To learn how to use MLBatch to run workloads, see [USAGE.md](USAGE.md).

SETUP.md

Lines changed: 285 additions & 0 deletions
# MLBatch Setup

The MLBatch setup consists of a [cluster setup](#cluster-setup) to be done once
and a [project setup](#project-setup) to be repeated for each team. This
document also discusses [quota maintenance](#quota-maintenance).

Batch users should only be permitted to create AppWrappers or workloads whose
types are natively supported by Kueue. The provided `mlbatch-edit` role permits
the creation of `PyTorchJobs`, `RayJobs`, `RayClusters`, and `AppWrappers`.
Kueue at this time has no mechanism for granular quota enforcement for `Jobs`,
i.e., no mechanism to enforce quotas only on user-submitted `Jobs` without
impacting OpenShift-internal `Jobs`. As a consequence, MLBatch disables queuing
and quota management for `Jobs`, and the `mlbatch-edit` role does not give
permission to create `Jobs`. While `Jobs`, `Pods`, and `Deployments` cannot be
created by MLBatch users directly, `AppWrappers` can easily wrap and bundle
resources of these types. See [USAGE.md](USAGE.md) for examples.
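For illustration only, here is a minimal sketch of an `AppWrapper` wrapping a
single `Pod` (assuming the `workload.codeflare.dev/v1beta2` API; the names and
image are placeholders, and the examples in [USAGE.md](USAGE.md) are
authoritative):
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-wrapper          # placeholder name
  namespace: team1              # a team namespace created during project setup
spec:
  components:
  - template:                   # the wrapped resource; Pods, Deployments, and Jobs can be bundled this way
      apiVersion: v1
      kind: Pod
      metadata:
        name: sample-pod
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: busybox        # placeholder image
          command: ["sh", "-c", "sleep 10"]
          resources:
            requests:
              cpu: 1
              memory: 256Mi
            limits:
              cpu: 1
              memory: 256Mi
```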

This setup has been developed on OpenShift 4.14 and is intended to support
OpenShift 4.12 and up.

To start with, recursively clone and enter this repository:
```sh
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
```

## Cluster Setup

The cluster setup installs OpenShift AI and Coscheduler, and configures Kueue,
cluster roles, and priority classes.

If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
[MCAD](https://github.com/project-codeflare/mcad), OpenShift AI, or Coscheduler,
make sure to scrub traces of these installations. In particular, make sure to
delete the following custom resource definitions (CRDs) if present on the
cluster. Make sure to delete all instances prior to deleting the CRDs:
```sh
# Delete old appwrappers and crd
oc delete appwrappers --all -A
oc delete crd appwrappers.workload.codeflare.dev

# Delete old noderesourcetopologies and crd
oc delete noderesourcetopologies --all -A
oc delete crd noderesourcetopologies.topology.node.k8s.io
```

### Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup/mlbatch-priorities.yaml
```
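The priority class manifest is provided by the repository; purely to illustrate
the shape of such a definition, a hypothetical entry might look like this (the
`value` is an assumption, not the repository's actual setting):
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority            # the setup also defines default-priority and high-priority
value: 1                        # assumed value; higher values preempt lower ones
globalDefault: false
description: "Illustrative low-priority class for MLBatch workloads."
```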

### Coscheduler

Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
    scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
    --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
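The `--set-json` value above configures the `NodeResourcesFit` scoring plugin.
Rendered as plain YAML for readability (the same settings, not an additional
file to apply), the `pluginConfig` entry reads:
```yaml
pluginConfig:
- name: NodeResourcesFit
  args:
    scoringStrategy:
      type: RequestedToCapacityRatio
      resources:
      - name: nvidia.com/gpu
        weight: 1
      requestedToCapacityRatio:
        shape:                  # score rises with utilization, i.e., prefer already-busy nodes
        - utilization: 0
          score: 0
        - utilization: 100
          score: 10
```
Scoring nodes higher as their GPU utilization increases packs jobs that need
less than a full node's worth of GPUs onto the same nodes, which minimizes GPU
fragmentation.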
Patch Coscheduler pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
```

### OpenShift AI

Create OpenShift AI 2.10 subscription:
```sh
oc apply -f setup/mlbatch-subscription.yaml
```
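The subscription manifest ships with the repository; as a rough sketch of an
OLM subscription with manual approval (the channel and catalog source below are
assumptions, not the repository's actual values):
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: stable                    # assumed channel
  source: redhat-operators           # assumed catalog source
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual        # manual approval matches the next step
  startingCSV: rhods-operator.2.10.0
```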
Identify the install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.10.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the
actual value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
Create the DSC Initialization:
```sh
oc apply -f setup/mlbatch-dsci.yaml
```
Create the Data Science Cluster:
```sh
oc apply -f setup/mlbatch-dsc.yaml
```
The provided configuration differs from the default OpenShift AI configuration
as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
- pod priorities, resource requests, and limits have been adjusted.
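In terms of Kueue's own configuration API, the two Kueue settings above
correspond roughly to the following excerpt (an illustrative sketch, not the
exact configuration shipped by OpenShift AI):
```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
manageJobsWithoutQueueName: true     # manage workloads even if they carry no queue label
integrations:
  frameworks:                        # "batch/job" is intentionally omitted, disabling that integration
  - "kubeflow.org/pytorchjob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
```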

To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
in OpenShift AI 2.10 installation), do a rolling restart of the Kueue manager:
```sh
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
```

After doing the restart, verify that you see the following lines in the
kueue-controller-manager's log:
```sh
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}
```

### Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup/default-flavor.yaml
```
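The default flavor referenced here, and by the `ClusterQueue` quotas below, is
simply an empty `ResourceFlavor`; a sketch of what `setup/default-flavor.yaml`
presumably contains:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
```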

### Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup/mlbatch-edit-role.yaml
```
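Conceptually, this is a `ClusterRole` granting edit rights on the workload
types listed at the top of this document; a rough sketch (the rules in
`setup/mlbatch-edit-role.yaml` are authoritative):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mlbatch-edit
rules:
- apiGroups: ["workload.codeflare.dev"]
  resources: ["appwrappers"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
- apiGroups: ["ray.io"]
  resources: ["rayjobs", "rayclusters"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
```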

## Project Setup

The project setup creates a project, a user group, a quota, a queue, and the
required role bindings.

Create the project:
```sh
oc new-project team1
```
Create the user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind the cluster role to the group in the namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue` as `AppWrappers` will
default to this queue name.
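Workloads other than `AppWrappers`, such as a `PyTorchJob`, select this queue
with the standard Kueue queue label, for example:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: default-queue
```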

## Quota Maintenance

Kubernetes built-in `ResourceQuotas` should not be combined with Kueue quotas.

Kueue quotas can be adjusted post creation. Workloads already admitted are not
impacted.
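For example, to review a cluster queue's quota and current usage and then
adjust it in place (standard `oc` verbs against the Kueue custom resources):
```sh
# Show nominal quotas, admitted workloads, and current usage
oc describe clusterqueue team1-cluster-queue

# Adjust the nominalQuota values in place
oc edit clusterqueue team1-cluster-queue
```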

For Kueue quotas to be effective, the sum of all quotas for each managed
resource (`cpu`, `memory`, `nvidia.com/gpu`, `pods`) must remain less than or
equal to the available cluster capacity for this resource. Concretely, for a
cluster with 256 NVIDIA GPUs dedicated to MLBatch users, the cumulative
`nominalQuota` for the `nvidia.com/gpu` resource should be 256 or less. Quotas
should be reduced when the available capacity is reduced, whether because of
failures or due to the allocation of resources to non-batch workloads.

To facilitate the necessary quota adjustments, one option is to set up a
dedicated cluster queue for slack capacity that other cluster queues can borrow
from. This queue should not be associated with any team, project, namespace, or
local queue. Its quota should be adjusted dynamically to reflect changes in
cluster capacity. If sized appropriately, this queue will make adjustments to
other cluster queues unnecessary for small changes in cluster capacity.
Concretely, two teams could each be granted 45% of the cluster capacity, with
10% of the capacity set aside for this extra cluster queue. Any change in
cluster capacity below 10% can then be handled by adjusting the latter.
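A minimal sketch of such a slack queue, assuming the same cohort and flavor as
above (the name and quota values are illustrative):
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue      # hypothetical name; deliberately not bound to any LocalQueue
spec:
  namespaceSelector: {}
  cohort: default-cohort         # same cohort, so team queues can borrow from it
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 26         # e.g. roughly 10% of a 256-GPU cluster; adjust as capacity changes
      - name: "pods"
        nominalQuota: 10
```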

Every resource name occurring in the resource requests or limits of a workload
must be covered by a cluster queue intended to admit the workload, even if the
requested resource count is zero. For example, a cluster queue must cover
`nvidia.com/roce_gdr`, possibly with an empty quota, to admit a `PyTorchJob`
requesting:
```yaml
resources:
  requests:
    cpu: 1
    memory: 256Mi
    nvidia.com/roce_gdr: 0
  limits:
    cpu: 1
    memory: 256Mi
    nvidia.com/roce_gdr: 0
```

## Cleanup

To uninstall the MLBatch controllers and reclaim the corresponding namespaces,
run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
    dscinitializations.dscinitialization.opendatahub.io \
    datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```
