Commit b3d5af1

KPostOffice authored and openshift-merge-robot committed

add new document for v2 quickstart

Signed-off-by: Kevin <[email protected]>

1 parent df7a020 commit b3d5af1

File tree

4 files changed: +172 −3 lines changed

Quick-Start-ODH-V2.md

Lines changed: 141 additions & 0 deletions

@@ -0,0 +1,141 @@
# Quick Start Guide for Distributed Workloads with the CodeFlare Stack

This quick start guide walks existing Open Data Hub users through installing the CodeFlare stack and running an initial demo with the CodeFlare-SDK from a Jupyter notebook environment. This enables users to run and submit distributed workloads.

The CodeFlare-SDK was built to make managing distributed compute infrastructure in the cloud easy and intuitive for data scientists. That means, however, that some cloud infrastructure must exist on the backend for users to get the benefit of using the SDK. Currently, we support the CodeFlare stack, which consists of the open source projects [MCAD](https://github.com/project-codeflare/multi-cluster-app-dispatcher), [InstaScale](https://github.com/project-codeflare/instascale), [Ray](https://www.ray.io/), and [PyTorch](https://pytorch.org/).

This stack integrates well with [Open Data Hub](https://opendatahub.io/) and helps bring batch workloads, jobs, and queuing to the data science platform.
## Prerequisites

### Resources

In addition to the resources required by default ODH deployments, you will need the following to deploy the Distributed Workloads stack infrastructure pods:

```text
Total:
    CPU: 4100m
    Memory: 4608Mi

# By component
Ray:
    CPU: 100m
    Memory: 512Mi
MCAD:
    CPU: 2000m
    Memory: 2Gi
InstaScale:
    CPU: 2000m
    Memory: 2Gi
```
NOTE: The above resources cover only the infrastructure pods. To run actual workloads on your cluster, you will need additional resources based on the size and type of each workload.
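As a quick sanity check, the per-component requests above can be summed in the shell to confirm the stated totals:

```shell
# Sum the per-component requests to confirm the totals above.
# Memory is normalized to Mi (1Gi = 1024Mi).
cpu_total=$((100 + 2000 + 2000))   # Ray + MCAD + InstaScale, in millicores
mem_total=$((512 + 2048 + 2048))   # 512Mi + 2Gi + 2Gi, in Mi
echo "CPU: ${cpu_total}m  Memory: ${mem_total}Mi"
# → CPU: 4100m  Memory: 4608Mi
```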
### OpenShift and Open Data Hub

This quick start guide assumes that you have administrator access to an OpenShift cluster and that an existing Open Data Hub (ODH) installation of version **~2.Y** is present on your cluster. More information about ODH can be found [here](https://opendatahub.io/docs/quick-installation/). In short, the installation step is:

- Using the OpenShift UI, navigate to Operators --> OperatorHub, search for `Open Data Hub Operator`, and install it using the `fast` channel. (It should be version 2.Y.Z.)
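To confirm the operator installed successfully, you can check its ClusterServiceVersion from the CLI. This is a sketch: the exact CSV name and namespace depend on how and where ODH was installed.

```shell
# List ClusterServiceVersions and look for the Open Data Hub operator.
# The operator is ready once its PHASE column reports "Succeeded".
oc get csv -n openshift-operators | grep -i opendatahub
```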
### CodeFlare Operator

The CodeFlare operator must be installed from the OperatorHub on your OpenShift cluster. The default settings will suffice.
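You can verify that the operator came up with something like the following (the exact pod name is version-dependent, so the `grep` pattern here is an assumption):

```shell
# Check that the CodeFlare operator pod is running in the operators namespace.
oc get pods -n openshift-operators | grep codeflare-operator
```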
### NFD and GPU Operators

If you want to run GPU-enabled workloads, you will need to install the [Node Feature Discovery Operator](https://github.com/openshift/cluster-nfd-operator) and the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) from the OperatorHub. For instructions on how to install and configure these operators, we recommend [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/steps-overview.html#high-level-steps).
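Once both operators are installed and configured, GPUs should appear in each node's allocatable resources. A quick way to check, as a sketch using JSONPath:

```shell
# Print each node's name and its allocatable NVIDIA GPU count.
# Nodes without GPUs will show an empty count.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```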
## Creating K8s resources

1. Create the opendatahub namespace with the following command:

```bash
oc create ns opendatahub
```

1. Create a DataScienceCluster with CodeFlare and Ray enabled:

```bash
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-dsc.yaml
```
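After applying the manifest, you can confirm the DataScienceCluster was created and inspect its reconciliation status. This is a sketch; the exact status fields vary across operator versions.

```shell
# Confirm the DataScienceCluster exists and check its status conditions.
oc get dsc example-dsc
oc describe dsc example-dsc | grep -A5 Conditions
```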
Applying the above DataScienceCluster will result in the following objects being added to your cluster:

1. MCAD
1. InstaScale
1. KubeRay Operator
1. CodeFlare Notebook Image for the Open Data Hub notebook interface

This image is managed by project CodeFlare and contains the correct versions of codeflare-sdk, pytorch, torchx, etc. required to run distributed workloads.
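You can verify that these components came up by listing the pods in the opendatahub namespace (the MCAD, InstaScale, and KubeRay operator pods should all reach `Running`; exact pod names depend on the operator version):

```shell
# List the Distributed Workloads infrastructure pods.
oc get pods -n opendatahub
```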
At this point you should be able to go to your notebook spawner page, select "Codeflare Notebook" from your list of notebook images, and start an instance.

You can access the spawner page through the Open Data Hub dashboard. The default route should be `https://odh-dashboard-<your ODH namespace>.apps.<your cluster's uri>`. Once you are on your dashboard, select "Launch application" on the Jupyter application. This will take you to your notebook spawner page.
### Using an OpenShift Dedicated or ROSA Cluster

If you are using an OpenShift Dedicated or ROSA cluster, you will need to create a secret in the opendatahub namespace containing your OCM token. You can find your token [here](https://console.redhat.com/openshift/token). In the OpenShift console, navigate to Workloads -> Secrets, click Create, and choose a key/value secret. Use the secret name `instascale-ocm-secret` with key `token` and your OCM token as the value, then click Create.

<img src="images/instascale-ocm-secret.png" width="80%" height="80%">
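Equivalently, the same secret can be created from the CLI. This is a sketch; substitute your actual OCM token for the placeholder before running it.

```shell
# Create the InstaScale OCM secret in the opendatahub namespace.
# Replace <ocm token> with the token from console.redhat.com.
oc create secret generic instascale-ocm-secret \
  --from-literal=token='<ocm token>' \
  -n opendatahub
```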
## Submit your first job

We can now go ahead and submit our first distributed model training job to our cluster.

This can be done from any Python-based environment, including a script or a Jupyter notebook. For this guide, we'll assume you've selected the "Codeflare Notebook" from the list of available images on your notebook spawner page.
### Clone the demo code

Once your notebook environment is ready, we will test the CodeFlare stack by running through some of the demo notebooks provided by the CodeFlare community. Start by cloning their repo into your working environment:

```bash
git clone https://github.com/project-codeflare/codeflare-sdk
cd codeflare-sdk
```
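Before running the demos, it can be worth confirming that the codeflare-sdk package is available in the notebook environment. This is a sketch; the package is expected to ship preinstalled in the Codeflare Notebook image.

```shell
# Show the installed codeflare-sdk package name and version.
pip show codeflare-sdk | head -2
```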
### Run the Guided Demo Notebooks

There are a number of guided demos you can follow to become familiar with the CodeFlare-SDK and the CodeFlare stack. Navigate to `codeflare-sdk/demo-notebooks/guided-demos` to see and run the latest demos.
## Cleaning up the CodeFlare Install

To completely clean up all the CodeFlare components after an install, follow these steps:

1. Ensure no AppWrappers are left running:

```bash
oc get appwrappers -A
```

If any remain, delete them.

2. Remove the notebook and notebook PVC:

```bash
oc delete notebook jupyter-nb-kube-3aadmin -n opendatahub
oc delete pvc jupyterhub-nb-kube-3aadmin-pvc -n opendatahub
```

3. Remove the example DataScienceCluster (removes MCAD, InstaScale, KubeRay, and the notebook image):

```bash
oc delete dsc example-dsc
```

4. Remove the CodeFlare operator CSV and subscription (removes the CodeFlare operator from the OpenShift cluster):

```bash
oc delete sub codeflare-operator -n openshift-operators
oc delete csv $(oc get csv -n openshift-operators | grep codeflare-operator | awk '{print $1}') -n openshift-operators
```

5. Remove the CodeFlare CRDs:

```bash
oc delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com appwrappers.mcad.ibm.com quotasubtrees.ibm.com
```
## Next Steps

And with that you have gotten started using the CodeFlare stack alongside your Open Data Hub deployment to add distributed workloads and batch computing to your machine learning platform.

You are now ready to try out the stack with your own machine learning workloads. If you'd like some more examples, you can also run through the existing demo code provided by the CodeFlare-SDK community.

* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/2_basic_jobs.ipynb)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb)

Quick-Start.md

Lines changed: 2 additions & 2 deletions

@@ -163,5 +163,5 @@ And with that you have gotten started using the CodeFlare stack alongside your O

  You are now ready to try out the stack with your own machine learning workloads. If you'd like some more examples, you can also run through the existing demo code provided by the Codeflare-SDK community.

- * [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/guided-demos)
- * [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/additional-demos)
+ * [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/2_basic_jobs.ipynb)
+ * [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb)

README.md

Lines changed: 3 additions & 1 deletion

@@ -31,4 +31,6 @@ Integration of this stack into the Open Data Hub is owned by the Distributed Wor

  ## Quick Start

- Follow our quick start guide [here](/Quick-Start.md) to get up and running with Distributed Workloads on Open Data Hub.
+ Follow our quick start guide [here](/Quick-Start.md) to get up and running with Distributed Workloads on Open Data Hub.
+
+ For the V2 version of the ODH operator follow [this](/Quick-Start-ODH-V2.md) guide instead.

codeflare-dsc.yaml

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1alpha1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: example-dsc
spec:
  components:
    codeflare:
      enabled: true
    dashboard:
      enabled: true
    datasciencepipelines:
      enabled: false
    kserve:
      enabled: false
    modelmeshserving:
      enabled: false
    ray:
      enabled: true
    workbenches:
      enabled: true
```