# ElasticDL on Public Cloud

ElasticDL is a Kubernetes-native machine learning framework. This document explains how to run an ElasticDL job on a public cloud, namely, Google Kubernetes Engine (GKE).

## Configure GKE Environment
### Create a Project and a Kubernetes Cluster

First, we create a new project for elasticdl in the [web console](https://console.cloud.google.com/) and a new Kubernetes cluster under this project.

We will use the project ID and cluster name in the next steps.
### Access the Kubernetes Cluster

To access GKE, we need to install the [Google Cloud SDK](https://cloud.google.com/sdk/install), which includes command-line tools like `gcloud`.

Step 1: Set the `PROJECT_ID` environment variable in the shell.

```bash
export PROJECT_ID=${your_project_id}
gcloud config set project ${PROJECT_ID}
```

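To double check that the right project is active, we can ask `gcloud` for the current setting:

```bash
gcloud config get-value project
```
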
Step 2: List the cluster info with `gcloud`, and double check it in the web console.

```bash
gcloud container clusters list
```

The following is our testing cluster:

```
NAME  LOCATION  MASTER_VERSION  MASTER_IP  MACHINE_TYPE  NODE_VERSION  NUM_NODES  STATUS
```

Make sure you have [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/) available locally.
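If `kubectl` is not yet pointed at the new cluster, one way to fetch the credentials is through `gcloud` (the cluster name and zone below are placeholders for your own values):

```bash
gcloud container clusters get-credentials ${your_cluster_name} --zone ${your_zone}
```
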
Use the following command to list all the started components.

```bash
kubectl get all --all-namespaces
```

### Configure the Kubernetes Cluster

ElasticDL jobs require pod creation and deletion permissions. Make sure you have granted the related permissions to the default or other related service accounts.

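The exact way to grant them depends on the cluster setup; as one possible example, the following binds the built-in `edit` cluster role to the default service account of the `default` namespace, which covers pod creation and deletion (the binding name is arbitrary):

```bash
kubectl create clusterrolebinding elasticdl-default-edit \
  --clusterrole=edit \
  --serviceaccount=default:default
```
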
ElasticDL supports elastic scheduling and works well with the priority-based scheduling of Kubernetes. We create two customized PriorityClass objects in the cluster, high and low; the high class only needs a larger `value` than the low one.

high.yaml

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000  # example value; any value greater than that of the low class works
globalDefault: false
```

low.yaml

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 1000
globalDefault: false
```

```bash
kubectl create -f high.yaml
kubectl create -f low.yaml
```

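Both classes should now show up when listing the cluster's priority classes:

```bash
kubectl get priorityclass
```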
### Mount a Volume for the Kubernetes Cluster

First, we create a [Cloud Filestore](https://cloud.google.com/filestore) instance in the web console.

Then we follow the [doc](https://cloud.google.com/filestore/docs/accessing-fileshares) to access fileshares from the Kubernetes cluster.

In this example, we create a persistent volume claim named `fileserver-claim`.
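
As an illustrative sketch only, a claim that binds to a pre-created `PersistentVolume` for the Filestore share might look like the following (the `volumeName` and storage size are assumptions):

```yaml
# fileserver-claim.yaml (illustrative; volumeName and storage size are assumptions)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: fileserver
  resources:
    requests:
      storage: 1Ti
```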
Step 1: We generate MNIST training and evaluation data in RecordIO format.

Step 2: We launch a pod which mounts the volume and use the `kubectl cp` command to copy data from local to the volume.

```bash
kubectl create -f my-pod.yaml
kubectl cp mnist my-pod:/data
```

my-pod.yaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  # ...
spec:
  # ...
```

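A minimal, illustrative sketch of a complete `my-pod.yaml` that mounts the `fileserver-claim` volume at `/data` could look like the following; the container image, command, and volume entry name are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-pod
      image: ubuntu                  # any image that keeps the pod running works
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /data           # target directory for `kubectl cp`
          name: fileserver
  volumes:
    - name: fileserver
      persistentVolumeClaim:
        claimName: fileserver-claim
```

Once the pod is running and the copy has finished, `kubectl exec my-pod -- ls /data` should list the copied `mnist` directory.
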
### Submit Job

Please refer to the [elasticdl_local tutorial](./elasticdl_local.md) to build the `elasticdl:ci` image. The difference is that we have to push the image to the Google Cloud container registry. We use the following command to set up authentication:

```bash
gcloud auth configure-docker
```

We launch a training job with 2 PS pods and 4 worker pods. The master pod and PS pods are set with high priority, while worker pods are set with low priority. The training Docker image will be pushed to the Google Cloud container registry.

ElasticDL supports fault tolerance in distributed training. When a worker pod is killed, the training job does not crash, and the master pod will try to relaunch a new worker pod.