@@ -6,11 +6,13 @@ the working process of ElasticDL.

## Environment preparation

- Here we should install Minikube first. Please refer to the official
+ 1. Install Minikube >= v1.11.0. Please refer to the official
[installation guide](https://kubernetes.io/docs/tasks/tools/install-minikube/).
-
In this tutorial, we use [hyperkit](https://github.com/moby/hyperkit) as the
hypervisor of Minikube.
+ 1. Install [Docker CE >= 18.x](https://docs.docker.com/docker-for-mac/install/)
+    for building the Docker images of the distributed ElasticDL jobs.
+ 1. Install Python >= 3.6.

## Write model file
@@ -20,48 +22,63 @@ we use a model predefined in model zoo directory.

## Submit Job to Minikube

- ### Install ElasticDL
+ ### Install ElasticDL Client
+
+ ```bash
+ pip install elasticdl_client
+ ```
+
+ Clone the elasticdl repo for the model zoo and some scripts.

```bash
git clone https://github.com/sql-machine-learning/elasticdl.git
- cd elasticdl
- pip install -r elasticdl/requirements.txt
- python setup.py install
```

### Setup Kubernetes related environment

```bash
- minikube start --vm-driver=hyperkit --cpus 2 --memory 6144 --disk-size=20gb
+ export DATA_PATH={a_folder_path_to_store_training_data}
+ minikube start --vm-driver=hyperkit --cpus 2 --memory 6144 --disk-size=50gb --mount=true --mount-string="$DATA_PATH:/data"
+ cd elasticdl
kubectl apply -f elasticdl/manifests/elasticdl-rbac.yaml
eval $(minikube docker-env)
- export DOCKER_BUILDKIT=1
- export TRAVIS_BUILD_DIR=$PWD
- bash scripts/travis/build_images.sh
```

- ### Summit a training job
+ The `--mount-string` option above mounts the host path `$DATA_PATH` to `/data` inside Minikube.

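The mount string is ordinary shell text, so quoting matters: stray spaces inside the quotes would hand Minikube a malformed `host:guest` pair. A minimal sketch of how the value expands, using a hypothetical `DATA_PATH`:

```shell
# Hypothetical DATA_PATH; substitute your real host folder.
DATA_PATH=/tmp/elasticdl-data
# The value Minikube receives must be exactly "<host_path>:<guest_path>",
# with no spaces around the variable expansion.
mount_string="$DATA_PATH:/data"
echo "--mount-string=$mount_string"   # → --mount-string=/tmp/elasticdl-data:/data
```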
- There are other docker settings to configure before submitting the training job.
+ ### Build the Docker image for distributed training

- For example:
+ ```bash
+ cd model_zoo
+ elasticdl zoo init
+ elasticdl zoo build --image=elasticdl:mnist .
+ ```
+
+ We use the model predefined in the model zoo directory. The model definition will
+ be packed into the new Docker image `elasticdl:mnist`.
+
+ ### Prepare the dataset
+
+ We generate MNIST training and evaluation data in RecordIO format. We provide a
+ script in the elasticdl repo.

```bash
- export DOCKER_BASE_URL=tcp://192.168.64.5:2376
- export DOCKER_TLSCERT=${HOME}/.minikube/certs/cert.pem
- export DOCKER_TLSKEY=${HOME}/.minikube/certs/key.pem
+ docker pull elasticdl/elasticdl:dev
+ cd {elasticdl_repo_root}
+ docker run --rm -it \
+   -v $HOME/.keras/datasets:/root/.keras/datasets \
+   -v $PWD:/work \
+   -w /work elasticdl/elasticdl:dev \
+   bash -c "scripts/gen_dataset.sh $DATA_PATH"
```

- We can get these setting values by running `minikube docker-env`.
+ ### Submit a training job

We use the following command to submit a training job:

```bash
elasticdl train \
- --image_base=elasticdl:ci \
- --docker_base_url=${DOCKER_BASE_URL} \
- --docker_tlscert=${DOCKER_TLSCERT} \
- --docker_tlskey=${DOCKER_TLSKEY} \
+ --image_name=elasticdl:mnist \
--model_zoo=model_zoo \
--model_def=mnist_functional_api.mnist_functional_api.custom_model \
--training_data=/data/mnist/train \
@@ -82,15 +99,15 @@ elasticdl train \
82
99
--job_name=test-mnist \
83
100
--log_level=INFO \
84
101
--image_pull_policy=Never \
102
+ --volume=" /data,mount_path=/data" \
85
103
--distribution_strategy=ParameterServerStrategy
86
104
```
- `image_base` is the base docker image argument. A new image will be built based
- on it each time while submitting the ElasticDL job.
+ `image_name` is the Docker image name for the distributed ElasticDL job. We built
+ it using the `elasticdl zoo build` command above.

- We use the model predefined in model zoo directory. The model definition will be
- packed into the new docker image. The training and validation data are packaged
- to the base docker image already. We could use them directly.
+ The directory storing the training and validation data was mounted into Minikube
+ in the previous step. We then mount it at the path `/data` inside the pod.

In this example, we use the parameter server strategy. We launch a master pod, a
parameter server (PS) pod, and a worker pod. The worker pod gets model parameters
@@ -126,7 +143,7 @@ kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"

We will see the following logs:

- ```bash
+ ```txt
[2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687
[2020-04-14 02:46:28,920] [INFO] [worker.py:879:_process_minibatch] Loss is 9.413976669311523
[2020-04-14 02:46:29,120] [INFO] [worker.py:879:_process_minibatch] Loss is 3.9641590118408203
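To track training progress, the numeric loss values can be pulled out of such worker log lines. A minimal sketch using one of the lines above as sample input (with a live cluster you would pipe the `kubectl logs` output instead):

```shell
# Sample worker log line (from the output above); with a live cluster,
# pipe `kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"` instead.
sample='[2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687'
# The loss is the last whitespace-separated field of each matching line.
loss=$(printf '%s\n' "$sample" | awk '/Loss/ {print $NF}')
echo "$loss"   # → 3.07190203666687
```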
@@ -164,7 +181,7 @@ kubectl logs elasticdl-test-mnist-master | grep "Evaluation"

We will see the following logs:

- ```bash
+ ```txt
[2020-04-14 02:46:21,836] [INFO] [master.py:192:prepare] Evaluation service started
[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}
[2020-04-14 02:46:53,827] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=100]: {'accuracy': 0.5173333}
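The accuracy values can likewise be extracted from the master's evaluation logs. A minimal sketch using one of the lines above as sample input:

```shell
# Sample master log line (from the output above); with a live cluster,
# pipe `kubectl logs elasticdl-test-mnist-master | grep "Evaluation"` instead.
sample="[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}"
# Pull the number that follows 'accuracy':
acc=$(printf '%s\n' "$sample" | grep -o "'accuracy': [0-9.]*" | awk '{print $2}')
echo "$acc"   # → 0.21933334
```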