
Commit 313c6b0: Update the tutorial for elasticdl local run. (#2118)

* Update the tutorial for local run.
* Do some rephrasing.
* Update the image name.
* Update according to the comments.
* Update the Minikube version.

Parent: 1171ef0

File changed: docs/tutorials/elasticdl_local.md (45 additions, 28 deletions)
@@ -6,11 +6,13 @@ the working process of ElasticDL.

## Environment preparation

1. Install Minikube >= v1.11.0. Please refer to the official
   [installation guide](https://kubernetes.io/docs/tasks/tools/install-minikube/).
   In this tutorial, we use [hyperkit](https://github.com/moby/hyperkit) as the
   hypervisor of Minikube.
1. Install [Docker CE >= 18.x](https://docs.docker.com/docker-for-mac/install/)
   for building the Docker images of the distributed ElasticDL jobs.
1. Install Python >= 3.6.

## Write model file
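As an aside, the prerequisites above carry minimum versions. A tiny sketch of the numeric version comparison a setup-check script could perform (`version_tuple` is a hypothetical helper, not part of ElasticDL):

```python
def version_tuple(v):
    # Turn "v1.11.0" into (1, 11, 0) so versions compare numerically,
    # not lexically ("v1.9.0" would otherwise sort above "v1.11.0").
    return tuple(int(part) for part in v.lstrip("v").split("."))

# Minikube v1.11.0 is the minimum required by this tutorial.
MINIMUM = version_tuple("v1.11.0")
print(version_tuple("v1.12.3") >= MINIMUM)  # True
```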

@@ -20,48 +22,63 @@ we use a model predefined in model zoo directory.

## Submit Job to Minikube

### Install ElasticDL Client

```bash
pip install elasticdl_client
```

Clone the elasticdl repo for the model zoo and some helper scripts:

```bash
git clone https://github.com/sql-machine-learning/elasticdl.git
```

### Set up the Kubernetes environment

```bash
export DATA_PATH={a_folder_path_to_store_training_data}
minikube start --vm-driver=hyperkit --cpus 2 --memory 6144 --disk-size=50gb --mount=true --mount-string="$DATA_PATH:/data"
cd elasticdl
kubectl apply -f elasticdl/manifests/elasticdl-rbac.yaml
eval $(minikube docker-env)
```

The `--mount-string` option mounts the host path `$DATA_PATH` to `/data` inside the Minikube VM, and `eval $(minikube docker-env)` points the local Docker client at Minikube's Docker daemon so that locally built images are visible to the cluster.
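For scripting the setup, the `minikube start` invocation above can also be composed programmatically. A minimal sketch (the helper and its defaults are our own, mirroring the flags used in this tutorial):

```python
def minikube_start_cmd(data_path, cpus=2, memory_mb=6144, disk_size="50gb"):
    # Build the argv list for the `minikube start` call used in this tutorial;
    # --mount-string maps the host data folder to /data inside the VM.
    return [
        "minikube", "start",
        "--vm-driver=hyperkit",
        f"--cpus={cpus}",
        f"--memory={memory_mb}",
        f"--disk-size={disk_size}",
        "--mount=true",
        f"--mount-string={data_path}:/data",
    ]

cmd = minikube_start_cmd("/tmp/elasticdl-data")
print(" ".join(cmd))
```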

### Build the Docker image for distributed training

```bash
cd model_zoo
elasticdl zoo init
elasticdl zoo build --image=elasticdl:mnist .
```

We use a model predefined in the model zoo directory. The model definition will
be packed into the new Docker image `elasticdl:mnist`.
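The model packed into the image is later selected with a dotted `--model_def` path such as `mnist_functional_api.mnist_functional_api.custom_model`. A rough sketch of how such a path splits into a module and an attribute (illustrative only; the actual ElasticDL loader may differ):

```python
def split_model_def(model_def):
    # Everything before the last dot names the module inside the model zoo;
    # the final component is the attribute (here, a model-building function).
    module_path, _, attribute = model_def.rpartition(".")
    return module_path, attribute

module, attr = split_model_def(
    "mnist_functional_api.mnist_functional_api.custom_model")
print(module, attr)
```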
### Prepare the dataset

We generate the MNIST training and evaluation data in RecordIO format, using a
script provided in the elasticdl repo:

```bash
docker pull elasticdl/elasticdl:dev
cd {elasticdl_repo_root}
docker run --rm -it \
  -v $HOME/.keras/datasets:/root/.keras/datasets \
  -v $PWD:/work \
  -w /work elasticdl/elasticdl:dev \
  bash -c "scripts/gen_dataset.sh $DATA_PATH"
```

### Submit a training job

We use the following command to submit a training job (`...` marks flags not
shown in this diff):

```bash
elasticdl train \
  --image_name=elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist_functional_api.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  ...
  --job_name=test-mnist \
  --log_level=INFO \
  --image_pull_policy=Never \
  --volume="/data,mount_path=/data" \
  --distribution_strategy=ParameterServerStrategy
```

`image_name` is the Docker image name for the distributed ElasticDL job. We built
it with the `elasticdl zoo build` command above.

The directory that stores the training and validation data was mounted into
Minikube in the previous step; the `--volume` flag then mounts it at `/data`
inside each pod.
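The `--volume` flag packs a host path and options into one comma-separated string. A small sketch of parsing that format (`parse_volume_spec` is a hypothetical helper for illustration, not the client's actual parser):

```python
def parse_volume_spec(spec):
    # "/data,mount_path=/data" -> ("/data", {"mount_path": "/data"})
    claim, *raw_opts = spec.split(",")
    options = dict(opt.split("=", 1) for opt in raw_opts)
    return claim, options

claim, options = parse_volume_spec("/data,mount_path=/data")
print(claim, options)
```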

In this example, we use the parameter server strategy. We launch a master pod, a
parameter server (PS) pod, and a worker pod. The worker pod gets model parameters
@@ -126,7 +143,7 @@ kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"

We will see logs like the following:

```txt
[2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687
[2020-04-14 02:46:28,920] [INFO] [worker.py:879:_process_minibatch] Loss is 9.413976669311523
[2020-04-14 02:46:29,120] [INFO] [worker.py:879:_process_minibatch] Loss is 3.9641590118408203
```
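To watch training converge, the loss values can be pulled out of such lines programmatically. A quick sketch (the regex is assumed from the log format shown above):

```python
import re

LOSS_RE = re.compile(r"Loss is ([0-9.]+)")

def extract_losses(log_lines):
    # Keep only lines that report a minibatch loss and return the values.
    return [float(m.group(1)) for line in log_lines
            if (m := LOSS_RE.search(line)) is not None]

sample = [
    "[2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687",
    "[2020-04-14 02:46:28,920] [INFO] [worker.py:879:_process_minibatch] Loss is 9.413976669311523",
]
print(extract_losses(sample))  # [3.07190203666687, 9.413976669311523]
```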
@@ -164,7 +181,7 @@ kubectl logs elasticdl-test-mnist-master | grep "Evaluation"

We will see logs like the following:

```txt
[2020-04-14 02:46:21,836] [INFO] [master.py:192:prepare] Evaluation service started
[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}
[2020-04-14 02:46:53,827] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=100]: {'accuracy': 0.5173333}
```
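The evaluation lines embed a Python-literal dict of metrics. A sketch of parsing them back out (format assumed from the sample above; not an ElasticDL API):

```python
import ast
import re

METRIC_RE = re.compile(r"Evaluation metrics\[v=(\d+)\]: (\{.*\})")

def parse_metrics(line):
    # Return (model_version, metrics_dict) for an evaluation log line,
    # or None if the line does not report metrics.
    m = METRIC_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), ast.literal_eval(m.group(2))

line = ("[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:"
        "complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}")
print(parse_metrics(line))  # (50, {'accuracy': 0.21933334})
```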
