|
1 | 1 | # ElasticDL on Local Environment
|
2 | 2 |
|
3 |
| -This document aims to give a simple example to show how to submit deep learning jobs to a local kubernetes cluster in a local computer. It helps to understand the working process of ElasticDL. |
| 3 | +This document aims to give a simple example to show how to submit deep learning |
| 4 | +jobs to a local kubernetes cluster in a local computer. It helps to understand |
| 5 | +the working process of ElasticDL. |
4 | 6 |
|
5 | 7 | ## Environment preparation
|
6 | 8 |
|
7 |
| -Here we should install Minikube first. Please refer to the official [installation guide](https://kubernetes.io/docs/tasks/tools/install-minikube/). |
| 9 | +Here we should install Minikube first. Please refer to the official |
| 10 | +[installation guide](https://kubernetes.io/docs/tasks/tools/install-minikube/). |
8 | 11 |
|
9 |
| -In this tutorial, we use [hyperkit](https://github.com/moby/hyperkit) as the hypervisor of Minikube. |
| 12 | +In this tutorial, we use [hyperkit](https://github.com/moby/hyperkit) as the |
| 13 | +hypervisor of Minikube. |
10 | 14 |
|
11 | 15 | ## Write model file
|
12 | 16 |
|
13 |
| -We use TensorFlow Keras API to build our models. Please refer to this [tutorials](model_building.md) on model building for details. |
14 |
| - |
15 |
| -In this tutorial, we use a [model](https://github.com/sql-machine-learning/elasticdl/blob/develop/model_zoo/mnist_functional_api/mnist_functional_api.py) predefined in model zoo directory. |
| 17 | +We use the TensorFlow Keras API to build our models. Please refer to this |
| 18 | +[tutorial](model_building.md) on model building for details. In this tutorial, |
| 19 | +we use a model predefined in the model zoo directory. |
16 | 20 |
|
17 | 21 | ## Submit Job to Minikube
|
18 | 22 |
|
@@ -81,15 +85,25 @@ elasticdl train \
|
81 | 85 | --distribution_strategy=ParameterServerStrategy
|
82 | 86 | ```
|
83 | 87 |
|
84 |
| -`image_base` is the base docker image argument. A new image will be built based on it each time while submitting the Elastic job. |
| 88 | +`image_base` is the base docker image argument. A new image will be built based |
| 89 | +on it each time while submitting the Elastic job. |
85 | 90 |
|
86 |
| -We use the model predefined in model zoo directory. The model definition will be packed into the new docker image. The training and validation data are packaged to the base docker image already. We could use them directly. |
| 91 | +We use the model predefined in model zoo directory. The model definition will be |
| 92 | +packed into the new docker image. The training and validation data are packaged |
| 93 | +to the base docker image already. We could use them directly. |
87 | 94 |
|
88 |
| -In this example, we use parameter server strategy. We launch a master pod, a parameter server(PS) pod and a worker pod. The worker pod gets model parameters from the PS pod, computes gradients and sends computed gradients to the PS pod. The PS pod iteratively updates these model parameters using gradients sent by the worker pod. For more details about parameter server strategy, please refer to the [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md). |
| 95 | +In this example, we use parameter server strategy. We launch a master pod, a |
| 96 | +parameter server(PS) pod and a worker pod. The worker pod gets model parameters |
| 97 | +from the PS pod, computes gradients and sends computed gradients to the PS |
| 98 | +pod. The PS pod iteratively updates these model parameters using gradients sent |
| 99 | +by the worker pod. For more details about parameter server strategy, please |
| 100 | +refer to the [design |
| 101 | +doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md). |
89 | 102 |
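The pull/compute/push cycle described in the paragraph above can be illustrated with a minimal, single-process sketch. This is not ElasticDL's implementation: the `ParameterServer` and `Worker` classes, the `run` helper, the toy linear model, and the learning rate are all hypothetical names and values chosen for illustration, and the real pods communicate over the network rather than through direct method calls.

```python
class ParameterServer:
    """Hypothetical stand-in for the PS pod: stores parameters, applies gradients."""

    def __init__(self, params, learning_rate=0.05):
        self.params = dict(params)
        self.learning_rate = learning_rate

    def pull(self):
        # Return a copy so the worker cannot mutate the server state directly.
        return dict(self.params)

    def push(self, grads):
        # Plain SGD update using the gradients sent by the worker.
        for name, grad in grads.items():
            self.params[name] -= self.learning_rate * grad


class Worker:
    """Hypothetical stand-in for the worker pod: computes gradients on its data."""

    def __init__(self, ps, data):
        self.ps = ps
        self.data = data

    def train_step(self):
        # 1. Get the current model parameters from the PS.
        params = self.ps.pull()
        w, b = params["w"], params["b"]
        # 2. Compute gradients of the mean squared error for y = w * x + b.
        n = len(self.data)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in self.data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in self.data) / n
        # 3. Send the computed gradients back to the PS.
        self.ps.push({"w": grad_w, "b": grad_b})


def run(steps=500):
    # Toy data generated from y = 2x + 1 (an assumption for this sketch).
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    ps = ParameterServer({"w": 0.0, "b": 0.0})
    worker = Worker(ps, data)
    for _ in range(steps):
        worker.train_step()
    return ps.pull()


if __name__ == "__main__":
    print(run())
```

With one worker, the loop reduces to ordinary gradient descent. Keeping the parameters on a separate server means that further workers can be added, each pulling parameters and pushing gradients in the same way.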
|
90 | 103 | ### Check job status
|
91 | 104 |
|
92 |
| -After submitting the job to Minikube, we can run the following command to check the status of each pod: |
| 105 | +After submitting the job to Minikube, we can run the following command to check |
| 106 | +the status of each pod: |
93 | 107 |
|
94 | 108 | ```bash
|
95 | 109 | kubectl get pods
|
|