|                          |                                   |                                     |
| ------------------------ | --------------------------------- | ----------------------------------- |
| No change to the runtime | Uber Horovod                      | ElasticDL (early stage)             |
| Changes the runtime      | TensorFlow ps-based distribution  | TensorFlow distribution strategies  |

**Note that ElasticDL is still under active development, and we have not extensively tested it in production environments. We open sourced this early-stage project with the hope of encouraging further work on fault-tolerance and elastic scheduling from the community.**
## Main Features
### Elastic Scheduling and Fault-Tolerance
Through Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.
### TensorFlow 2.0 Eager Execution
A distributed deep learning framework needs to know local gradients before the model update. Eager Execution allows ElasticDL to do it without hacking into the graph execution process.

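As an illustration only (this is not ElasticDL's actual worker code), under eager execution a training step can read its local gradients as plain tensors with `tf.GradientTape` before any variable update:

```python
import tensorflow as tf

# A toy Keras model; ElasticDL accepts models defined with the Keras API.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()

features = tf.random.normal([8, 4])
labels = tf.random.normal([8, 1])

# With eager execution, local gradients are available to the framework before
# the model update, e.g. to push to a parameter server, without touching
# graph-mode internals.
with tf.GradientTape() as tape:
    loss = loss_fn(labels, model(features))
grads = tape.gradient(loss, model.trainable_variables)
print([g.shape for g in grads])
```
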
### Minimalist Interface
Given a model defined with the Keras API, train the model with a command line.

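For example, such a model is just an ordinary Keras model; a minimal sketch (the function name and architecture below are illustrative, not the model zoo's MNIST model):

```python
import tensorflow as tf

# An ordinary Keras model definition; ElasticDL takes such a model plus a
# command line to launch distributed training on Kubernetes.
def custom_model():
    inputs = tf.keras.Input(shape=(28, 28), name="image")
    x = tf.keras.layers.Flatten()(inputs)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```
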
ElasticDL will be integrated seamlessly with SQLFlow to connect SQL to distributed deep learning tasks:

```sql
SELECT * FROM employee LABEL income INTO my_elasticdl_model
```
## Quick Start
Please check out our [step-by-step tutorial](docs/tutorials/get_started.md) for running ElasticDL on a local laptop, an on-premises cluster, or a public cloud such as Google Kubernetes Engine.
## Background
TensorFlow has a native distributed computing feature that is fault-recoverable: if some processes fail, the distributed computing job fails, but we can restart the job and recover its status from the most recent checkpoint files.

ElasticDL, as an enhancement of TensorFlow's distributed training feature, supports fault-tolerance: if some processes fail, the job keeps running. Therefore, ElasticDL doesn't need to checkpoint or recover from checkpoints.

Fault-tolerance allows ElasticDL to work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks, during which the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.

ElasticDL's elastic scheduling comes from its Kubernetes-native design: it doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers; it also watches events like pod termination and reacts to them to realize fault-tolerance.

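To illustrate the idea, here is a sketch using the official Kubernetes Python client; it is not ElasticDL's actual master code, and the pod name and image are assumptions. A master-like process could start a worker pod and then watch pod events:

```python
from kubernetes import client, config, watch

config.load_kube_config()  # assumes kubectl is already configured for the cluster
v1 = client.CoreV1Api()

# Start one worker pod; the pod name and image below are illustrative.
worker_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="elasticdl-worker-0", labels={"app": "elasticdl"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(name="worker", image="elasticdl:dev")],
    ),
)
v1.create_namespaced_pod(namespace="default", body=worker_pod)

# Watch pod events; on deletion or failure (e.g. preemption), a fault-tolerant
# master would reassign the dead worker's tasks instead of failing the job.
for event in watch.Watch().stream(v1.list_namespaced_pod, namespace="default"):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```
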
In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require specific computing platforms like Kubernetes and allows TensorFlow to run on any platform.
### Development Docker Image

Note that since ElasticDL depends on TensorFlow, the base image must have TensorFlow installed.

If you have difficulties downloading from the main PyPI site or the Golang site, you can pass extra build arguments to `docker build`, using `EXTRA_PYPI_INDEX` for a PyPI mirror and `GO_MIRROR_URL` for a mirror of the Golang installation package:
```bash
docker build \
    --build-arg EXTRA_PYPI_INDEX=<your_pypi_mirror> \
    --build-arg GO_MIRROR_URL=<your_golang_mirror> \
    -f elasticdl/docker/Dockerfile .
```
To develop in the Docker container, mount your cloned `elasticdl` git repo directory (e.g. `EDL_REPO` below) to the `/elasticdl` directory in the container and start the container.

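A minimal sketch of such a command, assuming the development image built above is tagged `elasticdl:dev`:

```bash
# Mount the cloned repo into the container at /elasticdl and open a shell;
# the image tag elasticdl:dev is an assumption, not a documented tag.
docker run --rm -it -v $EDL_REPO:/elasticdl -w /elasticdl elasticdl:dev bash
```
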
The continuous integration Docker image contains everything from the development Docker image, plus the processed demo data in RecordIO format and the ElasticDL source code. It is used to run continuous integration with the latest version of the source code. In the repo's root directory, run the following command:
```bash
docker build \
    -f elasticdl/docker/Dockerfile .
```

### Pre-commit Check
We have set up pre-commit checks in the GitHub repo for pull requests, which can catch some Python style problems. However, to avoid waiting in the Travis CI queue, you can run the pre-commit checks locally:
```bash
docker run --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev bash -c "pre-commit run -a"  # image tag and pre-commit invocation here are assumptions
```

Note that some unit tests may require a running Kubernetes cluster to be available. To include those unit tests, run the following:
```bash
make -f elasticdl/Makefile && pytest elasticdl/python/tests
```
[MaxCompute](https://www.alibabacloud.com/product/maxcompute)-related tests require additional environment variables. To run those tests, execute the following:

This will train MNIST data with a model defined in [model_zoo/mnist_functional_api/mnist_functional_api.py](../model_zoo/mnist_functional_api/mnist_functional_api.py) for 2 epochs. Note that the master will save model checkpoints in a local directory `checkpoint_dir`.

If you run into issues related to proto definitions, please run the following command to build the latest proto components:
```bash
make -f elasticdl/Makefile
```
### Test with Kubernetes
We can also test an ElasticDL job in a Kubernetes cluster using the previously built [image](#development-docker-image).

First make sure the built image has been pushed to a Docker registry, and then run the following command to launch the job:
```bash
kubectl apply -f manifests/elasticdl.yaml
```
You might want to change the value of the `imagePullPolicy` property to `Always` or `Never` in your trial.

If you find a permission error in the main pod log, e.g., `"pods is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"pods\""`, you need to grant pod-related permissions to the default user.

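One way to grant such permissions on a throwaway test cluster is to bind a role to the default service account. The binding below is an illustrative sketch, not the project's documented manifest; prefer a narrowly scoped Role outside of test clusters:

```bash
# Give the default service account in the default namespace broad permissions;
# the binding name is illustrative.
kubectl create clusterrolebinding elasticdl-default-sa \
    --clusterrole=cluster-admin \
    --serviceaccount=default:default
```
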
All tests will be executed on [Travis CI](https://travis-ci.org/sql-machine-learning/elasticdl), which includes:

- Pre-commit checks
- Unit tests
- Integration tests

The unit tests and integration tests also contain tests running on a local Kubernetes cluster via [Minikube](https://kubernetes.io/docs/setup/learning-environment/minikube/) and tests that require data sources from [MaxCompute](https://www.alibabacloud.com/product/maxcompute). Please refer to the [Travis configuration file](../.travis.yml) for more details.

Note that tests related to MaxCompute will not be executed on pull requests created from forks, since the MaxCompute access information is secured on Travis and only those with write access can retrieve it. Developers who have write access to this repo are encouraged to submit pull requests from branches instead of forks if any code related to MaxCompute has been modified.

Also note that two integration test cases involve loading a checkpoint. It is not easy to generate checkpoints automatically during integration tests, so we currently keep a checkpoint file in the [test data folder](python/tests/testdata) of the ElasticDL GitHub repository and use it for integration tests. Thus, you need to regenerate this checkpoint file if your PR modifies the definition of the Model protocol buffer.

If you want to trigger Travis builds without submitting a pull request, you can do so by developing on a branch and adding the branch name to the list in the `branches` section of the [Travis configuration file](../.travis.yml). Note that you can also trigger Travis builds from forks, but this requires additional work such as activating Travis for the forked repo, and MaxCompute-related tests will be skipped as mentioned earlier.