
Commit 3411773

Reformat some README.md files (#2059)
1 parent a7bcee1 commit 3411773

File tree: 3 files changed (+167 lines, -75 lines)

README.md

Lines changed: 57 additions & 18 deletions
@@ -5,57 +5,96 @@

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![PyPI Status Badge](https://badge.fury.io/py/elasticdl.svg)](https://pypi.org/project/elasticdl/)

ElasticDL is a Kubernetes-native deep learning framework built on top of TensorFlow 2.0 that supports fault-tolerance and elastic scheduling.

|                          | TensorFlow 1.x graph mode        | TensorFlow 2.x eager execution     |
|--------------------------|----------------------------------|------------------------------------|
| No change to the runtime | Uber Horovod                     | ElasticDL (early stage)            |
| Changes the runtime      | TensorFlow ps-based distribution | TensorFlow distribution strategies |

**Note that ElasticDL is still under active development, and we have not extensively tested it in production environments. We open-sourced this early-stage project in the hope of encouraging further work on fault-tolerance and elastic scheduling from the community.**

## Main Features

### Elastic Scheduling and Fault-Tolerance

Through its Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.

### TensorFlow 2.0 Eager Execution

A distributed deep learning framework needs to know local gradients before the model update. Eager execution allows ElasticDL to obtain them without hacking into the graph execution process.

### Minimalist Interface

Given a model defined with the Keras API, train the model with a single command line:

```bash
elasticdl train \
    --model_def=mnist_functional_api.custom_model \
    --training_data=/mnist/train --output=output
```

### Integration with SQLFlow

ElasticDL will integrate seamlessly with SQLFlow, connecting SQL to distributed deep learning tasks:

```sql
SELECT * FROM employee LABEL income INTO my_elasticdl_model
```

## Quick Start

Please check out our [step-by-step tutorial](docs/tutorials/get_started.md) for running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such as Google Kubernetes Engine.

## Background

TensorFlow has a native distributed computing feature that is fault-recoverable: if some processes fail, the distributed computing job fails, but we can restart the job and recover its status from the most recent checkpoint files.

ElasticDL, as an enhancement of TensorFlow's distributed training feature, supports fault-tolerance: if some processes fail, the job keeps running. Therefore, ElasticDL doesn't need to write checkpoints or recover from them.

Fault-tolerance lets ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose a cluster has N GPUs and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks, during which the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
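
As a rough sanity check of this arithmetic (the 8-GPU cluster below is only an assumed example, not a figure from the text):

```bash
# Assumed example: a cluster with N=8 GPUs, one occupied by the small job.
N=8
echo "scale=3; 1 / $N" | bc               # without elastic scheduling: .125
echo "scale=3; (1 + ($N - 1)) / $N" | bc  # with elastic scheduling: 1.000
```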

ElasticDL's elastic scheduling comes from its Kubernetes-native design: it doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers; it also watches events such as pod termination and reacts to them to realize fault-tolerance.
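
For intuition, you can observe the same pod events from outside the job with `kubectl`; the label selector below is a hypothetical example, not a documented ElasticDL label:

```bash
# Stream pod lifecycle changes (Pending, Running, Terminating, ...) for a job's pods.
kubectl get pods -l app=elasticdl --watch
```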

In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require specific computing platforms like Kubernetes and allows TensorFlow to run on any platform.

## Development Guide

elasticdl/README.md

Lines changed: 77 additions & 38 deletions
@@ -2,9 +2,11 @@

## Development Docker Image

Note that Docker 17.05 or higher is required to build Docker images, because the Dockerfile uses a multi-stage build.

The development Docker image contains the dependencies for ElasticDL development. In the repo's root directory, run the following command:

```bash
docker build \
@@ -23,9 +25,12 @@ docker build \
    --build-arg BASE_IMAGE=tensorflow/tensorflow:2.1.0-gpu-py3 .
```

Note that since ElasticDL depends on TensorFlow, the base image must have TensorFlow installed.
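
A quick way to sanity-check a candidate base image (an illustrative snippet, not part of the build; it assumes the image is pullable from Docker Hub):

```bash
# Confirm TensorFlow is importable before passing the image as BASE_IMAGE.
docker run --rm tensorflow/tensorflow:2.1.0-gpu-py3 \
    python -c "import tensorflow as tf; print(tf.__version__)"
```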

If you have difficulty downloading from the main PyPI or Golang sites, you can pass extra build arguments to `docker build`: `EXTRA_PYPI_INDEX` for a PyPI mirror and `GO_MIRROR_URL` for a mirror of the Golang installation package:

```bash
docker build \
@@ -36,8 +41,9 @@ docker build \
    -f elasticdl/docker/Dockerfile .
```
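
Since the hunk above elides the middle of the command, one plausible full invocation might look like this (the mirror URLs and image tag are placeholders, substitute your own):

```bash
# Hypothetical example; EXTRA_PYPI_INDEX and GO_MIRROR_URL are the documented build args.
docker build \
    --build-arg EXTRA_PYPI_INDEX=https://mirrors.example.com/pypi/simple \
    --build-arg GO_MIRROR_URL=https://mirrors.example.com/golang \
    -t elasticdl:dev \
    -f elasticdl/docker/Dockerfile .
```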

To develop in the Docker container, run the following command to mount your cloned `elasticdl` Git repo directory (e.g., `EDL_REPO` below) to the `/elasticdl` directory in the container and start the container:

```bash
EDL_REPO=<your_elasticdl_git_repo>
@@ -49,7 +55,10 @@ docker run --rm -u $(id -u):$(id -g) -it \
```
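
The hunk clips the rest of the `docker run` command; a minimal sketch of what it might look like, assuming the mount target `/elasticdl` described above and the `elasticdl:dev` image built earlier:

```bash
# Sketch only; the actual command in the README is elided by the diff.
EDL_REPO=<your_elasticdl_git_repo>
docker run --rm -u $(id -u):$(id -g) -it \
    -v $EDL_REPO:/elasticdl \
    -w /elasticdl \
    elasticdl:dev bash
```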

## Continuous Integration Docker Image

The continuous integration Docker image contains everything from the development Docker image, plus processed demo data in RecordIO format and the ElasticDL source code. It is used to run continuous integration with the latest version of the source code. In the repo's root directory, run the following command:

```bash
docker build \
@@ -62,13 +71,14 @@ docker build \
```

### Pre-commit Check

We have set up pre-commit checks in the GitHub repo for pull requests, which can catch some Python style problems. However, to avoid waiting in the Travis CI queue, you can run the pre-commit checks locally:

```bash
docker run --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev \
    bash -c "pre-commit run -a"
```

### Unit Tests

@@ -89,22 +99,26 @@ docker run --rm -u $(id -u):$(id -g) -it \

```bash
    bash -c "make -f elasticdl/Makefile && K8S_TESTS=False pytest elasticdl/python/tests"
```

Note that some unit tests require a running Kubernetes cluster. To include those unit tests, run the following:

```bash
make -f elasticdl/Makefile && pytest elasticdl/python/tests
```

[MaxCompute](https://www.alibabacloud.com/product/maxcompute)-related tests require additional environment variables. To run those tests, execute the following:

```bash
docker run --rm -it -v $PWD:/edl_dir -w /edl_dir \
    -e MAXCOMPUTE_PROJECT=xxx \
    -e MAXCOMPUTE_AK=xxx \
    -e MAXCOMPUTE_SK=xxx \
    -e MAXCOMPUTE_ENDPOINT=xxx \
    elasticdl:dev bash -c "make -f elasticdl/Makefile && \
        K8S_TESTS=False pytest elasticdl/python/tests/odps_* \
        elasticdl/python/tests/data_reader_test.py"
```

### Test in Docker

@@ -144,48 +158,73 @@ docker run --net=host --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \

```bash
    --log_level=INFO"
```

This will train on MNIST data with the model defined in [model_zoo/mnist_functional_api/mnist_functional_api.py](../model_zoo/mnist_functional_api/mnist_functional_api.py) for two epochs. Note that the master will save model checkpoints in the local directory `checkpoint_dir`.

If you run into issues related to proto definitions, run the following command to build the latest protobuf components:

```bash
make -f elasticdl/Makefile
```

### Test with Kubernetes

We can also test an ElasticDL job in a Kubernetes cluster using the previously built [image](#development-docker-image).

First make sure the built image has been pushed to a Docker registry, and then run the following command to launch the job:

```bash
kubectl apply -f manifests/elasticdl.yaml
```

You might want to change the value of the `imagePullPolicy` property to `Always` or `Never` in your trial.

If you see a permission error in the main pod log, e.g., `"pods is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"pods\""`, you need to grant pod-related permissions to the default user:

```bash
kubectl apply -f manifests/examples/elasticdl-rbac.yaml
```

### Test on Travis CI

All tests will be executed on [Travis CI](https://travis-ci.org/sql-machine-learning/elasticdl), which includes:

- Pre-commit checks
- Unit tests
- Integration tests

The unit tests and integration tests also contain tests running on a local Kubernetes cluster via [Minikube](https://kubernetes.io/docs/setup/learning-environment/minikube/) and tests that require data sources from [MaxCompute](https://www.alibabacloud.com/product/maxcompute). Please refer to the [Travis configuration file](../.travis.yml) for more details.

Note that tests related to MaxCompute will not be executed on pull requests created from forks, since the MaxCompute access information is secured on Travis and only those who have write access can retrieve it. Developers who have write access to this repo are encouraged to submit pull requests from branches instead of forks if any code related to MaxCompute has been modified.

Also note that two integration test cases involve loading a checkpoint. It is not easy to generate checkpoints automatically during integration tests, so we keep a checkpoint file in the [test data folder](python/tests/testdata) of the ElasticDL GitHub repository and use it for the integration tests. Thus, you need to regenerate the checkpoint file if your PR modifies the definition of the Model protocol buffer.

If you want to trigger Travis builds without submitting a pull request, you can do so by developing on a branch and adding the branch name to the list in the `branches` section of the [Travis configuration file](../.travis.yml). Note that you can also trigger Travis builds from forks, but this requires additional work, such as activating Travis for the forked repo, and MaxCompute-related tests will be skipped, as mentioned earlier.
