|                          |                                   |                                     |
| ------------------------ | --------------------------------- | ----------------------------------- |
| No change to the runtime | Uber Horovod                      | ElasticDL (early stage)             |
| Changes the runtime      | TensorFlow ps-based distribution  | TensorFlow distribution strategies  |

**Note that ElasticDL is still under active development, and we have not extensively tested it in production environments. We open sourced this early-stage project with the hope of encouraging further work on fault-tolerance and elastic scheduling from the community.**
## Main Features
### Elastic Scheduling and Fault-Tolerance
Through Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.
### TensorFlow 2.0 Eager Execution
A distributed deep learning framework needs to know local gradients before the model update. Eager Execution allows ElasticDL to do it without hacking into the graph execution process.

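As an illustration only (this is not ElasticDL's actual worker code), under eager execution a training step can read its local gradients as plain tensors with `tf.GradientTape` before any variable update:

```python
import tensorflow as tf

# A toy Keras model; ElasticDL accepts models defined with the Keras API.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()

features = tf.random.normal([8, 4])
labels = tf.random.normal([8, 1])

# With eager execution, local gradients are available to the framework before
# the model update, e.g. to push to a parameter server, without touching
# graph-mode internals.
with tf.GradientTape() as tape:
    loss = loss_fn(labels, model(features))
grads = tape.gradient(loss, model.trainable_variables)
print([g.shape for g in grads])
```
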
### Minimalist Interface
Given a model defined with the Keras API, train the model with a command line.

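For example, such a model is just an ordinary Keras model; a minimal sketch (the function name and architecture below are illustrative, not the model zoo's MNIST model):

```python
import tensorflow as tf

# An ordinary Keras model definition; ElasticDL takes such a model plus a
# command line to launch distributed training on Kubernetes.
def custom_model():
    inputs = tf.keras.Input(shape=(28, 28), name="image")
    x = tf.keras.layers.Flatten()(inputs)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```
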
ElasticDL will be integrated seamlessly with SQLFlow to connect SQL to distributed deep learning tasks:

```sql
SELECT * FROM employee LABEL income INTO my_elasticdl_model
```
## Quick Start
Please check out our [step-by-step tutorial](docs/tutorials/get_started.md) for running ElasticDL on a local laptop, an on-premises cluster, or a public cloud such as Google Kubernetes Engine.
## Background
TensorFlow has a native distributed computing feature that is fault-recoverable: if some processes fail, the distributed computing job fails, but we can restart the job and recover its status from the most recent checkpoint files.

ElasticDL, as an enhancement of TensorFlow's distributed training feature, supports fault-tolerance: if some processes fail, the job keeps running. Therefore, ElasticDL doesn't need to checkpoint or recover from checkpoints.

Fault-tolerance allows ElasticDL to work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks, during which the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.

ElasticDL's elastic scheduling comes from its Kubernetes-native design: it doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers; it also watches events like pod termination and reacts to them to realize fault-tolerance.

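To illustrate the idea, here is a sketch using the official Kubernetes Python client; it is not ElasticDL's actual master code, and the pod name and image are assumptions. A master-like process could start a worker pod and then watch pod events:

```python
from kubernetes import client, config, watch

config.load_kube_config()  # assumes kubectl is already configured for the cluster
v1 = client.CoreV1Api()

# Start one worker pod; the pod name and image below are illustrative.
worker_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="elasticdl-worker-0", labels={"app": "elasticdl"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(name="worker", image="elasticdl:dev")],
    ),
)
v1.create_namespaced_pod(namespace="default", body=worker_pod)

# Watch pod events; on deletion or failure (e.g. preemption), a fault-tolerant
# master would reassign the dead worker's tasks instead of failing the job.
for event in watch.Watch().stream(v1.list_namespaced_pod, namespace="default"):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```
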
In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require specific computing platforms like Kubernetes and allows TensorFlow to run on any platform.
### Development Docker Image

Note that since ElasticDL depends on TensorFlow, the base image must have TensorFlow installed.

If you have difficulties downloading from the main PyPI site or the Golang site, you can pass extra build arguments to `docker build`, using `EXTRA_PYPI_INDEX` for a PyPI mirror and `GO_MIRROR_URL` for a mirror of the Golang installation package:
```bash
docker build \
    --build-arg EXTRA_PYPI_INDEX=<your_pypi_mirror> \
    --build-arg GO_MIRROR_URL=<your_golang_mirror> \
    -f elasticdl/docker/Dockerfile .
```
To develop in the Docker container, mount your cloned `elasticdl` git repo directory (e.g. `EDL_REPO` below) to the `/elasticdl` directory in the container and start the container.

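A minimal sketch of such a command, assuming the development image built above is tagged `elasticdl:dev`:

```bash
# Mount the cloned repo into the container at /elasticdl and open a shell;
# the image tag elasticdl:dev is an assumption, not a documented tag.
docker run --rm -it -v $EDL_REPO:/elasticdl -w /elasticdl elasticdl:dev bash
```
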
The continuous integration Docker image contains everything from the development Docker image, plus the processed demo data in RecordIO format and the ElasticDL source code. It is used to run continuous integration with the latest version of the source code. In the repo's root directory, run the following command:
```bash
docker build \
    -f elasticdl/docker/Dockerfile .
```

### Pre-commit Check
We have set up pre-commit checks in the GitHub repo for pull requests, which can catch some Python style problems. However, to avoid waiting in the Travis CI queue, you can run the pre-commit checks locally:
```bash
docker run --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev bash -c "pre-commit run -a"  # image tag and pre-commit invocation here are assumptions
```

Note that some unit tests may require a running Kubernetes cluster to be available. To include those unit tests, run the following:
```bash
make -f elasticdl/Makefile && pytest elasticdl/python/tests
```
[MaxCompute](https://www.alibabacloud.com/product/maxcompute)-related tests require additional environment variables. To run those tests, execute the following:

This will train MNIST data with a model defined in [model_zoo/mnist_functional_api/mnist_functional_api.py](../model_zoo/mnist_functional_api/mnist_functional_api.py) for 2 epochs. Note that the master will save model checkpoints in a local directory `checkpoint_dir`.

If you run into issues related to proto definitions, please run the following command to build the latest proto components:
```bash
make -f elasticdl/Makefile
```
### Test with Kubernetes
We can also test an ElasticDL job in a Kubernetes cluster using the previously built [image](#development-docker-image).

First make sure the built image has been pushed to a Docker registry, and then run the following command to launch the job:
```bash
kubectl apply -f manifests/elasticdl.yaml
```
You might want to change the value of the `imagePullPolicy` property to `Always` or `Never` in your trial.

If you find a permission error in the main pod log, e.g., `"pods is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"pods\""`, you need to grant pod-related permissions to the default user.

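One way to grant such permissions on a throwaway test cluster is to bind a role to the default service account. The binding below is an illustrative sketch, not the project's documented manifest; prefer a narrowly scoped Role outside of test clusters:

```bash
# Give the default service account in the default namespace broad permissions;
# the binding name is illustrative.
kubectl create clusterrolebinding elasticdl-default-sa \
    --clusterrole=cluster-admin \
    --serviceaccount=default:default
```
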
All tests will be executed on [Travis CI](https://travis-ci.org/sql-machine-learning/elasticdl), which includes:

- Pre-commit checks
- Unit tests
- Integration tests

The unit tests and integration tests also contain tests running on a local Kubernetes cluster via [Minikube](https://kubernetes.io/docs/setup/learning-environment/minikube/) and tests that require data sources from [MaxCompute](https://www.alibabacloud.com/product/maxcompute). Please refer to the [Travis configuration file](../.travis.yml) for more details.

Note that tests related to MaxCompute will not be executed on pull requests created from forks, since the MaxCompute access information is secured on Travis and only those with write access can retrieve it. Developers who have write access to this repo are encouraged to submit pull requests from branches instead of forks if any code related to MaxCompute has been modified.

Also note that two integration test cases involve loading a checkpoint. It is not easy to generate checkpoints automatically during integration tests, so we currently keep a checkpoint file in the [test data folder](python/tests/testdata) of the ElasticDL GitHub repository and use it for integration tests. Thus, you need to regenerate this checkpoint file if your PR modifies the definition of the Model protocol buffer.

If you want to trigger Travis builds without submitting a pull request, you can do so by developing on a branch and adding the branch name to the list in the `branches` section of the [Travis configuration file](../.travis.yml). Note that you can also trigger Travis builds from forks, but this requires additional work such as activating Travis for the forked repo, and MaxCompute-related tests will be skipped as mentioned earlier.