|
| 1 | +# ElasticDL Command-line Client Tool |
| 2 | + |
| 3 | +## Background |
| 4 | + |
| 5 | +ElasticDL is a Kubernetes-native deep learning framework. Because it runs |
| 6 | +distributed training/prediction/evaluation jobs in a cluster, we need a client |
| 7 | +to submit the jobs to the cluster. The main functions of the client are |
| 8 | +*building the image for an ElasticDL job* and *submitting the ElasticDL job*. |
| 9 | + |
| 10 | +Currently we have a client, but it is tightly coupled with the main package. It |
| 11 | +is too heavy: users need to pip install the whole elasticdl package and many |
| 12 | +dependencies such as TensorFlow, grpcio, etc. |
| 13 | + |
| 14 | +To improve the user experience, the client should be lightweight and depend |
| 15 | +only on Docker and the Kubernetes API. This doc discusses the design of |
| 16 | +this command-line client tool. |
| 17 | + |
| 18 | +## User Story |
| 19 | + |
| 20 | +1. Users develop a model. The directory structure of the model definition files |
| 21 | + is as follows: |
| 22 | + |
| 23 | + ```TEXT |
| 24 | + a_directory |
| 25 | + - wide_and_deep.py |
| 26 | + requirements.txt |
| 27 | + ``` |
| 28 | +
|
| 29 | +1. Generate a Dockerfile. |
| 30 | +
|
| 31 | + Run the command: |
| 32 | +
|
| 33 | + ```bash |
| 34 | + cd ${model_root_path} |
| 35 | + elasticdl zoo init [base_image_name] |
| 36 | + ``` |
| 37 | +
|
| 38 | + `base_image_name` is optional and defaults to `python`. |
| 39 | + An example of the generated Dockerfile: |
| 40 | +
|
| 41 | + ```Dockerfile |
| 42 | + FROM python |
| 43 | + COPY . /model_zoo |
| 44 | + RUN pip install -r /model_zoo/requirements.txt |
| 45 | + RUN pip install elasticdl |
| 46 | + ``` |
| 47 | +
|
| 48 | + Users can edit the generated Dockerfile further if necessary. |
| 49 | +
|
| 50 | +1. Build the Docker image for an ElasticDL job. |
| 51 | +
|
| 52 | + ```bash |
| 53 | + elasticdl zoo build --image=a_docker_registry/bright/elasticdl-wnd:1.0 . |
| 54 | + ``` |
| 55 | +
|
| 56 | +1. Push the Docker image to a remote registry (optional). |
| 57 | +
|
| 58 | + ```bash |
| 59 | + elasticdl zoo push a_docker_registry/bright/elasticdl-wnd:1.0 |
| 60 | + ``` |
| 61 | +
|
| 62 | +1. Submit a model training/prediction/evaluation job. |
| 63 | +
|
| 64 | + ```bash |
| 65 | + elasticdl train \ |
| 66 | + --image=a_docker_registry/bright/elasticdl-wnd:1.0 \ |
| 67 | + --model_def=a_directory.wide_and_deep.custom_model \ |
| 68 | + --training_data=/data/mnist/train \ |
| 69 | + --validation_data=/data/mnist/test \ |
| 70 | + --num_epochs=2 \ |
| 71 | + --minibatch_size=64 \ |
| 72 | + --num_ps_pods=1 \ |
| 73 | + --num_workers=1 \ |
| 74 | + --evaluation_steps=50 \ |
| 75 | + --job_name=test-mnist \ |
| 76 | + --distribution_strategy=ParameterServerStrategy \ |
| 77 | + --master_resource_request="cpu=0.2,memory=1024Mi" \ |
| 78 | + --master_resource_limit="cpu=1,memory=2048Mi" \ |
| 79 | + --worker_resource_request="cpu=0.4,memory=1024Mi" \ |
| 80 | + --worker_resource_limit="cpu=1,memory=2048Mi" \ |
| 81 | + --ps_resource_request="cpu=0.2,memory=1024Mi" \ |
| 82 | + --ps_resource_limit="cpu=1,memory=2048Mi" |
| 83 | + ``` |
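
The Dockerfile generation in the `elasticdl zoo init` step could be sketched roughly as follows. This is a minimal illustration, not the actual client implementation: the helper names `render_dockerfile` and `write_dockerfile` are hypothetical, and the only behavior taken from this doc is the default base image `python` and the four lines of the example Dockerfile.

```python
from pathlib import Path

# Default base image, as stated in the "Generate a Dockerfile" step.
DEFAULT_BASE_IMAGE = "python"


def render_dockerfile(base_image=DEFAULT_BASE_IMAGE):
    """Return Dockerfile text mirroring the example in this doc.

    Hypothetical helper: the real client may generate different content.
    """
    lines = [
        f"FROM {base_image}",
        "COPY . /model_zoo",
        "RUN pip install -r /model_zoo/requirements.txt",
        "RUN pip install elasticdl",
    ]
    return "\n".join(lines) + "\n"


def write_dockerfile(model_root, base_image=DEFAULT_BASE_IMAGE):
    """Write the rendered Dockerfile into the model root directory."""
    path = Path(model_root) / "Dockerfile"
    path.write_text(render_dockerfile(base_image))
    return path
```

Keeping the rendering separate from the file write makes the generated text easy to inspect or test before anything touches the user's model directory.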
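
The resource flags in the `elasticdl train` example, such as `--worker_resource_request="cpu=0.4,memory=1024Mi"`, are comma-separated `key=value` strings. A lightweight client would need to turn them into a mapping before building a Kubernetes pod spec. The sketch below shows one possible way to do that. The function name `parse_resource` is hypothetical, and this doc does not describe how ElasticDL parses or validates these strings. Values are kept as strings, since Kubernetes accepts resource quantities in string form.

```python
def parse_resource(spec):
    """Parse a string like 'cpu=0.2,memory=1024Mi' into a dict.

    Hypothetical helper for illustration; quantities stay as strings.
    """
    resources = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or an empty string
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"malformed resource entry: {item!r}")
        resources[key] = value
    return resources


if __name__ == "__main__":
    print(parse_resource("cpu=0.4,memory=1024Mi"))
```

Rejecting malformed entries on the client side reports a typo in a flag before the job reaches the cluster.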