
Commit 70597fe

Add the client design doc. (#2073)
* Add client_tool design doc.
* Do some rephrase
* Do some rephrase
* Update the bullet.
1 parent 3ad8193 commit 70597fe

1 file changed

+83
-0
lines changed

docs/designs/client_tool.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# ElasticDL Command-line Client Tool

## Background

ElasticDL is a Kubernetes-native deep learning framework. Because it runs
distributed training, prediction, and evaluation jobs in a cluster, we need a
client to submit those jobs to the cluster. The client has two main functions:
*building the image for an ElasticDL job* and *submitting the ElasticDL job*.

Currently we have a client, but it is tightly coupled with the main package. It
is heavy: users must pip install the whole elasticdl package along with many
dependencies such as TensorFlow and grpcio.

To improve the user experience, the client should be lightweight and depend
only on Docker and the Kubernetes API. This document discusses the design of
this command-line client tool.

18+
## User Story
19+
20+
1. Users develop model and the directory structure of model definition files
21+
is as follows:
22+
23+
```TEXT
24+
a_directory
25+
- wide_and_deep.py
26+
requirements.txt
27+
```
28+
29+
1. Generate a Dockerfile.
30+
31+
Input the command:
32+
33+
```bash
34+
cd ${model_root_path}
35+
elasticdl zoo init [base_image_name]
36+
```
37+
38+
`base_image_name` is optional and the default value is `python`.
39+
The generated Dockerfile example is:
40+
41+
```Dockerfile
42+
FROM python
43+
COPY . /model_zoo
44+
RUN pip install -r /model_zoo/requirements.txt
45+
RUN pip install elasticdl
46+
```
47+
48+
Users can make additional updates on the Dockerfile if necessary.
49+
50+
1. Build the Docker image for an ElasticDL job.
51+
52+
```bash
53+
elasticdl zoo build --image=a_docker_registry/bright/elasticdl-wnd:1.0 .
54+
```
55+
56+
1. Push the Docker image to a remote registry (optional)
57+
58+
```bash
59+
elasticdl zoo push a_docker_registry/bright/elasticdl-wnd:1.0
60+
```
61+
62+
1. Submit a model training/prediction/evaluation job.
63+
64+
```bash
65+
elasticdl train \
66+
--image=a_docker_registry/bright/elasticdl-wnd:1.0 \
67+
--model_def=a_directory.wide_and_deep.custom_model \
68+
--training_data=/data/mnist/train \
69+
--validation_data=/data/mnist/test \
70+
--num_epochs=2 \
71+
--minibatch_size=64 \
72+
--num_ps_pods=1 \
73+
--num_workers=1 \
74+
--evaluation_steps=50 \
75+
--job_name=test-mnist \
76+
--distribution_strategy=ParameterServerStrategy \
77+
--master_resource_request="cpu=0.2,memory=1024Mi" \
78+
--master_resource_limit="cpu=1,memory=2048Mi" \
79+
--worker_resource_request="cpu=0.4,memory=1024Mi" \
80+
--worker_resource_limit="cpu=1,memory=2048Mi" \
81+
--ps_resource_request="cpu=0.2,memory=1024Mi" \
82+
--ps_resource_limit="cpu=1,memory=2048Mi"
83+
```
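
As an illustration of the "Generate a Dockerfile" step, the following is a minimal sketch of how `elasticdl zoo init` might produce its output. It assumes the command only substitutes the base image name into a fixed template. `render_dockerfile` is a hypothetical helper introduced here for illustration and is not part of the ElasticDL package; the actual implementation may differ.

```python
def render_dockerfile(base_image: str = "python") -> str:
    """Return Dockerfile text shaped like the example in this doc.

    Hypothetical helper: the name and signature are assumptions,
    not the ElasticDL API.
    """
    lines = [
        f"FROM {base_image}",
        # Copy the model root (the current build context) into the image.
        "COPY . /model_zoo",
        # Install the model's own dependencies first, then ElasticDL.
        "RUN pip install -r /model_zoo/requirements.txt",
        "RUN pip install elasticdl",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_dockerfile()` with no argument yields the `FROM python` example shown in the list above. Passing another image name only changes the `FROM` line.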
