Commit a5a6c50

A tutorial for tf.estimator (#2528)
* Add tutorials for tf.estimator * Add a tutorial on how to run a tf.estimator model using ElasticDL * Polish the tutorial * Polish the tutorial per review comments
docs/tutorials/elasticdl_estimator.md

# Train TensorFlow Estimator Models using ElasticDL on a Personal Computer

This document shows how to run an ElasticDL job that trains a tf.estimator
model on the iris dataset on Minikube.

## Prerequisites

1. Install Minikube, preferably >= v1.11.0, following the installation
   [guide](https://kubernetes.io/docs/tasks/tools/install-minikube). Minikube
   runs a single-node Kubernetes cluster in a virtual machine on your personal
   computer.

1. Install Docker CE, preferably >= 18.x, following the
   [guide](https://docs.docker.com/docker-for-mac/install/). Docker is used to
   build images containing user-defined models and the ElasticDL framework.

1. Install Python, preferably >= 3.6, because the ElasticDL command-line tool
   is written in Python.

## Models

Among all the machine learning toolkits that ElasticDL can work with,
TensorFlow is the most tested and widely used. In this tutorial, we use a model
from the [model zoo](https://github.com/sql-machine-learning/elasticdl/tree/develop/model_zoo)
directory. This model is defined using the TensorFlow Estimator API.

## Datasets

We use the [iris](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
dataset in this tutorial.

```bash
mkdir ./data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O ./data/iris.data
```
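
Later snippets in this tutorial refer to `rows` (the parsed CSV rows) and
`CATEGORY_CODE` (a species-to-label mapping). A minimal sketch of building
those structures, with a few inline sample lines standing in for the
downloaded `./data/iris.data` file so the snippet is self-contained:

```python
import csv
import io

# Map iris species names to integer class labels; the name CATEGORY_CODE
# matches the identifier used by the generator later in this tutorial.
CATEGORY_CODE = {
    "Iris-setosa": 0,
    "Iris-versicolor": 1,
    "Iris-virginica": 2,
}

# In the tutorial you would read open("./data/iris.data") instead of
# these inline sample lines.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Each row is [sepal_len, sepal_wid, petal_len, petal_wid, species_name].
rows = [
    [float(v) for v in line[:4]] + [line[4]]
    for line in csv.reader(io.StringIO(SAMPLE))
    if line  # iris.data ends with a blank line
]
```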

## The Kubernetes Cluster

The following command starts a Kubernetes cluster locally using Minikube. It
uses [VirtualBox](https://www.virtualbox.org/), a hypervisor that runs on
macOS, to create the virtual machine for the cluster.

```bash
minikube start --vm-driver=virtualbox \
  --cpus 2 --memory 6144 --disk-size=50gb
eval $(minikube docker-env)
```

The command `minikube docker-env` returns a set of Bash environment variables
that configure your local environment to reuse the Docker daemon inside the
Minikube instance.

The following command is necessary to enable
[RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) in
Kubernetes.

```bash
kubectl apply -f \
  https://raw.githubusercontent.com/sql-machine-learning/elasticdl/develop/elasticdl/manifests/elasticdl-rbac.yaml
```

If you happen to live in a region where `raw.githubusercontent.com` is blocked,
you might want to git clone the above repository to get the YAML file.

## Install ElasticDL Client Tool

The following command installs the command-line tool `elasticdl`, which talks
to the Kubernetes cluster and operates ElasticDL jobs.

```bash
pip install elasticdl_client
```

## Build the Docker Image with the Model Definition

Kubernetes runs Docker containers, so we need to put the user-defined model,
the ElasticDL API package, and all dependencies into a Docker image.

In this tutorial, we use a predefined model in the ElasticDL repository. To
retrieve the source code, please run the following command.

```bash
git clone https://github.com/sql-machine-learning/elasticdl
```

The estimator model definition is in the directory
[elasticdl/model_zoo/iris](https://github.com/sql-machine-learning/elasticdl/tree/develop/model_zoo/iris).

We build the image based on TensorFlow 1.13.2; the Dockerfile is

```dockerfile
FROM tensorflow/tensorflow:1.13.2-py3 as base

RUN pip install elasticdl_api

COPY ./model_zoo model_zoo
```

Then, we use Docker to build the image, where `${iris_dockerfile}` is the path
to the Dockerfile above:

```bash
docker build -t elasticdl:iris_estimator -f ${iris_dockerfile} .
```

## Submit the Training Job

The following command submits a training job:

```bash
elasticdl train \
  --image_name=elasticdl:1.0.0 \
  --worker_image=elasticdl:iris_estimator \
  --ps_image=elasticdl:iris_estimator \
  --job_command="python -m model_zoo.iris.dnn_estimator" \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --num_ps=1 \
  --ps_resource_request="cpu=0.2,memory=1024Mi" \
  --ps_resource_limit="cpu=1,memory=2048Mi" \
  --num_workers=1 \
  --worker_resource_request="cpu=0.3,memory=1024Mi" \
  --worker_resource_limit="cpu=1,memory=2048Mi" \
  --chief_resource_request="cpu=0.3,memory=1024Mi" \
  --chief_resource_limit="cpu=1,memory=2048Mi" \
  --num_evaluator=1 \
  --evaluator_resource_request="cpu=0.3,memory=1024Mi" \
  --evaluator_resource_limit="cpu=1,memory=2048Mi" \
  --job_name=test-iris-estimator \
  --image_pull_policy=Never \
  --distribution_strategy=ParameterServerStrategy \
  --need_tf_config=true \
  --volume="host_path={iris_data_dir},mount_path=/data"
```

`--image_name` is the image used to launch the ElasticDL master, which has
nothing to do with the estimator model. The ElasticDL master is responsible
for launching pods and assigning data shards to workers with elasticity and
fault tolerance.

`{iris_data_dir}` is the absolute path of the `./data` directory containing
`iris.data`. The option `--volume="host_path={iris_data_dir},mount_path=/data"`
bind mounts it into the containers/pods.

The option `--num_workers=1` tells the master to start one worker pod.
The option `--num_ps=1` tells the master to start one parameter server (ps) pod.
The option `--num_evaluator=1` tells the master to start one evaluator pod.

In addition, the master starts a chief worker for a TensorFlow estimator model
by default.

### Check Job Status

After the job submission, we can run the command `kubectl get pods` to list
the related containers.

```bash
NAME                                     READY   STATUS    RESTARTS   AGE
elasticdl-test-iris-estimator-master     1/1     Running   0          9s
test-iris-estimator-edljob-chief-0       1/1     Running   0          6s
test-iris-estimator-edljob-evaluator-0   0/1     Pending   0          6s
test-iris-estimator-edljob-ps-0          1/1     Running   0          7s
test-iris-estimator-edljob-worker-0      1/1     Running   0          6s
```

## Train an Estimator Model Using ElasticDL with Your Dataset

You only need to modify your `input_fn` to use the ElasticDL DataShardService.
The DataShardService splits the sample indices into ranges and assigns those
ranges to workers. Each worker only needs to read the samples whose indices
fall in its assigned ranges.

1. Create a DataShardService.

   ```python
   from elasticai_api.common.data_shard_service import build_data_shard_service

   training_data_shard_svc = build_data_shard_service(
       batch_size=batch_size,
       num_epochs=100,
       dataset_size=len(rows),
       num_minibatches_per_shard=1,
       dataset_name="iris_training_data",
   )
   ```

   - `batch_size`: the batch size of each step.
   - `num_epochs`: the number of epochs.
   - `dataset_size`: the total number of samples in the dataset.
   - `num_minibatches_per_shard`: the number of batches in each shard.
     The number of samples in each shard is
     `batch_size * num_minibatches_per_shard`.
   - `dataset_name`: the name of the dataset.
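
As a quick sanity check of these parameters, assuming the full iris dataset
(150 samples) with a hypothetical `batch_size` of 32, each shard covers 32
sample indices and one epoch produces 5 shards:

```python
import math

dataset_size = 150            # samples in iris.data
batch_size = 32               # illustrative value, not from the tutorial
num_minibatches_per_shard = 1

# Each shard covers batch_size * num_minibatches_per_shard sample indices;
# the last shard of an epoch may be smaller.
shard_size = batch_size * num_minibatches_per_shard
num_shards_per_epoch = math.ceil(dataset_size / shard_size)
```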

2. Create a generator that reads samples according to shards.

   `shard.start` and `shard.end` are the start and end indices of the samples
   in a shard. You can read samples using these two indices like:

   ```python
   def train_generator(shard_service):
       while True:
           # Read samples in the range given by a shard from
           # the data shard service.
           shard = shard_service.fetch_shard()
           if not shard:
               break
           for i in range(shard.start, shard.end):
               label = CATEGORY_CODE[rows[i][-1]]
               yield rows[i][0:-1], [label]
   ```
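
To see the control flow of this generator in isolation, here is a
self-contained sketch with a stub shard service. The `Shard` and
`StubShardService` names are hypothetical stand-ins for illustration; in a
real job the service comes from `build_data_shard_service` as shown above.

```python
from collections import namedtuple

# Hypothetical stand-in for the shard objects returned by the real service.
Shard = namedtuple("Shard", ["start", "end"])

class StubShardService:
    """Hands out index ranges, then None when the data is exhausted."""

    def __init__(self, dataset_size, shard_size):
        self.shards = [
            Shard(s, min(s + shard_size, dataset_size))
            for s in range(0, dataset_size, shard_size)
        ]

    def fetch_shard(self):
        return self.shards.pop(0) if self.shards else None

# Tiny dataset in the same shape as the tutorial's `rows`.
CATEGORY_CODE = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
rows = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
]

def train_generator(shard_service):
    while True:
        shard = shard_service.fetch_shard()
        if not shard:
            break
        for i in range(shard.start, shard.end):
            label = CATEGORY_CODE[rows[i][-1]]
            yield rows[i][0:-1], [label]

# Two shards are handed out: indices [0, 2) and [2, 3).
samples = list(train_generator(StubShardService(dataset_size=3, shard_size=2)))
print(samples[0])  # ([5.1, 3.5, 1.4, 0.2], [0])
```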

3. Create a session hook to report shard completion.

   ```python
   from elasticai_api.tensorflow.hooks import ElasticDataShardReportHook

   hooks = [
       ElasticDataShardReportHook(training_data_shard_svc),
   ]
   train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, hooks=hooks)
   ```

After these three steps, you can train your estimator models using ElasticDL
in data-parallel mode.
