Commit 5a74bd7

A PyTorch tutorial (#2530)
* Add tutorials for pytorch
* Add a tutorial of Pytorch
1 parent 3558e62 commit 5a74bd7

2 files changed, +261 -30 lines changed

docs/tutorials/elasticdl_torch.md

Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
# Train PyTorch Models using ElasticDL on a Personal Computer

This document shows how to run an ElasticDL AllReduce job to train a PyTorch
model using the MNIST dataset on Minikube.

## Prerequisites

1. Install Minikube, preferably >= v1.11.0, following the installation
   [guide](https://kubernetes.io/docs/tasks/tools/install-minikube). Minikube
   runs a single-node Kubernetes cluster in a virtual machine on your personal
   computer.

1. Install Docker CE, preferably >= 18.x, following the
   [guide](https://docs.docker.com/docker-for-mac/install/) for building Docker
   images containing user-defined models and the ElasticDL framework.

1. Install Python, preferably >= 3.6, because the ElasticDL command-line tool is
   written in Python.

## Models

In this tutorial, we use the model defined in the [model
zoo](https://github.com/sql-machine-learning/elasticdl/tree/develop/model_zoo/mnist/mnist_pytorch.py)
directory. This model is defined using the PyTorch API.

## Datasets

We use the [MNIST](https://www.kaggle.com/jidhumohan/mnist-png/download)
dataset in this tutorial. After downloading the dataset, unzip it into a
local directory.

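Before building anything, it can help to confirm that the unzipped directory has the layout the training command expects. The sketch below is only an illustration: it assumes the archive unpacks to `mnist_png/training` and `mnist_png/testing` with one sub-directory per label (the layout that `torchvision.datasets.ImageFolder` reads), and the function name `count_images_per_class` and the `./data` location are hypothetical.

```python
from pathlib import Path


def count_images_per_class(split_dir):
    """Return {class_name: number_of_png_files} for one split directory.

    Assumes one sub-directory per label, which is the layout that
    `torchvision.datasets.ImageFolder` reads.
    """
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.png"))
    return counts


if __name__ == "__main__":
    # Hypothetical location; replace it with your own unzip directory.
    root = Path("./data/mnist_png")
    for split in ("training", "testing"):
        split_dir = root / split
        if split_dir.is_dir():
            print(split, count_images_per_class(split_dir))
        else:
            print(f"{split_dir} not found")
```

If a split is missing or a label folder is empty, it is easier to fix that now than to debug a failing worker pod later.
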
## The Kubernetes Cluster

The following command starts a Kubernetes cluster locally using Minikube. It
uses [VirtualBox](https://www.virtualbox.org/), a hypervisor available for
macOS, to create the virtual machine cluster.

```bash
minikube start --vm-driver=virtualbox \
  --cpus 2 --memory 6144 --disk-size=50gb
eval $(minikube docker-env)
```

The command `minikube docker-env` prints a set of Bash environment variables
that configure your local environment to reuse the Docker daemon inside
the Minikube instance.

The following command is necessary to enable
[RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) of
Kubernetes.

```bash
kubectl apply -f \
  https://raw.githubusercontent.com/sql-machine-learning/elasticdl/develop/elasticdl/manifests/elasticdl-rbac.yaml
```

If you happen to live in a region where `raw.githubusercontent.com` is blocked,
you might want to Git clone the above repository to get the YAML file.

## Install ElasticDL Client Tool

The following command installs the command-line tool `elasticdl`, which talks to
the Kubernetes cluster and operates ElasticDL jobs.

```bash
pip install elasticdl_client
```

## Build the Docker Image with Model Definition

Kubernetes runs Docker containers, so we need to put user-defined models,
the ElasticDL API package and all dependencies into a Docker image.

In this tutorial, we use a predefined model in the ElasticDL repository. To
retrieve the source code, please run the following command.

```bash
git clone https://github.com/sql-machine-learning/elasticdl
```

The model definition is in the file [elasticdl/model_zoo/mnist/mnist_pytorch.py](https://github.com/sql-machine-learning/elasticdl/blob/develop/model_zoo/mnist/mnist_pytorch.py).

We build the image based on `horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu`
and the Dockerfile is

```text
FROM horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu as base

RUN pip install opencv-python
RUN apt update
RUN apt install -y libgl1-mesa-glx libglib2.0-dev

RUN HOROVOD_WITHOUT_MPI=1 \
    HOROVOD_WITH_GLOO=1 \
    HOROVOD_WITHOUT_MXNET=1 \
    HOROVOD_WITH_TENSORFLOW=1 \
    HOROVOD_WITH_PYTORCH=1 \
    pip install horovod==0.21.0

RUN pip install elasticdl_api

COPY ./model_zoo model_zoo
ENV PYTHONUNBUFFERED 0
```

Then, we use Docker to build the image. Here, `${mnist_dockerfile}` stands for
the path of the Dockerfile above. Run the command from the root of the cloned
repository, because the Dockerfile copies `./model_zoo` from the build context.

```bash
docker build -t elasticdl:mnist_pytorch -f ${mnist_dockerfile} .
```

## Submit the Training Job

The following command submits a training job:

```bash
elasticdl train \
  --image_name=elasticdl/elasticdl:v1.0.0 \
  --worker_image=elasticdl:mnist_pytorch \
  --job_command="python -m model_zoo.mnist.mnist_pytorch --batch_size 64 --num_epochs 1 --training_data=/data/mnist_png/training --validation_data=/data/mnist_png/testing" \
  --num_minibatches_per_task=2 \
  --num_workers=2 \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --worker_resource_request="cpu=0.3,memory=1024Mi" \
  --worker_resource_limit="cpu=1,memory=2048Mi" \
  --envs="USE_TORCH=true,HOROVOD_GLOO_TIMEOUT_SECONDS=60,PYTHONUNBUFFERED=true" \
  --job_name=test-mnist-allreduce \
  --image_pull_policy=Never \
  --volume="host_path={mnist_data_dir},mount_path=/data" \
  --distribution_strategy=AllreduceStrategy
```

`--image_name` is the image used to launch the ElasticDL master, which
has nothing to do with the model. The ElasticDL master is
responsible for launching pods and assigning data shards to workers with
elasticity and fault-tolerance.

`--worker_image` is the image built in the previous step, so its tag must
match the one passed to `docker build`.

`{mnist_data_dir}` is the absolute path of the local `./data` directory that
contains `mnist_png`. The option `--volume="host_path={mnist_data_dir},mount_path=/data"`
bind-mounts it into the containers/pods.

The option `--num_workers=2` tells the master to start 2 worker pods.

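Because `{mnist_data_dir}` has to be an absolute path, it is easy to get the `--volume` value wrong when working from a relative directory. As a minimal sketch, the helper below builds that value; `build_volume_option` is a hypothetical name and not part of the `elasticdl` client.

```python
import os


def build_volume_option(data_dir, mount_path="/data"):
    """Build the value of `--volume` from a local data directory.

    `data_dir` may be relative; it is converted to an absolute path because
    the tutorial asks for an absolute host path.
    """
    host_path = os.path.abspath(data_dir)
    return f"host_path={host_path},mount_path={mount_path}"


if __name__ == "__main__":
    # "./data" is the directory that contains `mnist_png` in this tutorial.
    print("--volume=" + build_volume_option("./data"))
```

The printed string can be pasted into the `elasticdl train` command in place of the `--volume` line.
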
### Check Job Status

After the job submission, we can run the command `kubectl get pods` to list
the related pods.

```bash
NAME                                    READY   STATUS    RESTARTS   AGE
elasticdl-test-mnist-allreduce-master   1/1     Running   0          7s
test-mnist-allreduce-edljob-worker-0    1/1     Running   0          5s
test-mnist-allreduce-edljob-worker-1    1/1     Running   0          5s
```

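To check the listing from a script rather than by eye, the text printed by `kubectl get pods` can be parsed directly. The following is a rough sketch under the assumption that the output has the column layout shown above; `all_pods_running` is a hypothetical helper, and the sample text is copied from this tutorial.

```python
def all_pods_running(kubectl_output):
    """Return True if every pod row in `kubectl get pods` output is Running."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return False
    header, rows = lines[0].split(), lines[1:]
    # Locate the STATUS column by name instead of hard-coding its position.
    status_col = header.index("STATUS")
    return bool(rows) and all(
        row.split()[status_col] == "Running" for row in rows
    )


SAMPLE_OUTPUT = """
NAME                                    READY   STATUS    RESTARTS   AGE
elasticdl-test-mnist-allreduce-master   1/1     Running   0          7s
test-mnist-allreduce-edljob-worker-0    1/1     Running   0          5s
test-mnist-allreduce-edljob-worker-1    1/1     Running   0          5s
"""

if __name__ == "__main__":
    print(all_pods_running(SAMPLE_OUTPUT))
```
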
## Train a PyTorch Model Using ElasticDL with Your Dataset

To support fault-tolerance and elasticity with ElasticDL, you only
need to create a custom dataset and wrap the function that performs the
forward and backward computation using ElasticDL APIs.

### Create a Dataset With the RecordIndexService of ElasticDL

ElasticDL can split the whole dataset into shards and assign those
shards to workers. If some workers fail, ElasticDL can re-assign the
shards of failed workers to other running workers. We can get the sample
indices in those shards from `RecordIndexService`, and create a
dataset that reads images by those indices.

```python
import sys

import cv2
import numpy as np
from torch.utils.data import Dataset

from elasticai_api.common.data_shard_service import RecordIndexService


class ElasticDataset(Dataset):
    def __init__(self, images, data_shard_service=None):
        """The dataset supports elastic training.

        Args:
            images: A list with tuples like (image_path, label_index).
                For example, we can use `torchvision.datasets.ImageFolder`
                to get the list.
            data_shard_service: A `RecordIndexService` of elasticai_api.
                Pass it to enable elastic training.
        """
        self.data_shard_service = data_shard_service
        self._images = images

    def __len__(self):
        if self.data_shard_service:
            # Set the maxsize because the size of dataset is not fixed
            # when using dynamic sharding
            return sys.maxsize
        else:
            return len(self._images)

    def __getitem__(self, index):
        if self.data_shard_service:
            # Ignore the index from the DataLoader and read the record
            # that the service assigns to this worker.
            index = self.data_shard_service.fetch_record_index()
            return self.read_image(index)
        else:
            return self.read_image(index)

    def read_image(self, index):
        image_path, label = self._images[index]
        image = cv2.imread(image_path)
        image = np.array(image / 255.0, np.float32)
        image = image.reshape(3, 28, 28)
        return image, label


if __name__ == "__main__":
    ...
    data_shard_service = RecordIndexService(
        batch_size=args.batch_size,
        dataset_size=len(train_data.imgs),
        num_epochs=args.num_epochs,
        shuffle=True,
        dataset_name="mnist_training_data",
    )
    train_dataset = ElasticDataset(train_data.imgs, data_shard_service)
    ...
```
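
The key idea in `__getitem__` is that the index supplied by the `DataLoader` is ignored and the service decides which record to read. The sketch below illustrates this without PyTorch or a running master. `FakeRecordIndexService` and `IndexDrivenDataset` are hypothetical stand-ins: the fake service serves indices from a local queue, whereas the real `RecordIndexService` is described above as getting them from shards assigned by ElasticDL.

```python
import random
from collections import deque


class FakeRecordIndexService:
    """Hypothetical stand-in for `RecordIndexService`.

    It serves every record index once per epoch from a local queue.
    """

    def __init__(self, dataset_size, num_epochs=1, shuffle=False, seed=0):
        rng = random.Random(seed)
        self._indices = deque()
        for _ in range(num_epochs):
            epoch = list(range(dataset_size))
            if shuffle:
                rng.shuffle(epoch)
            self._indices.extend(epoch)

    def fetch_record_index(self):
        # Raises IndexError once all epochs have been consumed.
        return self._indices.popleft()


class IndexDrivenDataset:
    """Mimics the `__getitem__` logic of `ElasticDataset` without PyTorch."""

    def __init__(self, records, data_shard_service=None):
        self._records = records
        self.data_shard_service = data_shard_service

    def __getitem__(self, index):
        if self.data_shard_service:
            # The caller's index is ignored; the service decides what to read.
            index = self.data_shard_service.fetch_record_index()
        return self._records[index]


if __name__ == "__main__":
    records = ["img0", "img1", "img2", "img3"]
    service = FakeRecordIndexService(dataset_size=len(records), shuffle=True)
    dataset = IndexDrivenDataset(records, service)
    # Always ask for index 0: the service still yields every record once.
    print(sorted(dataset[0] for _ in range(len(records))))
```

This also suggests why `__len__` returns `sys.maxsize` in elastic mode: the number of records a worker ends up reading depends on the shards it is assigned, not on a fixed range of indices.
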

### Create an ElasticDL Controller to Wrap the Forward and Backward Computation

In ElasticDL AllReduce, we need to create a `PyTorchAllReduceController`
of ElasticDL. At the beginning, the controller broadcasts the model and
optimizer. If some workers fail, the controller re-initializes Horovod
using a new world (the current set of workers). After creating the
controller, we should wrap the function that performs the forward and
backward computation with `elastic_run`.

```python
import torch.nn.functional as F

from elasticai_api.pytorch.controller import PyTorchAllReduceController


def train_one_batch(model, optimizer, data, target):
    # One forward pass, backward pass, and optimizer update for a batch.
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    return loss


if __name__ == "__main__":
    ...
    model = ...
    optimizer = ...
    allreduce_controller = PyTorchAllReduceController(data_shard_service)
    allreduce_controller.set_broadcast_model(model)
    allreduce_controller.set_broadcast_optimizer(optimizer)
    # Use the elastic function to wrap the training function with a batch.
    elastic_train_one_batch = allreduce_controller.elastic_run(train_one_batch)

    with allreduce_controller.scope():
        for batch_idx, (data, target) in enumerate(train_loader):
            loss = elastic_train_one_batch(model, optimizer, data, target)
            ...
```
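
To build intuition for what wrapping a step function with `elastic_run` can provide, the sketch below shows a generic retry-after-reinitialization wrapper. It is not ElasticDL's implementation: `ToyController`, `WorkerFailure`, and `max_retries` are hypothetical names, and the re-initialization is reduced to a counter.

```python
import functools


class WorkerFailure(RuntimeError):
    """Hypothetical error raised when a collective operation fails."""


class ToyController:
    """Hypothetical controller that retries a step after re-initialization."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.reinit_count = 0

    def _reinitialize(self):
        # A real controller would rebuild the communication world and
        # broadcast the model and optimizer again.
        self.reinit_count += 1

    def elastic_run(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(self.max_retries):
                try:
                    return func(*args, **kwargs)
                except WorkerFailure:
                    self._reinitialize()
            raise RuntimeError("step failed after retries")

        return wrapper


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_step(x):
        # Fails on the first call to imitate a lost peer, then succeeds.
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise WorkerFailure("peer lost")
        return x + 1

    controller = ToyController()
    print(controller.elastic_run(flaky_step)(1), controller.reinit_count)
```

The point of the pattern is that the training loop keeps calling the wrapped function and does not need its own failure handling.
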

model_zoo/mnist/mnist_pytorch.py

Lines changed: 6 additions & 30 deletions
@@ -11,31 +11,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""
-Download the mnist dataset from
-https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
-and then untar it into ${data_store_dir}. Using minikube, we can use the
-following command to submit a training job with these codes.
-
-elasticdl train \
---image_name=elasticdl:pt_mnist_allreduce \
---job_command="python -m model_zoo.mnist.mnist_pytorch \
---training_data=/local_data/mnist_png/training \
---validation_data=/local_data/mnist_png/testing" \
---num_minibatches_per_task=2 \
---num_workers=2 \
---worker_pod_priority=0.5 \
---master_resource_request="cpu=0.2,memory=1024Mi" \
---master_resource_limit="cpu=1,memory=2048Mi" \
---worker_resource_request="cpu=0.3,memory=1024Mi" \
---worker_resource_limit="cpu=1,memory=2048Mi" \
---envs="USE_TORCH=true,HOROVOD_GLOO_TIMEOUT_SECONDS=60,PYTHONUNBUFFERED=0" \
---job_name=test-mnist-allreduce \
---image_pull_policy=Never \
---volume="host_path=${data_store_dir},mount_path=/local_data" \
---distribution_strategy=AllreduceStrategy \
-"""
-
 import argparse
 import sys
 
@@ -49,7 +24,8 @@
 from torch.optim.lr_scheduler import StepLR
 from torch.utils.data import DataLoader, Dataset
 
-from elasticai_api.pytorch.controller import create_elastic_controller
+from elasticai_api.common.data_shard_service import RecordIndexService
+from elasticai_api.pytorch.controller import PyTorchAllReduceController
 from elasticai_api.pytorch.optimizer import DistributedOptimizer
 
 
@@ -131,15 +107,14 @@ def train(args):
     train_data = torchvision.datasets.ImageFolder(args.training_data)
     test_data = torchvision.datasets.ImageFolder(args.validation_data)
 
-    allreduce_controller = create_elastic_controller(
+    data_shard_service = RecordIndexService(
         batch_size=args.batch_size,
         dataset_size=len(train_data.imgs),
         num_epochs=args.num_epochs,
         shuffle=True,
+        dataset_name="mnist_training_data",
     )
-    train_dataset = ElasticDataset(
-        train_data.imgs, allreduce_controller.data_shard_service
-    )
+    train_dataset = ElasticDataset(train_data.imgs, data_shard_service)
     train_loader = DataLoader(
         dataset=train_dataset, batch_size=args.batch_size, num_workers=2
     )
@@ -155,6 +130,7 @@ def train(args):
     scheduler = StepLR(optimizer, step_size=1, gamma=0.5)
 
     # Set the model and optimizer to broadcast.
+    allreduce_controller = PyTorchAllReduceController(data_shard_service)
     allreduce_controller.set_broadcast_model(model)
     allreduce_controller.set_broadcast_optimizer(optimizer)
     epoch = 0
