* [Cluster Training Using Kubernetes](#cluster-training-using-kubernetes)
## Introduction
In this article, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed training job:
PaddlePaddle can support both synchronous stochastic gradient descent (SGD) and asynchronous SGD.
When training with synchronous SGD, PaddlePaddle uses an internal "synchronization barrier" so that gradient updates and parameter downloads happen in strict order. Asynchronous SGD, on the other hand, does not wait for all trainers to finish uploading their gradients at each step, which increases the parallelism of distributed training: the parameter servers do not depend on each other and optimize parameters concurrently, and since they do not wait for the trainers, the trainers also work concurrently. However, asynchronous SGD introduces more randomness and noise into the gradients.
## Preparations
1. Prepare your computer cluster. It is normally a group of Linux servers connected by a LAN, each assigned a unique IP address. The computers in the cluster are called "nodes".
2. Install PaddlePaddle on every node. If you are going to use GPU cards, you'll also need to install the proper driver and CUDA libraries. To install PaddlePaddle, please read the [build and install](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/getstarted/build_and_install) document. We strongly recommend the [Docker installation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst); a sketch of a Docker-based installation check follows this list.
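For illustration, a hedged sketch of a Docker-based installation check; the image tag is an assumption, so pick whichever release matches your cluster:

```bash
# Pull a PaddlePaddle image on every node (the tag is an assumption) and
# verify that the paddle binary runs inside it.
docker pull paddlepaddle/paddle:0.10.0rc
docker run --rm paddlepaddle/paddle:0.10.0rc paddle version
```

The output of `paddle version` should begin with the installed release, e.g. `PaddlePaddle 0.10.0rc, compiled with ...`.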
We'll take `doc/howto/usage/cluster/src/word2vec` as an example to introduce distributed training using the PaddlePaddle v2 API.
## Command-line arguments

### Starting parameter server
Type a command like the sketch below to start a parameter server, which will wait for trainers to connect:
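A hedged sketch of such a command; it assumes the parameter server is started with `paddle pserver`, and that `--port` and `--ports_num` exist alongside the two arguments listed in the table below:

```bash
# A hedged sketch (flag spellings other than the two documented in the table
# below are assumptions); run this on every parameter-server node.
paddle pserver \
  --port=7164 \
  --ports_num=1 \
  --ports_num_for_sparse=1 \
  --num_gradient_servers=1
```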
| parameter | required | default value | description |
| --- | --- | --- | --- |
| ports_num_for_sparse | required | 1 | number of ports that serve sparse parameter updates |
| num_gradient_servers | required | 1 | total number of gradient servers |
### Starting trainer
Start the trainer by running the training script (name the file whatever you want, e.g. `train.py`) on every trainer node.
The training script initializes PaddlePaddle with `paddle.init(...)`, whose arguments are described in the table below:
| parameter | required | default value | description |
| --- | --- | --- | --- |
| trainer_id | required | 0 | ID for every trainer, starting from 0 |
| pservers | required | 127.0.0.1 | list of IPs of parameter servers, separated by "," |
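For illustration, a minimal sketch of such a call; it assumes the keyword names mirror the command-line arguments above and uses placeholder values for a single-trainer, single-pserver setup:

```python
import paddle.v2 as paddle

paddle.init(
    use_gpu=False,            # assumption: CPU-only training
    trainer_count=1,          # trainer threads/devices on this node
    port=7164,                # assumption: must match the pserver port
    ports_num=1,
    ports_num_for_sparse=1,   # see the parameter-server table above
    num_gradient_servers=1,   # total number of gradient servers
    trainer_id=0,             # ID for every trainer, starting from 0
    pservers="127.0.0.1")     # comma-separated list of pserver IPs
```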
### Prepare Training Dataset
Here is some example code, [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py): it downloads the public `imikolov` dataset and splits it into multiple files according to the job parallelism (the number of trainers). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files.
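As an illustration of the splitting step, a hedged sketch (not the actual `prepare.py`); it assumes a plain-text corpus with one sample per line:

```python
import os

# Number of output shard files; one or more shards are assigned to each trainer.
SPLIT_COUNT = 5

def split_corpus(corpus_path, out_dir):
    """Split a one-sample-per-line corpus into SPLIT_COUNT shard files."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, "train-%05d" % i), "w")
            for i in range(SPLIT_COUNT)]
    with open(corpus_path) as f:
        for line_no, line in enumerate(f):
            # Distribute lines round-robin over the shards.
            outs[line_no % SPLIT_COUNT].write(line)
    for out in outs:
        out.close()
```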
When the job starts, every trainer needs to get its own part of the data.
Different training jobs may have different data formats and `reader()` functions, so developers may need to write their own data preparation scripts and `reader()` functions for their jobs; a sketch of such a `reader()` follows.
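A minimal sketch of a cluster `reader()`, assuming the data was split into plain-text shard files (the `train-*` naming is an assumption) with one sample per line:

```python
import glob
import os

def cluster_reader(data_dir, trainer_id, trainer_count):
    """Return a reader() that yields only this trainer's share of the shards."""
    def reader():
        files = sorted(glob.glob(os.path.join(data_dir, "train-*")))
        for i, path in enumerate(files):
            if i % trainer_count != trainer_id:
                continue  # this shard belongs to another trainer
            with open(path) as f:
                for line in f:
                    yield line.strip()
    return reader
```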
### Prepare Training program
We'll create a *workspace* directory on each node to store your training program, its dependencies, and the mounted or downloaded dataset directories.
Your workspace may look like:
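For illustration, a minimal sketch of such a workspace; everything except `train.py`, `prepare.py`, `train_data_dir/`, and `test_data_dir/` is an assumption:

```
my_workspace/
|-- train.py          # the training program
|-- prepare.py        # optional data-preparation script
|-- train_data_dir/   # this node's share of the training data
`-- test_data_dir/    # testing data
```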
- `train_data_dir`: contains the training data. Mount it from a storage service or copy the training data here.
- `test_data_dir`: contains the testing data.
## Use cluster platforms or cluster management tools
PaddlePaddle supports running jobs on several platforms including:
- [Kubernetes](http://kubernetes.io): an open-source system from Google for automating deployment, scaling, and management of containerized applications.

We'll introduce cluster job management on these platforms.
When a job is dispatched to different nodes, these cluster platforms provide APIs or environment variables for the training processes, such as the node ID, IP address, and the total number of nodes.
### Cluster Training Using Fabric

#### Prepare a Linux cluster
Running `kubectl create -f ssh_servers.yaml` under the directory `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get the IP addresses of these nodes.
#### Launching Cluster Job
`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command-line options can be set as `paddle.py` command options, and `paddle.py` will transparently and automatically pass these options to the lower-level PaddlePaddle processes.
`paddle.py` provides two dedicated command options for easy job launching; once they are configured, launch the job as sketched below.
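A hedged sketch of launching the job; the directory and the settings inside `run.sh` are assumptions, so edit `run.sh` to match your cluster before running it:

```bash
# Edit the cluster settings in run.sh (node IPs, workspace path, etc.),
# then launch all pserver and trainer processes through paddle.py.
cd paddle/scripts/cluster_train_v2/fabric
sh run.sh
```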
The cluster job will start in several seconds.
#### Kill Cluster Job
`paddle.py` catches the SIGINT signal (`Ctrl + C`) and automatically kills all processes it launched, so simply stopping `paddle.py` kills the cluster job. If the program crashes, you should kill the job manually.
#### Check Cluster Training Result
Check the logs in `$workspace/log` for details; each node has the same log structure.
`paddle_trainer.INFO`
It provides the internal log of the trainer process. The parameter server process also writes its stderr and stdout to its own log file; check that error log if training crashes.
`train.log`
It provides the stderr and stdout of the trainer process. Check the error log if training crashes.
#### Check Model Output
After one pass has finished, the model files will be written to the `output` directory on node 0.

The `nodefile` in the workspace indicates the node ID of the current cluster job.
### Cluster Training Using OpenMPI

#### Prepare an OpenMPI cluster
Run the following command to start a 3-node MPI cluster and one "head" node.
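A hedged sketch of one way to do this with Kubernetes manifests, analogous to the Fabric demo cluster above; the directory and YAML file names are assumptions:

```bash
# Launch one "head" pod and three MPI worker pods (manifest names are
# assumptions), then wait until all pods are running.
cd paddle/scripts/cluster_train_v2/openmpi/docker_cluster
kubectl create -f head.yaml
kubectl create -f mpi-nodes.yaml
kubectl get po -o wide
```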