
Commit cecdede

Author: wangyanfei01 (committed)
ISSUE=4607611 refine cluster scripts
git-svn-id: https://svn.baidu.com/idl/trunk/paddle@1461 1ad973e4-5ce8-4261-8a94-b56d1f490c56
1 parent 88c6486 commit cecdede

File tree: 3 files changed, 67 additions (+) and 35 deletions (−)


doc/cluster/opensource/cluster_train.md

Lines changed: 42 additions & 21 deletions
@@ -1,8 +1,8 @@
 # Cluster Training
 
-We provide this simple scripts to help you to launch cluster training Job to harness PaddlePaddle's distributed trainning. For MPI and other cluster scheduler refer this naive script to implement more robust cluster training platform by yourself.
+We provide some simple scripts in ```paddle/scripts/cluster_train``` to help you launch a cluster training job and harness PaddlePaddle's distributed training. For MPI and other cluster schedulers, refer to these naive scripts to implement a more robust cluster training platform yourself.
 
-The following cluster demo is based on RECOMMENDATION local training demo in PaddlePaddle ```demo/recommendation``` directory. Assuming you enter the cluster_scripts/ directory.
+The following cluster demo is based on the RECOMMENDATION local training demo in the PaddlePaddle ```demo/recommendation``` directory. It assumes you are in the ```paddle/scripts/cluster_train/``` directory.
 
 ## Pre-requirements
 
@@ -12,9 +12,9 @@ Firstly,
 pip install fabric
 ```
 
-Secondly, go through installing scripts to install PaddlePaddle at all nodes to make sure demo can run as local mode.
+Secondly, go through the installation scripts to install PaddlePaddle on all nodes so that the demo can run in local mode. For CUDA-enabled training, we assume that CUDA is installed in ```/usr/local/cuda```; otherwise errors about missing CUDA runtime libraries may be reported at cluster runtime. In short, the local training environment should be fully prepared before using these simple scripts.
 
-Then you should prepare same ROOT_DIR directory in all nodes. ROOT_DIR is from in cluster_scripts/conf.py. Assuming that the ROOT_DIR = /home/paddle, you can create ```paddle``` user account as well, at last ```paddle.py``` can ssh connections to all nodes with ```paddle``` user automatically.
+Then you should prepare the same ROOT_DIR directory on all nodes. ROOT_DIR is defined in cluster_train/conf.py. Assuming that ROOT_DIR = /home/paddle, you can also create a ```paddle``` user account, so that ```paddle.py``` can open ssh connections to all nodes as the ```paddle``` user automatically.
 
 Finally, you can set up mutual ssh trust between all nodes for easy ssh login; otherwise a ```password``` will be requested at runtime by ```paddle.py```.
 
@@ -28,35 +28,51 @@ Generally, you can use same model file from local training for cluster training.
 
 The following steps are based on the demo/recommendation demo in the demo directory.
 
-You just go through demo/recommendation tutorial doc until ```Train``` section, and at last you will get train/test data and model configuration file. Besides, you can place paddle binaries and related dependencies files in this demo/recommendation directory as well. Finaly, just use demo/recommendation as workspace for cluster training.
+Just go through the demo/recommendation tutorial doc up to the ```Train``` section; at the end you will have the train/test data and the model configuration file. Finally, use demo/recommendation as the workspace for cluster training.
 
 Your workspace should then look as follows:
 ```
 .
-|-- conf
-|   `-- trainer_config.conf
-|-- test
-|   |-- dnn_instance_000000
-|-- test.list
-|-- train
-|   |-- dnn_instance_000000
-|   |-- dnn_instance_000001
-`-- train.list
+|-- common_utils.py
+|-- data
+|   |-- config.json
+|   |-- config_generator.py
+|   |-- meta.bin
+|   |-- meta_config.json
+|   |-- meta_generator.py
+|   |-- ml-1m
+|   |-- ml_data.sh
+|   |-- ratings.dat.test
+|   |-- ratings.dat.train
+|   |-- split.py
+|   |-- test.list
+|   `-- train.list
+|-- dataprovider.py
+|-- evaluate.sh
+|-- prediction.py
+|-- preprocess.sh
+|-- requirements.txt
+|-- run.sh
+`-- trainer_config.py
 ```
-```conf/trainer_config.conf```
-Indicates the model config file.
+Not all of these files are needed for cluster training, but there is no need to remove the unused ones.
 
-```test``` and ```train```
-Train/test data. Different node should owns different parts of all Train data. This simple script did not do this job, so you should prepare it at last. All test data should be placed at node 0 only.
+```trainer_config.py```
+Indicates the model config file.
 
 ```train.list``` and ```test.list```
 File indexes. They store the relative or absolute file paths of all train/test data on the current node.
 
+```dataprovider.py```
+Used to read train/test samples. It is the same as in local training.
+
+```data```
+All files in the data directory are referred to by train.list/test.list, which are in turn referred to by the data provider.
 
 
 ## Prepare Cluster Job Configuration
 
-Set serveral options must be carefully set in cluster_scripts/conf.py
+The options below must be carefully set in cluster_train/conf.py:
 
 ```HOSTS``` all node hostnames or IPs that will run the cluster job. You can also append a user and ssh port to the hostname, such as [email protected]:9090.
 
@@ -70,6 +86,8 @@ Set serveral options must be carefully set in cluster_scripts/conf.py
 
 ```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of ports used for the sparse updater's communication channel. If sparse remote update is used, set it like ```PADDLE_PORTS_NUM```.
 
+```LD_LIBRARY_PATH``` set an additional LD_LIBRARY_PATH for the cluster job. You can use it to set the CUDA libraries path.
+
 Default configuration as follows:
 
 ```python
@@ -96,6 +114,9 @@ PADDLE_PORT = 7164
 PADDLE_PORTS_NUM = 2
 #pserver sparse ports num
 PADDLE_PORTS_NUM_FOR_SPARSE = 2
+
+#environments setting for all processes in cluster job
+LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
 ```
 
 ### Launching Cluster Job
@@ -107,15 +128,15 @@ PADDLE_PORTS_NUM_FOR_SPARSE = 2
 ```job_workspace``` set it to an already deployed workspace directory; ```paddle.py``` will skip the dispatch stage and directly launch the cluster job on all nodes. It can help to reduce heavy
 dispatch latency.
 
-```cluster_scripts/run.sh``` provides command line sample to run ```demo/recommendation``` cluster job, just modify ```job_dispatch_package``` and ```job_workspace``` with your defined directory, then:
+```cluster_train/run.sh``` provides a command line sample for running the ```demo/recommendation``` cluster job. Just modify ```job_dispatch_package``` and ```job_workspace``` to your own directories, then:
 ```
 sh run.sh
 ```
 
 The cluster job will start in several seconds.
 
 ### Kill Cluster Job
-```paddle.py``` can capture ```Ctrl + C``` SIGINT signal to automatically kill all processes launched by it. So just stop ```paddle.py``` to kill cluster job.
+```paddle.py``` captures the ```Ctrl + C``` SIGINT signal and automatically kills all processes it launched, so simply stopping ```paddle.py``` kills the cluster job. You should kill the job manually if the program crashed.
 
 ### Check Cluster Training Result
 Check the logs in $workspace/log for details; each node has the same log structure.

paddle/scripts/cluster_train/conf.py

Lines changed: 3 additions & 0 deletions
@@ -35,3 +35,6 @@
 PADDLE_PORTS_NUM = 2
 #pserver sparse ports num
 PADDLE_PORTS_NUM_FOR_SPARSE = 2
+
+#environments setting for all processes in cluster job
+LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
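For orientation, a minimal ```cluster_train/conf.py``` assembled only from the options this commit and its documentation mention might look like the sketch below. The host names and directory are illustrative placeholders, and any options not shown in this diff are omitted:

```python
# Hypothetical minimal conf.py sketch -- adjust hosts, directories and ports to your cluster.

# All nodes that will run the cluster job; "user@host:port" entries are accepted,
# e.g. the doc's example "[email protected]:9090".
HOSTS = [
    "[email protected]",
    "[email protected]",
]

# Workspace root directory that must exist (and be identical) on every node.
ROOT_DIR = "/home/paddle"

# Pserver communication settings (default values quoted in the doc diff above).
PADDLE_PORT = 7164
PADDLE_PORTS_NUM = 2
# pserver sparse ports num
PADDLE_PORTS_NUM_FOR_SPARSE = 2

# environments setting for all processes in cluster job
LD_LIBRARY_PATH = "/usr/local/cuda/lib64:/usr/lib64"
```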

paddle/scripts/cluster_train/paddle.py

Lines changed: 22 additions & 14 deletions
@@ -24,7 +24,7 @@
 import signal
 
 
-from fabric.api import run, put, settings, env
+from fabric.api import run, put, settings, env, prefix
 from fabric.tasks import execute
 
 #configuration for cluster
@@ -112,12 +112,15 @@ def start_pserver(jobdir, pargs):
         '''
         start pserver process with fabric executor
         '''
-        program = 'paddle pserver'
-        run('cd ' + jobdir + '; ' + \
-            'GLOG_logtostderr=0 GLOG_log_dir="./log" ' + \
-            'nohup ' + \
-            program + " " + pargs + ' > ./log/server.log 2>&1 < /dev/null & ',
-            pty=False)
+        with prefix('export LD_LIBRARY_PATH=' + \
+                    conf.LD_LIBRARY_PATH + \
+                    ':$LD_LIBRARY_PATH'):
+            program = 'paddle pserver'
+            run('cd ' + jobdir + '; ' + \
+                'GLOG_logtostderr=0 GLOG_log_dir="./log" ' + \
+                'nohup ' + \
+                program + " " + pargs + ' > ./log/server.log 2>&1 < /dev/null & ',
+                pty=False)
 
     execute(start_pserver, jobdir, pargs, hosts=conf.HOSTS)
 
@@ -152,13 +155,16 @@ def start_trainer(jobdir, args):
         '''
         start trainer process with fabric executor
         '''
-        program = 'paddle train'
-        run('cd ' + jobdir + '; ' + \
-            'GLOG_logtostderr=0 '
-            'GLOG_log_dir="./log" '
-            'nohup ' + \
-            program + " " + args + " > ./log/train.log 2>&1 < /dev/null & ",
-            pty=False)
+        with prefix('export LD_LIBRARY_PATH=' + \
+                    conf.LD_LIBRARY_PATH + \
+                    ':$LD_LIBRARY_PATH'):
+            program = 'paddle train'
+            run('cd ' + jobdir + '; ' + \
+                'GLOG_logtostderr=0 '
+                'GLOG_log_dir="./log" '
+                'nohup ' + \
+                program + " " + args + " > ./log/train.log 2>&1 < /dev/null & ",
+                pty=False)
 
     for i in xrange(len(conf.HOSTS)):
         train_args = copy.deepcopy(args)
@@ -230,3 +236,5 @@ def kill_process():
         job_all(args.job_dispatch_package,
                 None,
                 train_args_dict)
+    else:
+        print "--job_workspace or --job_dispatch_package should be set"
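The substantive change in ```paddle.py``` is wrapping each remote ```run``` call in Fabric's ```prefix``` context manager, so that the ```LD_LIBRARY_PATH``` value from ```conf.py``` is exported before ```paddle pserver``` or ```paddle train``` starts on the remote node. A stripped-down sketch of that pattern follows; the function name, job directory and host are hypothetical, and Fabric 1.x (the API ```paddle.py``` imports) is assumed:

```python
# Sketch only: mirrors the prefix/run pattern added by this commit.
from fabric.api import run, prefix
from fabric.tasks import execute

LD_LIBRARY_PATH = "/usr/local/cuda/lib64:/usr/lib64"  # normally read from conf.py

def start_remote_job(jobdir, command):
    # Commands issued inside this block are effectively executed as
    # "export LD_LIBRARY_PATH=...:$LD_LIBRARY_PATH && <command>" on the remote host,
    # so CUDA runtime libraries are visible to the launched process.
    with prefix('export LD_LIBRARY_PATH=' + LD_LIBRARY_PATH + ':$LD_LIBRARY_PATH'):
        run('cd ' + jobdir + '; nohup ' + command +
            ' > ./log/out.log 2>&1 < /dev/null &', pty=False)

# Example dispatch to one (hypothetical) node:
# execute(start_remote_job, '/home/paddle/JOB_workspace', 'paddle pserver',
#         hosts=['[email protected]'])
```

Because ```prefix``` only prepends the export to every command run inside the ```with``` block, the original ```run(...)``` bodies in ```start_pserver``` and ```start_trainer``` did not need to change; they were just re-indented under the new block.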
