doc/cluster/opensource/cluster_train.md
Lines changed: 42 additions & 21 deletions
@@ -1,8 +1,8 @@
1
1
# Cluster Training
2
2
3
-
We provide this simple scripts to help you to launch cluster training Job to harness PaddlePaddle's distributed trainning. For MPI and other cluster scheduler refer this naive script to implement more robust cluster training platform by yourself.
3
+
We provide some simple scripts in ```paddle/scripts/cluster_train``` to help you launch a cluster training job that harnesses PaddlePaddle's distributed training. For MPI and other cluster schedulers, you can use these naive scripts as a reference when implementing a more robust cluster training platform yourself.
4
4
5
-
The following cluster demo is based on RECOMMENDATION local training demo in PaddlePaddle ```demo/recommendation``` directory. Assuming you enter the cluster_scripts/ directory.
5
+
The following cluster demo is based on the RECOMMENDATION local training demo in the PaddlePaddle ```demo/recommendation``` directory. The steps below assume you are in the ```paddle/scripts/cluster_train/``` directory.
6
6
7
7
## Pre-requirements
8
8
@@ -12,9 +12,9 @@ Firstly,
12
12
pip install fabric
13
13
```
14
14
15
-
Secondly, go through installing scripts to install PaddlePaddle at all nodes to make sure demo can run as local mode.
15
+
Secondly, go through the installation scripts to install PaddlePaddle on all nodes and make sure the demo can run in local mode. For CUDA-enabled training, we assume that CUDA is installed in ```/usr/local/cuda```; otherwise, missing CUDA runtime library errors may be reported at cluster runtime. In short, the local training environment should be fully prepared before you use these simple scripts.
16
16
17
-
Then you should prepare same ROOT_DIR directory in all nodes. ROOT_DIR is from in cluster_scripts/conf.py. Assuming that the ROOT_DIR = /home/paddle, you can create ```paddle``` user account as well, at last ```paddle.py``` can ssh connections to all nodes with ```paddle``` user automatically.
17
+
Then you should prepare the same ROOT_DIR directory on all nodes. ROOT_DIR is defined in cluster_train/conf.py. Assuming ROOT_DIR = /home/paddle, you can also create a ```paddle``` user account, so that ```paddle.py``` can open ssh connections to all nodes as the ```paddle``` user automatically.
18
18
19
19
Finally, you can set up ssh mutual trust between all nodes for easy ssh login; otherwise a ```password``` must be provided to ```paddle.py``` at runtime.
20
20
@@ -28,35 +28,51 @@ Generally, you can use same model file from local training for cluster training.
28
28
29
29
The following steps are based on the demo/recommendation demo in the demo directory.
30
30
31
-
You just go through demo/recommendation tutorial doc until ```Train``` section, and at last you will get train/test data and model configuration file. Besides, you can place paddle binaries and related dependencies files in this demo/recommendation directory as well. Finaly, just use demo/recommendation as workspace for cluster training.
31
+
Go through the demo/recommendation tutorial doc up to the ```Train``` section; by then you will have the train/test data and the model configuration file. Finally, just use demo/recommendation as the workspace for cluster training.
32
32
33
33
In the end, your workspace should look as follows:
34
34
```
35
35
.
36
-
|-- conf
37
-
| `-- trainer_config.conf
38
-
|-- test
39
-
| |-- dnn_instance_000000
40
-
|-- test.list
41
-
|-- train
42
-
| |-- dnn_instance_000000
43
-
| |-- dnn_instance_000001
44
-
`-- train.list
36
+
|-- common_utils.py
37
+
|-- data
38
+
| |-- config.json
39
+
| |-- config_generator.py
40
+
| |-- meta.bin
41
+
| |-- meta_config.json
42
+
| |-- meta_generator.py
43
+
| |-- ml-1m
44
+
| |-- ml_data.sh
45
+
| |-- ratings.dat.test
46
+
| |-- ratings.dat.train
47
+
| |-- split.py
48
+
| |-- test.list
49
+
| `-- train.list
50
+
|-- dataprovider.py
51
+
|-- evaluate.sh
52
+
|-- prediction.py
53
+
|-- preprocess.sh
54
+
|-- requirements.txt
55
+
|-- run.sh
56
+
`-- trainer_config.py
45
57
```
46
-
```conf/trainer_config.conf```
47
-
Indicates the model config file.
58
+
Not all of these files are needed for cluster training, but there is no need to remove the unused ones.
48
59
49
-
```test``` and ```train```
50
-
Train/test data. Different node should owns different parts of all Train data. This simple script did not do this job, so you should prepare it at last. All test data should be placed at node 0 only.
60
+
```trainer_config.py```
61
+
Indicates the model config file.
51
62
52
63
```train.list``` and ```test.list```
53
64
File index. It stores the relative or absolute paths of all train/test data files on the current node.
54
65
66
+
```dataprovider.py```
67
+
Used to read train/test samples. It is the same as in local training.
68
+
69
+
```data```
70
+
All files in the data directory are referenced by train.list/test.list, which are in turn referenced by the data provider.
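
As an illustration of the file index described above, here is a minimal sketch of how a list file such as ```train.list``` could be generated on a node. The helper name ```write_file_list``` and its arguments are hypothetical and not part of the cluster_train scripts; the only assumption taken from the text is that the index holds one data file path per line.

```python
import os


def write_file_list(data_dir, list_path):
    """Write the path of every regular file directly under data_dir
    to list_path, one path per line, and return the paths written.

    Hypothetical helper: it only illustrates the "file index" idea.
    """
    paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if os.path.isfile(os.path.join(data_dir, name))
    )
    with open(list_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths
```

Each node would run something like this against its own share of the data, since the index describes the files present on the current node.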
55
71
56
72
57
73
## Prepare Cluster Job Configuration
58
74
59
-
Set serveral options must be carefully set in cluster_scripts/conf.py
75
+
The options below must be carefully set in cluster_train/conf.py:
60
76
61
77
```HOSTS``` the hostnames or IPs of all nodes that will run the cluster job. You can also add a user and ssh port to a hostname, such as [email protected]:9090.
62
78
@@ -70,6 +86,8 @@ Set serveral options must be carefully set in cluster_scripts/conf.py
70
86
71
87
```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of ports used for the sparse updater cluster communication channel. If sparse remote update is used, set it in the same way as ```PADDLE_PORTS_NUM```.
72
88
89
+
```LD_LIBRARY_PATH``` sets an additional LD_LIBRARY_PATH for the cluster job. You can use it to set the CUDA libraries path.
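
To make the ```HOSTS``` entry format concrete, the sketch below splits an entry of the form ```user@host:port``` into its parts. This is only an illustration: ```parse_host```, its defaults, and the sample addresses are hypothetical, and it is not how ```paddle.py``` or fabric actually handles host strings.

```python
def parse_host(entry, default_user=None, default_port=22):
    """Split a "user@host:port" entry into (user, host, port).

    Hypothetical helper for illustration only. The user and port
    parts are optional and fall back to the given defaults.
    """
    user, at, rest = entry.rpartition("@")
    if not at:
        # No "@" present: the whole entry is "host" or "host:port".
        user = default_user
    host, colon, port = rest.partition(":")
    return user, host, int(port) if colon else default_port
```

For example, ```parse_host("paddle@10.0.0.1:9090")``` yields ```("paddle", "10.0.0.1", 9090)```, while a bare ```"node1"``` falls back to the default user and port.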
90
+
73
91
The default configuration is as follows:
74
92
75
93
```python
@@ -96,6 +114,9 @@ PADDLE_PORT = 7164
96
114
PADDLE_PORTS_NUM=2
97
115
#pserver sparse ports num
98
116
PADDLE_PORTS_NUM_FOR_SPARSE=2
117
+
118
+
#environments setting for all processes in cluster job
```job_workspace``` set it to an already deployed workspace directory; ```paddle.py``` will then skip the dispatch stage and launch the cluster job directly on all nodes. This can help reduce heavy
108
129
dispatch latency.
109
130
110
-
```cluster_scripts/run.sh``` provides command line sample to run ```demo/recommendation``` cluster job, just modify ```job_dispatch_package``` and ```job_workspace``` with your defined directory, then:
131
+
```cluster_train/run.sh``` provides a sample command line for running the ```demo/recommendation``` cluster job. Just set ```job_dispatch_package``` and ```job_workspace``` to your own directories, then:
111
132
```
112
133
sh run.sh
113
134
```
114
135
115
136
The cluster job will start within several seconds.
116
137
117
138
### Kill Cluster Job
118
-
```paddle.py``` can capture ```Ctrl + C``` SIGINT signal to automatically kill all processes launched by it. So just stop ```paddle.py``` to kill cluster job.
139
+
```paddle.py``` captures the ```Ctrl + C``` SIGINT signal and automatically kills all processes it launched, so just stop ```paddle.py``` to kill the cluster job. You should kill the job manually if the program crashes.
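
The general pattern of catching SIGINT and cleaning up launched processes can be sketched as follows. This is a local illustration under stated assumptions, not the actual ```paddle.py``` implementation: the names ```children```, ```kill_children``` and ```install_sigint_handler``` are hypothetical, and the sketch only handles processes started on the local machine, whereas a real cluster job also has remote processes to stop.

```python
import signal
import subprocess
import sys

# Processes started by this launcher (hypothetical bookkeeping list).
children = []


def kill_children():
    """Terminate every child that is still running, then reap them all."""
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
    for proc in children:
        proc.wait()


def handle_sigint(signum, frame):
    """SIGINT (Ctrl + C) handler: clean up the children, then exit."""
    kill_children()
    sys.exit(1)


def install_sigint_handler():
    """Register the handler; call this once at launcher startup."""
    signal.signal(signal.SIGINT, handle_sigint)
```

A launcher would call ```install_sigint_handler()``` once and append each ```subprocess.Popen``` object to ```children``` as it starts them. If the launcher itself crashes, the handler never runs, which is why a manual kill is needed in that case.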
119
140
120
141
### Check Cluster Training Result
121
142
Check the logs in $workspace/log for details; each node has the same log structure.