- ports_num: **required, default 1**, total number of ports the parameter server will listen on.
- ports_num_for_sparse: **required, default 0**, number of ports that serve sparse parameter updates.
- num_gradient_servers: **required, default 1**, total number of gradient servers.
- nics: **optional, default xgbe0,xgbe1**, network device name(s) the parameter server will listen on (a combined launch sketch follows below).
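
As a rough sketch of how these options might be combined on one command line (assuming the `paddle pserver` command accepts flags named after the options above, and that a `--port` flag is available; check `paddle pserver --help` for the exact set on your build):

```bash
# Hypothetical invocation; flag names mirror the options described above,
# and --port is an assumed extra flag for the base listening port.
paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 \
    --num_gradient_servers=1 --nics=xgbe0,xgbe1
```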
## Starting trainer
Type the command below to start the trainer (name the file whatever you want, e.g. "train.py"):
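
A minimal sketch, assuming your training program is saved as `train.py` in the current directory of each trainer node:

```bash
# run the training program on every trainer node
python train.py
```

Inside the training program, the trainer is configured with the parameters described below.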
- trainer_id: **required, default 0**, ID of each trainer, starting from 0.
- pservers: **required, default 127.0.0.1**, list of parameter server IPs, separated by "," (see the sketch below for one way these settings might be passed to the trainer).
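
As a sketch of how such settings might reach the trainer (assuming the v2-style `paddle.init()` entry point, which forwards keyword arguments as runtime flags; the exact API may differ in your version):

```python
import paddle.v2 as paddle

# Hypothetical initialization; values are the documented defaults, and
# use_gpu/trainer_count/port are assumed extra flags not described above.
paddle.init(
    use_gpu=False,
    trainer_count=1,
    port=7164,
    ports_num=1,
    ports_num_for_sparse=0,
    num_gradient_servers=1,
    trainer_id=0,               # unique ID of this trainer, starting from 0
    pservers="127.0.0.1")       # comma-separated parameter server IPs
```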
## Prepare Training Dataset
Here's some example code, [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py): it downloads the public `imikolov` dataset and splits it into multiple files according to the job parallelism (the number of trainers). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files.
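
As a rough illustration of the splitting idea (not the actual `prepare.py`), a corpus can be dealt out line by line into `SPLIT_COUNT` suffixed shard files:

```python
# Illustrative only: deal lines of train.txt round-robin into SPLIT_COUNT shards
# named train.txt-00000, train.txt-00001, ...
SPLIT_COUNT = 3

outs = [open("train.txt-%05d" % i, "w") for i in range(SPLIT_COUNT)]
with open("train.txt") as f:
    for i, line in enumerate(f):
        outs[i % SPLIT_COUNT].write(line)
for out in outs:
    out.close()
```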
The example code `prepare.py` splits the training data and testing data into 3 files with numeric suffixes such as `-00000`, `-00001` and `-00002`:
```bash
train.txt
train.txt-00000
train.txt-00001
train.txt-00002
```
Different training jobs may have different data formats and `reader()` functions, so developers may need to write different data-preparation scripts and `reader()` functions for their jobs.
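
For instance, a simple line-oriented `reader()` for the split files above might let each trainer pick its own shards by `trainer_id` (a sketch; the real reader depends on your data format and dispatch method):

```python
import glob

def cluster_reader(file_pattern, trainer_count, trainer_id):
    """Build a reader() that yields only this trainer's share of the shards."""
    def reader():
        flist = sorted(glob.glob(file_pattern))
        for i, fname in enumerate(flist):
            # each trainer takes the shards whose index matches its trainer_id
            if i % trainer_count != trainer_id:
                continue
            with open(fname) as f:
                for line in f:
                    yield line.rstrip("\n")
    return reader

# example: trainer 0 of 3 reads train.txt-00000, train.txt-00003, ...
train_reader = cluster_reader("train.txt-*", trainer_count=3, trainer_id=0)
```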
## Prepare Training Program
We'll create a *workspace* directory on each node to store your training program, its dependencies, and the mounted or downloaded dataset directory.
Your workspace may look like:
```bash
.
|-- my_lib.py
|-- word_dict.pickle
|-- train_data_dir/
|-- test_data_dir/
```
- `train_data_dir`: contains the training data. Mount it from a storage service or copy the training data here.
- `test_data_dir`: contains the testing data.
## Async SGD Update
We can set some parameters of the optimizer to make it support async SGD updates.
For example, we can set the `is_async` and `async_lagged_grad_discard_ratio` of the `AdaGrad` optimizer:
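
A sketch of what that could look like (assuming a v2-style `paddle.optimizer.AdaGrad` that accepts these keyword arguments; the exact constructor and spelling may differ across versions):

```python
import paddle.v2 as paddle

# Hypothetical values: is_async switches on asynchronous SGD updates, and
# async_lagged_grad_discard_ratio controls when overly lagged gradients
# are discarded instead of being applied.
adagrad = paddle.optimizer.AdaGrad(
    learning_rate=3e-3,
    is_async=True,
    async_lagged_grad_discard_ratio=1.6)
```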