For typical synchronous distributed training, the significant steps are as follows:

1. A Trainer computes the gradients and SENDs them to the Parameter Server (PServer) nodes.
1. After a PServer node has received the gradients from all the Trainers, it applies them to the
respective variables and uses an optimization algorithm (SGD, Momentum, ...) to update the parameters.
1. The Trainer waits for the PServers to finish the optimization stage, then GETs the parameters from
the PServers, so all the Trainers end up with the same parameters (a sketch of this cycle follows the list).
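
Below is a minimal, plain-Python sketch of this SEND / optimize / GET cycle. The class and method
names (`PServer`, `send`, `maybe_optimize`, `get`) are hypothetical and only illustrate the barrier
semantics; they are not part of the Fluid API.

```python
class PServer:
    def __init__(self, param, num_trainers, lr=0.01):
        self.param = param            # the parameter shard this PServer owns
        self.num_trainers = num_trainers
        self.lr = lr
        self.pending = []             # gradients received for the current step

    def send(self, grad):
        """A Trainer SENDs one gradient to this PServer."""
        self.pending.append(grad)

    def maybe_optimize(self):
        """Barrier: update only after gradients from *all* Trainers have arrived."""
        if len(self.pending) == self.num_trainers:
            avg_grad = sum(self.pending) / self.num_trainers
            self.param -= self.lr * avg_grad   # SGD update
            self.pending = []

    def get(self):
        """A Trainer GETs the freshly updated parameter."""
        return self.param


def sync_train_one_step(trainer_grads, pserver):
    for grad in trainer_grads:        # 1. every Trainer SENDs its gradient
        pserver.send(grad)
    pserver.maybe_optimize()          # 2. optimize once all gradients have arrived
    return [pserver.get() for _ in trainer_grads]   # 3. every Trainer GETs the parameters


ps = PServer(param=1.0, num_trainers=2)
print(sync_train_one_step([0.2, 0.4], ps))   # both Trainers see the same parameter value
```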

In synchronous distributed training, there should be a `Barrier` to synchronize the
parameters after the optimization stage. The performance of a distributed training job
depends on the slowest node; if there are hundreds or thousands of training nodes in a
job, the performance of synchronous distributed training might be very poor because of
a single slow node. So this design doc introduces an approach to implement
*asynchronous* distributed training in PaddlePaddle Fluid.

## Design

<img src="./src/async_update.png" width="600"/>

As shown in the figure above, we describe a global view of the asynchronous update process and use
the parameter `w1` as an example to introduce the steps:
1. Each gradient variable may be distributed across different GPU cards; aggregate them
once they have all been calculated.
1. Split the gradient variable into multiple blocks according to the number of PServer
instances and then send them.
1. The PServer runs an `Optimize Block`, which uses a specified optimization algorithm to update
the specified parameter.
1. The trainer fetches the parameter before running a forward Op that depends on that
parameter.
1. Broadcast the received variable to the multiple GPU cards and continue to run the next
mini-batch (a sketch of this flow follows the list).
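
Below is a rough NumPy sketch of this flow for the single parameter `w1`; `trainer_step` and
`optimize_block` are hypothetical names, not Fluid APIs. The key contrast with the synchronous case
is that each PServer updates its block as soon as a gradient arrives, without a barrier across
trainers.

```python
import numpy as np

NUM_PSERVERS = 2
LR = 0.01

# PServer-side state: each PServer holds one block (shard) of w1
w1_blocks = np.array_split(np.arange(8, dtype=np.float64), NUM_PSERVERS)


def optimize_block(pserver_id, grad_block):
    # 3. the PServer's `Optimize Block`: plain SGD, applied immediately,
    #    without waiting for gradients from other trainers (no barrier)
    w1_blocks[pserver_id] -= LR * grad_block


def trainer_step(grads_per_gpu):
    # 1. aggregate the gradient of w1 across the GPU cards
    grad = np.mean(grads_per_gpu, axis=0)
    # 2. split the gradient into blocks and "send" each block to its PServer
    for pserver_id, grad_block in enumerate(np.array_split(grad, NUM_PSERVERS)):
        optimize_block(pserver_id, grad_block)
    # 4. fetch the latest blocks before the forward Op that uses w1
    #    (5. broadcasting the fetched w1 to the GPU cards is omitted here)
    return np.concatenate(w1_blocks)


print(trainer_step([np.ones(8), 3.0 * np.ones(8)]))
```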

### Trainer

- For multi-device distributed training, we need to first aggregate the gradient
variables placed on the different devices, and then schedule a `SendVars` Operator to
send the gradient variables to the multiple PServer instances.
- Schedule a `FetchVars` operator to fetch the latest parameters from the PServer before running
the forward ops.
- There could be a large number of gradient variables to send, so we need to use another
thread pool (IO Threadpool) whose number of schedulable threads is larger than that of the
computing thread pool, to avoid competing for thread resources with computation (see the sketch
below).
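
A rough sketch of this trainer-side scheduling using Python's standard `concurrent.futures` thread
pools; `send_vars` and `fetch_vars` are hypothetical stand-ins for the proposed `SendVars` /
`FetchVars` operators, and the RPC calls are placeholders rather than real Fluid APIs.

```python
from concurrent.futures import ThreadPoolExecutor

COMPUTE_THREADS = 4
IO_THREADS = 16                      # deliberately larger than the compute pool

compute_pool = ThreadPoolExecutor(max_workers=COMPUTE_THREADS)   # runs the compute ops (not shown)
io_pool = ThreadPoolExecutor(max_workers=IO_THREADS)             # runs SendVars / FetchVars


def send_vars(grad_name, grad_value, pserver_endpoints):
    # placeholder for the proposed `SendVars` operator: an RPC to every PServer
    for endpoint in pserver_endpoints:
        pass  # e.g. rpc_client.async_send(endpoint, grad_name, grad_value)


def fetch_vars(param_name, pserver_endpoints):
    # placeholder for the proposed `FetchVars` operator
    return param_name  # e.g. rpc_client.get(pserver_endpoints, param_name)


def run_minibatch(aggregated_grads, param_names, pserver_endpoints):
    # hand the (already aggregated) gradients to the IO pool so that sending
    # never blocks, or competes with, the computing thread pool
    for name, value in aggregated_grads.items():
        io_pool.submit(send_vars, name, value, pserver_endpoints)
    # fetch the latest parameters before the forward ops that need them
    futures = [io_pool.submit(fetch_vars, name, pserver_endpoints)
               for name in param_names]
    return [f.result() for f in futures]


print(run_minibatch({"w1@GRAD": [0.1]}, ["w1"], ["127.0.0.1:6174"]))
```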

### Parameter Server