# Design Doc: Asynchronous Update With Distributed Training

## Background

For typical synchronous distributed training, the significant steps are as follows:

1. A Trainer computes the gradients and SENDs them to the Parameter Server (PServer) nodes.
1. After a PServer node has received gradients from all the Trainers, it aggregates the
gradient variables for the same parameter into one gradient variable and then applies the
aggregated gradient to the respective parameter, using an optimization algorithm (SGD,
Momentum, ...) to update the parameter.
1. The Trainers wait until the PServers have finished the optimization stage, then GET the
parameters from the PServers, so all the Trainers hold the same parameters.
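The steps above can be sketched in plain Python. This is an illustrative toy model of the synchronous cycle, not Fluid code: the PServer applies SGD only after gradients from ALL trainers have arrived, so every trainer blocks on the slowest one (the implicit barrier). The class and method names are assumptions for illustration.

```python
import threading

class SyncPServer:
    """Toy single-parameter PServer with a barrier across trainers."""
    def __init__(self, param, num_trainers, lr=0.1):
        self.param = param
        self.num_trainers = num_trainers
        self.lr = lr
        self.grads = []
        self.cond = threading.Condition()

    def send_grad(self, grad):
        # Trainer -> PServer: deposit one gradient for this mini-batch.
        with self.cond:
            self.grads.append(grad)
            if len(self.grads) == self.num_trainers:
                # All gradients arrived: aggregate, then apply SGD.
                agg = sum(self.grads) / self.num_trainers
                self.param -= self.lr * agg
                self.grads = []
                self.cond.notify_all()
            else:
                # Barrier: wait for the remaining (possibly slow) trainers.
                self.cond.wait()

    def get_param(self):
        # Trainer GETs the freshly optimized parameter.
        return self.param
```

With two trainers sending gradients `1.0` and `3.0` and `lr=0.1`, both see the same updated parameter `1.0 - 0.1 * 2.0 = 0.8` only after both have reported in.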

In synchronous distributed training, there must be a `Barrier` to synchronize the
parameters after the optimization stage, so the performance of a distributed training
job depends on its slowest node. With hundreds or thousands of training nodes in a
job, the performance of synchronous distributed training can become very poor because
of a single slow node. This design doc therefore introduces an approach to implement
*asynchronous* distributed training in PaddlePaddle Fluid.

## Design

<img src="./src/async_update.png" width="600"/>

As shown in the figure above, we describe a global view of the asynchronous update
process, using the parameter `w1` as an example, in the following steps:

1. The gradient variables may be distributed across different GPU cards; aggregate
them once they have all been calculated.
1. Split each gradient variable into multiple blocks according to the number of PServer
instances and then send the blocks.
1. Each PServer runs an `Optimize Block`, using a specified optimization algorithm to
update the specified parameter.
1. The trainer fetches the latest parameter from the PServer before running a forward Op
that depends on the specified parameter.
1. Broadcast the received variable to the multiple GPU cards and continue with the next
mini-batch.
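A minimal sketch of steps 2–4 above, assuming NumPy arrays and SGD on each shard: the gradient is split by the number of PServer instances, each PServer updates its parameter shard as soon as its block arrives (no barrier), and the trainer later fetches and reassembles the parameter. The names (`AsyncPServer`, `split_and_send`, `fetch_param`) are illustrative, not PaddlePaddle Fluid APIs.

```python
import numpy as np

class AsyncPServer:
    """One PServer instance owning a shard of a parameter such as w1."""
    def __init__(self, shard, lr=0.1):
        self.shard = shard  # this server's slice of the parameter
        self.lr = lr

    def recv_grad(self, grad_block):
        # Optimize Block: apply SGD immediately, without waiting for
        # gradients from other trainers.
        self.shard -= self.lr * grad_block

    def send_param(self):
        return self.shard

def split_and_send(grad, pservers):
    # Step 2: split the gradient according to the number of PServers.
    for block, ps in zip(np.array_split(grad, len(pservers)), pservers):
        ps.recv_grad(block)

def fetch_param(pservers):
    # Step 4: fetch the latest shards and concatenate them back into
    # the full parameter before the dependent forward Op runs.
    return np.concatenate([ps.send_param() for ps in pservers])
```

Because each shard is updated independently, different trainers may observe parameters built from different numbers of applied updates; that staleness is the price of removing the barrier.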

### Trainer

- For multi-device distributed training, we first need to aggregate the gradient
variables placed on different devices, and then schedule a `SendVars` operator to
send the gradient variables to the multiple PServer instances.
- Schedule a `FetchVars` operator to fetch the latest parameters from the PServers before
running the forward ops.
- There could be a large number of gradient variables to send, so we use a separate
thread pool (IO thread pool), whose number of schedulable threads is larger than that
of the computing thread pool, to avoid competing with computation for thread resources.
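The two-pool idea in the last bullet can be sketched with Python's standard thread pools. The pool sizes and the `send_var`/`fetch_var` callables are assumptions for illustration, not actual Fluid internals: the point is only that RPC-style work is queued on a larger IO pool and returns futures, so compute threads are never blocked on the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Compute pool sized for the device work; IO pool deliberately larger,
# since many sends/fetches may be in flight at once.
compute_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="compute")
io_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="io")

def schedule_send(send_var, grad_name):
    # SendVars-style call: queued on the IO pool, returns a future
    # immediately so compute threads keep running backward ops.
    return io_pool.submit(send_var, grad_name)

def schedule_fetch(fetch_var, param_name):
    # FetchVars-style call: also on the IO pool; a forward op waits on
    # the future only for the parameter it actually depends on.
    return io_pool.submit(fetch_var, param_name)
```

A caller would block on a returned future only at the point where the fetched parameter is genuinely needed.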

### Parameter Server

<img src="./src/async_pserver.png" width="750"/>

- Multiple trainer instances may try to optimize the same parameter at the same
time; to avoid races, we need one `BlockingQueue` for each gradient variable so
that its updates are processed one by one.
- We need a `Map` structure that maps a gradient variable name to the `OptimizeBlock`
that optimizes the respective parameter.
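A minimal sketch of these two PServer-side structures, using Python's `queue.Queue` as the blocking queue and a dict as the map; the `GradDispatcher` name and the SGD callable in the usage are illustrative only, not Fluid code. One worker thread per gradient variable drains its queue, so concurrent pushes from many trainers are serialized per parameter.

```python
import queue
import threading

class GradDispatcher:
    """One blocking queue per gradient variable, plus a map from
    gradient name to the OptimizeBlock that updates its parameter."""
    def __init__(self, optimize_blocks):
        # Map: gradient variable name -> OptimizeBlock callable.
        self.optimize_blocks = optimize_blocks
        self.queues = {name: queue.Queue() for name in optimize_blocks}
        for name in optimize_blocks:
            threading.Thread(target=self._worker, args=(name,),
                             daemon=True).start()

    def _worker(self, name):
        q = self.queues[name]
        block = self.optimize_blocks[name]
        while True:
            grad = q.get()     # gradients are processed one by one
            if grad is None:   # sentinel: shut this worker down
                q.task_done()
                break
            block(grad)        # run the OptimizeBlock for this parameter
            q.task_done()

    def push(self, name, grad):
        # Called concurrently by many trainer connections; the queue
        # serializes updates to the same parameter.
        self.queues[name].put(grad)

    def join(self):
        for q in self.queues.values():
            q.put(None)
        for q in self.queues.values():
            q.join()
```

Because each queue has a single worker, no lock is needed inside the `OptimizeBlock` itself; the queue provides the per-parameter ordering.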