Commit 7a2297d (authored by Yancey, 2 parents: 64bf3df + 936dfcb)
Merge pull request #9932 from Yancey1989/async_update_design
Add async update design doc

# Design Doc: Asynchronous Update With Distributed Training

## Background

For typical synchronous distributed training, the significant steps are as follows:

1. A Trainer computes the gradients and SENDs them to the Parameter Server (PServer) nodes.
1. After a PServer node has received the gradients from all Trainers, it aggregates the gradient variables for the same parameter into one gradient variable, then applies the aggregated gradient to the respective parameter with an optimization algorithm (SGD, Momentum, ...).
1. Each Trainer waits for the PServers to finish the optimization stage, then GETs the updated parameters from the PServers, so all Trainers hold the same parameters.
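
The synchronous round described above can be sketched as a toy Python simulation. The `sgd_update` helper and the gradient values are illustrative assumptions, not the Fluid API:

```python
import numpy as np

def sgd_update(param, grad, lr=0.1):
    """Apply one SGD step to a parameter (toy stand-in for the optimizer)."""
    return param - lr * grad

# One parameter w1 shared by three trainers.
w1 = np.array([1.0, 2.0])
trainer_grads = [np.array([0.3, 0.3]), np.array([0.1, 0.5]), np.array([0.2, 0.1])]

# Step 2: the PServer waits for gradients from ALL trainers,
# aggregates them into one gradient, then runs the optimizer.
aggregated = sum(trainer_grads) / len(trainer_grads)
w1 = sgd_update(w1, aggregated)

# Step 3: after the barrier, every trainer GETs the same updated parameter.
params_seen_by_trainers = [w1.copy() for _ in range(3)]
print(w1)  # -> [0.98 1.97]
```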

In synchronous distributed training, there must be a `Barrier` to synchronize the parameters after the optimization stage, so the performance of a distributed training job depends on its slowest node. With hundreds or thousands of training nodes in a job, the performance of synchronous distributed training can be very poor because of slow nodes. This design doc therefore introduces an approach to implement *asynchronous* distributed training in PaddlePaddle Fluid.

## Design

<img src="./src/async_update.png" width="600"/>

The figure above gives a global view of the asynchronous update process, using the parameter `w1` as an example to walk through the steps:

1. Each gradient variable may be distributed across different GPU cards; aggregate the pieces once they have all been computed.
1. Split the gradient variable into multiple blocks according to the number of PServer instances and then send the blocks.
1. The PServer runs an `Optimize Block`, using a specified optimization algorithm, to update the specified parameter.
1. The Trainer fetches the latest parameter from the PServer before running a forward Op that depends on that parameter.
1. Broadcast the received variable to the multiple GPU cards and continue to run the next mini-batch.
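
Step 2 above, splitting a gradient variable into per-PServer blocks, can be sketched as follows. The block assignment shown is only an assumption for illustration:

```python
import numpy as np

def split_to_pservers(grad, num_pservers):
    """Split a flat gradient variable into num_pservers contiguous blocks.

    Each block is later sent to one PServer instance; block i of every
    gradient variable goes to PServer i.
    """
    return np.array_split(grad, num_pservers)

grad_w1 = np.arange(10, dtype=np.float64)  # toy gradient with 10 elements
blocks = split_to_pservers(grad_w1, 3)
print([b.tolist() for b in blocks])
# -> [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
```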

### Trainer

- For multi-device distributed training, we first need to aggregate the gradient variables placed on different devices and then schedule a `SendVars` Operator to send the gradient variables to the multiple PServer instances.
- Schedule a `FetchVars` operator to fetch the latest parameters from the PServers before running the forward ops.
- There could be a large number of gradient variables to send, so we need a separate thread pool (an IO thread pool) with more schedulable threads than the computing thread pool, to avoid competing with the computation for thread resources.
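
The separate IO thread pool can be sketched with Python's standard `concurrent.futures`. The pool sizes and the `send_grad` stand-in are illustrative assumptions, not the Fluid implementation:

```python
from concurrent.futures import ThreadPoolExecutor, wait

COMPUTE_THREADS = 4
IO_THREADS = 16  # larger than the computing pool, so sends do not starve compute

compute_pool = ThreadPoolExecutor(max_workers=COMPUTE_THREADS)
io_pool = ThreadPoolExecutor(max_workers=IO_THREADS)

sent = []

def send_grad(name, block_id):
    # Stand-in for the real RPC that ships one gradient block to a PServer.
    sent.append((name, block_id))

# Schedule many sends on the IO pool without blocking the compute pool.
futures = [io_pool.submit(send_grad, "w1@GRAD", i) for i in range(8)]
wait(futures)
io_pool.shutdown()
compute_pool.shutdown()
print(len(sent))  # -> 8
```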

### Parameter Server

<img src="./src/async_pserver.png" width="750"/>

- Multiple Trainer instances may want to optimize the same parameter at the same time; to avoid races, we need one `BlockingQueue` for each gradient variable so the incoming gradients are processed one by one.
- We need a `Map` structure that maps a gradient variable name to the `OptimizeBlock` which can optimize the respective parameter.
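
A minimal sketch of the PServer side, using Python's `queue.Queue` as the `BlockingQueue` and a dict as the `Map` from gradient name to its optimize block. All names, the learning rate, and the SGD rule here are illustrative assumptions, not the actual Fluid implementation:

```python
import queue
import threading

params = {"w1": 1.0, "w2": 2.0}
lr = 0.5  # toy learning rate chosen so results are exact

# Map: gradient variable name -> the "OptimizeBlock" for its parameter.
def make_sgd_block(param_name):
    def optimize(grad):
        params[param_name] -= lr * grad
    return optimize

optimize_blocks = {"w1@GRAD": make_sgd_block("w1"),
                   "w2@GRAD": make_sgd_block("w2")}

# One BlockingQueue per gradient variable: concurrent trainers enqueue,
# a single worker dequeues, so each parameter is updated one at a time.
grad_queues = {name: queue.Queue() for name in optimize_blocks}

def worker(name):
    while True:
        grad = grad_queues[name].get()
        if grad is None:  # sentinel: no more gradients
            return
        optimize_blocks[name](grad)

threads = [threading.Thread(target=worker, args=(n,)) for n in grad_queues]
for t in threads:
    t.start()

# Two trainers send gradients for w1; one trainer sends a gradient for w2.
for g in [1.0, 1.0]:
    grad_queues["w1@GRAD"].put(g)
grad_queues["w2@GRAD"].put(2.0)

for q in grad_queues.values():
    q.put(None)
for t in threads:
    t.join()

print(params)  # -> {'w1': 0.0, 'w2': 1.0}
```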
