# Design Doc: Asynchronous Update With Distributed Training

## Background

In typical synchronous distributed training, the significant steps are as follows:

1. A Trainer computes the gradients and SENDs them to the Parameter Server (PServer) nodes.
1. After a PServer node has received the gradients from all the Trainers, it applies them to the respective parameters, using an optimization algorithm (SGD, Momentum, ...) to update them.
1. The Trainers wait for the PServers to finish the optimization stage, then GET the updated parameters from the PServers, so all the Trainers hold the same parameters.

In synchronous distributed training, there must be a `Barrier` to synchronize the parameters after the optimization stage. The performance of a distributed training job is therefore bounded by its slowest node; with hundreds or thousands of training nodes in a job, synchronous distributed training can become very slow because of stragglers, as the simulation below illustrates. This design doc introduces an approach to implement *asynchronous* distributed training in PaddlePaddle Fluid.
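
To make the cost of the `Barrier` concrete, here is a small self-contained Python simulation (illustrative only; none of this is Fluid code): each thread stands in for a Trainer, and every thread must wait for the slowest one before starting the next mini-batch.

```python
# Self-contained simulation of the synchronous barrier using threads; the
# random sleeps are stand-ins for uneven trainer speed, not Fluid code.
import random
import threading
import time

NUM_TRAINERS = 4
barrier = threading.Barrier(NUM_TRAINERS)  # released only when all trainers arrive

def trainer(rank):
    time.sleep(random.uniform(0.1, 1.0))   # simulate uneven compute speed
    print(f"trainer {rank}: gradients sent, waiting at barrier")
    barrier.wait()                         # the whole job waits for the slowest node
    print(f"trainer {rank}: got updated parameters, next mini-batch")

threads = [threading.Thread(target=trainer, args=(r,)) for r in range(NUM_TRAINERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```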

## Design

<img src="./src/async_update.png" width="450"/>

The figure above gives a global view of the asynchronous update process. Using the parameter `w1` as an example, the steps are:
26+
1. For each gradient variables, they may distribute on different GPU card and aggregate
27+
them while they are all calculated.
28+
1. Split the gradient variable into multiple blocks according to the number of PServer
29+
instances and sent them.
30+
1. PServer would run an `Optimize Block` to use a specified optimize algorithm to update
31+
the specified parameter, such as `w1`.
32+
1. The trainer will fetch the latest parameter after PServer finished the optimize stage.
33+
1. Broadcast the received variable into multiple GPU cards and continue to run the next
34+
mini-batch.
35+
36+
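
As a concrete illustration of steps 2 and 4 above, the following minimal NumPy sketch shows how a gradient variable could be split into per-PServer blocks and the fetched parameter blocks reassembled; the shapes and names are illustrative, not Fluid code.

```python
import numpy as np

NUM_PSERVERS = 2

# Step 2: split a gradient variable into one block per PServer instance.
grad_w1 = np.random.rand(10)                    # stand-in for an aggregated gradient
blocks = np.array_split(grad_w1, NUM_PSERVERS)  # two blocks of 5 elements each

# In the real system, each block is SENT to its PServer, whose
# `Optimize Block` updates the corresponding shard of `w1`.

# Steps 4 and 5: after fetching the optimized blocks back, reassemble the
# parameter before broadcasting it to the GPU cards.
w1 = np.concatenate(blocks)
assert w1.shape == grad_w1.shape
```
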
### Trainer

- We need a new operator named `RemoteOptimize` to send gradients to multiple PServer instances and to fetch the latest parameters.
- There could be a large number of gradient variables to send, so we need a separate thread pool (an IO thread pool) whose number of schedulable threads is larger than that of the computing thread pool, to avoid competing with the computation for thread resources. A minimal sketch of this idea follows the list.
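
The sketch below uses Python's standard `concurrent.futures` purely for illustration; the real `RemoteOptimize` operator would live inside Fluid, and `send_gradient` here is a hypothetical placeholder for its RPC.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools: IO gets more schedulable threads than compute, so that
# sending many gradient variables never starves the computation.
io_pool = ThreadPoolExecutor(max_workers=16)      # network SEND/GET
compute_pool = ThreadPoolExecutor(max_workers=4)  # operator execution

def send_gradient(name, block):
    # Hypothetical placeholder for the RPC performed by RemoteOptimize.
    print(f"sending {name} ({len(block)} elements)")

# Gradients are handed to the IO pool and sent concurrently; the compute
# pool stays free to run the next mini-batch.
futures = [io_pool.submit(send_gradient, f"w{i}", [0.0] * 1024) for i in range(8)]
for f in futures:
    f.result()
```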

### Parameter Server

<img src="./src/async_pserver.png" width="750"/>

- Multiple Trainer instances may try to optimize the same parameter at the same time; to avoid corrupting it, we need one `BlockingQueue` for each gradient variable, so that its incoming gradients are processed one by one.
- We need a `Map` structure that maps a gradient variable name to the `OptimizeBlock` which can optimize the respective parameter. A minimal sketch of both structures follows this list.
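
The following Python sketch shows the per-gradient queue and the name-to-`OptimizeBlock` map working together; the real implementation would live in the PServer's code, and the plain-SGD optimize functions here are stand-ins.

```python
import queue
import threading

# One BlockingQueue per gradient variable, so concurrent updates to the
# same parameter are serialized and processed one by one.
grad_queues = {"w1": queue.Queue(), "w2": queue.Queue()}

# Map from gradient variable name to the OptimizeBlock that updates the
# respective parameter; plain SGD stands in for the real optimize blocks.
params = {"w1": 1.0, "w2": 2.0}

def make_sgd_block(name, lr=0.01):
    def optimize(grad):
        params[name] -= lr * grad
    return optimize

optimize_blocks = {name: make_sgd_block(name) for name in params}

def pserver_worker(name):
    # Each worker drains its own queue, so different parameters are
    # optimized concurrently while each single parameter stays race-free.
    while True:
        grad = grad_queues[name].get()
        if grad is None:          # shutdown signal
            break
        optimize_blocks[name](grad)

workers = [threading.Thread(target=pserver_worker, args=(n,)) for n in grad_queues]
for w in workers:
    w.start()

# Two trainers pushing gradients for the same parameter at the same time:
grad_queues["w1"].put(0.5)
grad_queues["w1"].put(0.3)
for q in grad_queues.values():
    q.put(None)
for w in workers:
    w.join()
print(params["w1"])   # updated once per received gradient, in order
```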