Commit cb7891a: Add large model design doc (1 parent 5a159f3)

# Design Doc: Large Model

## Abstract

We propose an approach to support large parameters. For an embedding layer,
the parameter may be too large to be stored in one trainer's memory. In this
approach, a Trainer would prefetch a sliced parameter from different Parameter
Server instances according to the input `Ids`, then run forward and backward,
and send the gradients to the Parameter Servers to execute the optimize
program.
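
As a rough illustration of the scale involved, the snippet below estimates the
memory footprint of one embedding table. The vocabulary size and embedding
width are illustrative assumptions for this sketch, not numbers from this
design.

```python
# Back-of-envelope memory estimate for an embedding parameter.
# The vocabulary size and embedding width are illustrative assumptions.
vocab_size = 100_000_000      # 1e8 ids
embedding_dim = 512
bytes_per_float = 4           # float32

size_gib = vocab_size * embedding_dim * bytes_per_float / 2**30
print(f"embedding table: about {size_gib:.0f} GiB")  # ~191 GiB, far beyond one trainer's memory
```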

## Design

Fluid large model distributed training uses the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
a large parameter into multiple sliced parameters stored on the Parameter
Servers, and the Trainer would prefetch them through the `RPC` interface.

### Split Large Parameter

<img src="src/split_parameter.png" width="400" />

**Distributed Transpiler** would split the large parameter (weight) into
several sliced parameters (weight_0, weight_1, weight_2), as shown in the
figure above and sketched below.
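
A minimal sketch of the row-wise split, assuming an even block-split rule;
`split_rows` and `locate_row` are hypothetical helpers for illustration, not
the transpiler's actual code.

```python
# Illustrative row-wise split of a large weight across 3 Parameter Servers,
# mimicking the weight -> weight_0, weight_1, weight_2 split in the figure.
# The even block-split rule is an assumption made for this sketch.

def split_rows(total_rows, num_pservers):
    """Return (start, end) row ranges, one per sliced parameter."""
    block = (total_rows + num_pservers - 1) // num_pservers  # ceiling division
    return [(i * block, min((i + 1) * block, total_rows))
            for i in range(num_pservers)]

def locate_row(row_id, ranges):
    """Map a global row id to (pserver index, local row id) in its slice."""
    for pserver_idx, (start, end) in enumerate(ranges):
        if start <= row_id < end:
            return pserver_idx, row_id - start
    raise IndexError(row_id)

ranges = split_rows(total_rows=10, num_pservers=3)
print(ranges)                 # [(0, 4), (4, 8), (8, 10)]
print(locate_row(6, ranges))  # (1, 2): row 6 lives in weight_1 at local row 2
```
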
### Prefetch Parameters from Parameter Servers

<img src="src/prefetch_parameters.png" width="400" />

- The `PrefetchRpc` operator would send the row indices to the multiple
  Parameter Servers, and then receive the `SelectedRows`.
- The difference from normal Fluid distributed training is that we only
  prefetch the rows needed by the input `Ids` instead of the whole parameter,
  as illustrated by the sketch below.
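
A minimal sketch of the prefetch step, assuming the same block placement as
above; each "Parameter Server" is just a dict, and the inner loop stands in
for the `PrefetchRpc` round trips. The result mimics a `SelectedRows` as a
pair of row ids and row values.

```python
# Toy prefetch: group the input ids by owning Parameter Server, fetch only
# those rows, and assemble a SelectedRows-like result (row ids + row values).
# Each "Parameter Server" here is just a dict: local_row -> row vector.

def prefetch(ids, pserver_tables, ranges):
    grouped = {}                      # pserver index -> list of global ids
    for row_id in ids:
        for idx, (start, end) in enumerate(ranges):
            if start <= row_id < end:
                grouped.setdefault(idx, []).append(row_id)
                break

    rows, values = [], []
    for idx, id_list in grouped.items():
        start = ranges[idx][0]
        # Stand-in for one PrefetchRpc round trip to Parameter Server `idx`.
        for row_id in id_list:
            rows.append(row_id)
            values.append(pserver_tables[idx][row_id - start])
    return rows, values               # SelectedRows-like pair

ranges = [(0, 4), (4, 8), (8, 10)]
tables = [{r: [float(s + r)] * 2 for r in range(e - s)} for s, e in ranges]
rows, values = prefetch([1, 9, 5], tables, ranges)
print(rows, values)  # [1, 9, 5] [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]] -- only 3 rows fetched
```
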
## TODO

- Async Update

  To avoid slow nodes, asynchronous update is important for distributed
  training; we need a design doc for it and will implement it in the future.