# Design Doc: Prefetching Parameters from Parameter Server

## Abstract

We propose an approach to prefetch parameters from the Parameter
Server during distributed training, so that Fluid can train a model
whose large parameters cannot be stored in one trainer's memory.

## Background

For an embedding layer, the trainable parameter may be very large and
cannot be stored in one trainer's memory. In Fluid distributed training,
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) would split every parameter into a number of small
parameters stored on the Parameter Server, so we could prefetch the parameter
from the specified Parameter Server according to the input `Ids`.

## Design

Prefetching is a feature of Fluid distributed training; you may want
to read [Distributed Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md) before reading the following content.

### Partitioned Parameters

<img src="src/split_parameter.png" width="400" />

- **Distributed Transpiler** would split the large parameter
(weight) into some partitioned parameters (weight_0, weight_1, weight_2) as the
figure above shows.
- We could use `round-robin` to distribute the partitioned parameters, as sketched below.
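
The following is a minimal, framework-free Python sketch of this partitioning idea,
assuming a row-wise split of an embedding weight; the block and server counts and the
variable names are illustrative, not part of the Fluid API.

```python
import numpy as np

NUM_PSERVERS = 2   # illustrative number of Parameter Server instances
NUM_BLOCKS = 4     # split the weight into more blocks than servers
VOCAB_SIZE = 8
EMBED_DIM = 4

# The full embedding weight; in practice it is too large for one trainer.
weight = np.arange(VOCAB_SIZE * EMBED_DIM, dtype=np.float32).reshape(VOCAB_SIZE, EMBED_DIM)

# Split row-wise into weight_0 .. weight_3.
blocks = np.array_split(weight, NUM_BLOCKS, axis=0)

# Round-robin placement: block i goes to Parameter Server i % NUM_PSERVERS.
placement = {i: [] for i in range(NUM_PSERVERS)}
for block_id, block in enumerate(blocks):
    placement[block_id % NUM_PSERVERS].append((block_id, block))

for pserver_id, owned in placement.items():
    names = ["weight_%d" % block_id for block_id, _ in owned]
    print("pserver %d stores: %s" % (pserver_id, ", ".join(names)))
```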

### Prefetching Parameters

<img src="src/prefetch_parameters.png" width="400" />

- `prefetch_rpc` operator would prefetch the parameters from the different Parameter
  Servers according to the input `Ids`. We use [SelectedRows](../../../design/selected_rows.md)
  as the received variable type.
- `merge_selected_rows` operator would merge the received parameters into one
  `SelectedRows` variable, as sketched below.
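
To make the data flow concrete, here is a minimal Python sketch of the prefetch-and-merge
step. `prefetch` stands in for the RPC that `prefetch_rpc` would issue, a `SelectedRows`
variable is modelled as a plain `(rows, values)` pair, and the server layout is assumed;
none of these names are the real Fluid operators or classes.

```python
import numpy as np

EMBED_DIM = 4

# Assume each Parameter Server owns a contiguous row range of the embedding weight:
# pserver_id -> (row_begin, row_end, block of shape [row_end - row_begin, EMBED_DIM]).
pserver_blocks = {
    0: (0, 4, np.zeros((4, EMBED_DIM), dtype=np.float32)),
    1: (4, 8, np.ones((4, EMBED_DIM), dtype=np.float32)),
}

def prefetch(ids):
    """Stand-in for `prefetch_rpc`: group the input Ids by the server that owns them
    and 'receive' one SelectedRows-like (rows, values) pair per server."""
    received = []
    for begin, end, block in pserver_blocks.values():
        rows = [i for i in ids if begin <= i < end]
        if rows:
            values = np.stack([block[i - begin] for i in rows])
            received.append((rows, values))
    return received

def merge_selected_rows(received):
    """Stand-in for `merge_selected_rows`: concatenate the per-server results
    into a single (rows, values) pair."""
    rows = sum((r for r, _ in received), [])
    values = np.concatenate([v for _, v in received], axis=0)
    return rows, values

ids = [1, 6, 2, 5]
rows, values = merge_selected_rows(prefetch(ids))
print(rows)          # [1, 2, 6, 5] (grouped by owning server)
print(values.shape)  # (4, 4)
```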

## TODO

- `prefetch_rpc` operator to send row indices and receive `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as the input `Weight`
  (see the sketch after this list).
- Async update. To avoid slow nodes, asynchronous update is important for distributed
  training; we need a design doc for it and will implement it in the future.
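
For the `lookup_table` item, the sketch below illustrates what looking up embeddings from a
`SelectedRows`-style weight (only the prefetched rows materialized, plus their global row ids)
could look like; it is an assumed illustration, not the actual operator implementation.

```python
import numpy as np

EMBED_DIM = 4

# A SelectedRows-like weight: only the prefetched rows are materialized,
# and `prefetched_rows[i]` is the global row id of `prefetched_values[i]`.
prefetched_rows = [1, 2, 6, 5]
prefetched_values = np.arange(len(prefetched_rows) * EMBED_DIM,
                              dtype=np.float32).reshape(-1, EMBED_DIM)

def lookup_table(rows, values, ids):
    """Look up embeddings for `ids` from the row-sparse (SelectedRows-like) weight."""
    row_index = {row: slot for slot, row in enumerate(rows)}  # global id -> local slot
    return np.stack([values[row_index[i]] for i in ids])

out = lookup_table(prefetched_rows, prefetched_values, ids=[6, 1])
print(out.shape)  # (2, 4)
```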