
Commit 1b4db80

update by lookup remote table
1 parent 5948fd2 commit 1b4db80

7 files changed (+23 / -14 lines)


doc/fluid/design/dist_train/prefetch_parameter.md

Lines changed: 23 additions & 14 deletions
@@ -1,36 +1,45 @@
-# Design Doc: Prefetching Parameter From Parameter Server
+# Design Doc: Lookup Remote Table during Distributed Training

 ## Abstract

-We propose an approach to pre-fetch the parameters from a Parameter Server while distributed training so that Fluid is able to train a model with a large number of parameters that cannot be stored in one trainer's memory.
+We propose an approach to pre-fetch parameters from the Parameter Servers during distributed training, so that Fluid can train a model whose parameters are too large to be stored in one trainer's memory.

 ## Background

-For an embedding layer, the number of trainable parameters may be very large and it is likely that they may not be able to be stored in one trainer's memory. In Fluid distributed training,
-the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) would split every parameter into a number of small parameters that are stored on the Parameter Server. Hence, we can pre-fetch the parameters from the specified Parameter Server using the input `Ids`.
+For an embedding layer, the trainable parameter may be very large and may not fit in one trainer's memory. In Fluid distributed training,
+the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every parameter into several small parameters that are stored on the Parameter Servers. Hence, we can pre-fetch the needed pieces of the parameter from the specified Parameter Servers using the input `Ids`.

 ## Design

 Prior to reading this design, it would be useful for the reader to make themselves familiar with Fluid [Distributed Training Architecture](./distributed_architecture.md) and
 [Parameter Server](./parameter_server.md).

-### Partationed Parameter
+The execution of `lookup local table` is as follows:

-<img src="src/split_parameter.png" width="400" />
+<img src="src/lookup_local_table.png" width="400" />

-- **Distributed Transpiler** would split the large parameters
-(`weight`) into some partitioned parameters (`weight_0`, `weight_1`, `weight_2`) as shown in the
-figure above.
-- We can use `round-robin` to distribute the partitioned parameter.
+In some cases the parameter (`weight`) may be very large, e.g. 10 billion features, so the entire
+table cannot be stored in one trainer's memory. We therefore need to partition this parameter and
+pre-fetch the needed rows at the beginning of each mini-batch; we call this `lookup remote table`:

-### Pre-fetching Parameters
+<img src="src/lookup_remote_table.png" width="400">

-<img src="src/prefetch_parameters.png" width="400" />
+The processing flow of `lookup remote table` is as follows:

-- `prefetch_rpc` operator would prefetch the parameter from different Parameter
+1. Partition the parameter.
+
+   <img src="src/split_parameter.png" width="400" />
+
+   - The **Distributed Transpiler** splits the large parameter
+   (`weight`) into partitioned parameters (`weight_0`, `weight_1`, `weight_2`), as shown in the figure above.
+   - We can use `round-robin` to distribute the partitioned parameters.
+
+1. Pre-fetch the parameter at the beginning of each mini-batch.
+
+   - The `prefetch_rpc` operator prefetches the needed parameter rows from the different Parameter
 Servers using the input `Ids`. We use [SelectedRows](../../../design/selected_rows.md)
 as the received variable type.
-- `merge_selected_rows` operator would merge the received parameters into one
+   - The `merge_selected_rows` operator merges the received parameters into one
 `SelectedRows` variable.

 ## TODO
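
To make the round-robin partitioning described in step 1 of the new design concrete, here is a minimal Python sketch. It is not the Fluid transpiler: the helper name `split_round_robin`, the per-row placement rule, and the toy sizes are assumptions for illustration only.

```python
# Minimal sketch, not Fluid code: round-robin placement of embedding rows on
# parameter servers. `split_round_robin` is a hypothetical helper name.
import numpy as np

def split_round_robin(weight, num_pservers):
    """Assign row i of `weight` to parameter server i % num_pservers,
    producing shards analogous to weight_0, weight_1, weight_2 above."""
    shards = [[] for _ in range(num_pservers)]
    row_ids = [[] for _ in range(num_pservers)]
    for i, row in enumerate(weight):
        server = i % num_pservers          # round-robin placement
        shards[server].append(row)
        row_ids[server].append(i)
    return [np.asarray(s) for s in shards], row_ids

# Toy example: a 10 x 4 embedding table split across 3 parameter servers.
weight = np.arange(40, dtype=np.float32).reshape(10, 4)
shards, row_ids = split_round_robin(weight, num_pservers=3)
print([s.shape for s in shards])  # [(4, 4), (3, 4), (3, 4)]
print(row_ids[0])                 # rows held by pserver 0: [0, 3, 6, 9]
```

The real transpiler may split the parameter into larger blocks rather than single rows; only the round-robin placement rule is the point of the sketch.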
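
Similarly, a rough sketch of what the `prefetch_rpc` and `merge_selected_rows` steps of step 2 do for one mini-batch. The RPC is replaced by a local dictionary lookup and `SelectedRows` is modeled as a plain (rows, values) pair, so everything except the operator behavior being mimicked is a simplification.

```python
# Minimal sketch, not the real operators: group the mini-batch ids by the
# server that owns them, "fetch" only those rows, and merge the per-server
# results into one (rows, values) table, which is roughly what a
# SelectedRows variable holds.
from collections import defaultdict
import numpy as np

def lookup_remote_table(ids, shards_by_server, num_pservers):
    """Per mini-batch: route ids to their owning servers (the prefetch_rpc
    step), then concatenate the results (the merge_selected_rows step)."""
    ids_by_server = defaultdict(list)
    for i in sorted(set(ids)):                  # deduplicate the input Ids
        ids_by_server[i % num_pservers].append(i)

    merged_rows, merged_values = [], []
    for server, server_ids in sorted(ids_by_server.items()):
        table = shards_by_server[server]        # stands in for one pserver
        for i in server_ids:                    # stands in for the RPC fetch
            merged_rows.append(i)
            merged_values.append(table[i])
    return merged_rows, np.asarray(merged_values)

# Toy setup: 3 "parameter servers", each holding the rows it owns.
num_pservers = 3
weight = np.arange(40, dtype=np.float32).reshape(10, 4)
shards_by_server = [
    {i: weight[i] for i in range(10) if i % num_pservers == s}
    for s in range(num_pservers)
]
rows, values = lookup_remote_table([1, 7, 7, 2], shards_by_server, num_pservers)
print(rows)          # [1, 7, 2] -- only the rows touched by this batch
print(values.shape)  # (3, 4)
```

Only the rows touched by the current mini-batch are fetched, which is what makes training with a table of billions of features feasible on a single trainer.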
The remaining changed files are binary images (297 KB and 284 KB among them); their contents are not shown.

0 commit comments
