
Commit f839e91

committed: update by comment
1 parent b382747 commit f839e91

4 files changed: +28 -24 lines changed

# Design Doc: Prefetching Parameter From Parameter Server

## Abstract

We propose an approach to prefetch parameters from the Parameter
Server during distributed training, so that Fluid can train a model
whose parameters are too large to be stored in one trainer's memory.

## Background

For an embedding layer, the trainable parameter may be very large and could
not be stored in one trainer's memory. In Fluid distributed training, the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) would split every parameter into a number of smaller
parameters stored on the Parameter Servers, so we could prefetch the parameter
from the specified Parameter Server according to the input `Ids`.
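
To make the background concrete, a back-of-the-envelope estimate is sketched below; the vocabulary size, embedding width, and data type are hypothetical and only illustrate the order of magnitude:

```python
# Rough memory estimate for one large embedding table (hypothetical sizes).
vocab_size = 100_000_000   # number of embedding rows, e.g. distinct sparse ids
embedding_dim = 512        # width of each embedding row
bytes_per_element = 4      # float32

table_bytes = vocab_size * embedding_dim * bytes_per_element
print("embedding table size: %.1f GiB" % (table_bytes / 2 ** 30))
# ~190.7 GiB, far more than a single trainer's memory, which is why the table
# must be partitioned across Parameter Server instances and prefetched by `Ids`.
```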

## Design

This is a feature of Fluid distributed training; you may want
to read [Distributed Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md) before reading the following content.

### Partitioned Parameter

<img src="src/split_parameter.png" width="400" />

- **Distributed Transpiler** would split the large parameter
(weight) into some partitioned parameters (weight_0, weight_1, weight_2) as
shown in the figure above.
- We could use `round-robin` to distribute the partitioned parameters; see the
sketch after this list.
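
A minimal sketch of one way the round-robin split could work, using plain NumPy arrays instead of real Fluid variables; the helper `split_rows_round_robin` and the rule "row i goes to server i % N" are assumptions for illustration, not the transpiler's actual splitting rule:

```python
import numpy as np

def split_rows_round_robin(weight, num_pservers):
    """Assign row i of `weight` to Parameter Server i % num_pservers."""
    return [weight[s::num_pservers] for s in range(num_pservers)]

# A toy "large" parameter: 10 rows x 4 columns.
weight = np.arange(40, dtype=np.float32).reshape(10, 4)

# weight_0, weight_1, weight_2 would live on three Parameter Server instances.
weight_0, weight_1, weight_2 = split_rows_round_robin(weight, num_pservers=3)

# Row id 7 is stored on server 7 % 3 == 1, at local row 7 // 3 == 2.
assert (weight_1[7 // 3] == weight[7]).all()
```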

### Prefetching Parameter

<img src="src/prefetch_parameters.png" width="400" />

- The `prefetch_rpc` operator would prefetch the parameters from the different
Parameter Server instances according to the input `Ids`; we use [SelectedRows](../../../design/selected_rows.md)
as the received variable type.
- The `merge_selected_rows` operator would merge the received parameters into one
`SelectedRows` variable; see the sketch after this list.
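
The pure-Python sketch below only illustrates what `prefetch_rpc` and `merge_selected_rows` accomplish together; the RPC is faked with a local function, `SelectedRows` is modeled as a `(rows, values)` pair, and the row-to-server rule follows the round-robin assumption above, so none of this is the real operator implementation:

```python
import numpy as np

NUM_PSERVERS = 3

# Fake the per-server shards from the previous section
# (row i of the full table lives on server i % NUM_PSERVERS).
full_weight = np.arange(40, dtype=np.float32).reshape(10, 4)
shards = [full_weight[s::NUM_PSERVERS] for s in range(NUM_PSERVERS)]

def fake_prefetch_rpc(server_id, ids):
    """Stand-in for prefetch_rpc: return a SelectedRows-like (rows, values) pair."""
    values = np.stack([shards[server_id][i // NUM_PSERVERS] for i in ids])
    return list(ids), values

def merge_selected_rows(partial_results):
    """Stand-in for merge_selected_rows: concatenate the partial results."""
    rows = [r for part_rows, _ in partial_results for r in part_rows]
    values = np.concatenate([part_values for _, part_values in partial_results])
    return rows, values

input_ids = [7, 2, 9, 2]
# 1. Group the input `Ids` by the server that owns each row.
ids_per_server = {s: [i for i in input_ids if i % NUM_PSERVERS == s]
                  for s in range(NUM_PSERVERS)}
# 2. "prefetch_rpc": fetch the owned rows from every server that has work.
partials = [fake_prefetch_rpc(s, ids) for s, ids in ids_per_server.items() if ids]
# 3. "merge_selected_rows": combine everything into one SelectedRows-like variable.
rows, values = merge_selected_rows(partials)
assert (values[rows.index(7)] == full_weight[7]).all()
```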

## TODO

- The `prefetch_rpc` operator, to send row indices and receive `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as the input
`Weight`, as sketched below.
- Async update: to avoid slow nodes, async update is important for distributed
training; we need a design doc for it and will implement it in the future.
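
As a rough sketch of what the `lookup_table` item might mean in practice (again with `SelectedRows` modeled as a hypothetical `(rows, values)` pair rather than the real variable type), the lookup only needs to map each input id to its position in the prefetched rows:

```python
import numpy as np

def lookup_table_selected_rows(ids, selected_rows):
    """Look ids up in a SelectedRows-like weight produced by the prefetch step."""
    rows, values = selected_rows
    index_of = {row_id: pos for pos, row_id in enumerate(rows)}
    return values[[index_of[i] for i in ids]]

prefetched = ([9, 7, 2],
              np.array([[0.9] * 4, [0.7] * 4, [0.2] * 4], dtype=np.float32))
emb = lookup_table_selected_rows([7, 2, 7], prefetched)
assert emb.shape == (3, 4) and emb[0, 0] == np.float32(0.7)
```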
