Commit 285e7ac

merge fluid lookup table into abacus one

1 parent 1c5616b, commit 285e7ac
10 files changed: +27 additions, -50 deletions

doc/fluid/design/dist_train/distributed_lookup_table_design.md

Lines changed: 27 additions & 0 deletions
@@ -119,6 +119,33 @@ optimization algorithm $f$ runs on the storage service.
- Con: the storage service needs to be able to run the optimization
  algorithm.

## Distributed Sparse Table in Fluid

As an alternative design, we can implement a distributed sparse table in Fluid,
so that we do not need to maintain an external storage component during training.

Before reading this design, it would be useful for the reader to become
familiar with the Fluid [Distributed Training Architecture](./distributed_architecture.md)
and [Parameter Server](./parameter_server.md) designs.

![fluid lookup remote table](./src/fluid_lookup_remote_table.png)

Partitioning a large table across multiple pserver instances:
1. `DistributeTranspiler` splits the large table into smaller table blocks,
using a partitioning algorithm such as
[RoundRobin](https://en.wikipedia.org/wiki/Round-robin_scheduling) or
[Hash](https://en.wikipedia.org/wiki/Hash).
1. In some cases, the range of the input `Ids` is very wide and unpredictable,
so the sparse table must be able to initialize the value for a previously
unseen id with zeros, or with samples from a uniform or Gaussian distribution.
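As an illustration only, the partitioning and lazy-initialization behavior described above can be sketched in plain Python. The helper names, the block class, and its parameters are hypothetical, not the actual `DistributeTranspiler` or pserver API:

```python
import random

# Hypothetical sketch: shard ids across pserver table blocks and create
# rows lazily for unseen ids. Not the real DistributeTranspiler code.

def hash_shard(ids, num_pservers):
    """Assign each id to a table block by hashing (id % num_pservers)."""
    blocks = {i: [] for i in range(num_pservers)}
    for id_ in ids:
        blocks[id_ % num_pservers].append(id_)
    return blocks

def round_robin_shard(ids, num_pservers):
    """Assign ids to table blocks in round-robin order of arrival."""
    return {i: ids[i::num_pservers] for i in range(num_pservers)}

class SparseTableBlock:
    """One table block on a pserver; a row for a previously unseen id is
    filled on first lookup with zeros, uniform, or Gaussian values."""

    def __init__(self, dim, init="zero", seed=0):
        self.dim = dim
        self.init = init
        self.rng = random.Random(seed)
        self.rows = {}

    def lookup(self, id_):
        if id_ not in self.rows:
            if self.init == "zero":
                row = [0.0] * self.dim
            elif self.init == "uniform":
                row = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
            else:  # "gaussian"
                row = [self.rng.gauss(0.0, 0.01) for _ in range(self.dim)]
            self.rows[id_] = row
        return self.rows[id_]
```

Hash partitioning keeps the owner of an id computable from the id itself, which is what lets each trainer route a lookup without a directory service.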
For each trainer's training process:
1. In the forward pass, use a `pre-fetch` op, instead of the local
`lookup_table` op, to pre-fetch the parameter blocks from the PServers
according to the input `Ids`, and then merge the blocks into a parameter `W`.
1. In the backward pass, compute `GRAD@W'` using the pre-fetched `W`, and
send it to the PServers to execute the optimize pass.
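The trainer-side flow above can be sketched as follows. This is a self-contained, simplified model (not Fluid's actual ops): each pserver block is a dict mapping id to row, ids are routed by `id % num_blocks`, unseen ids default to zero rows, and the pserver-side optimize pass is plain SGD for illustration:

```python
# Hypothetical sketch of the trainer flow: prefetch rows, merge into W,
# then send sparse gradients back. Names are illustrative, not Fluid APIs.

DIM = 2  # illustrative row width

def prefetch(ids, blocks):
    """Forward pass: fetch each id's row from its owning block and merge
    the rows, in input order, into the parameter W."""
    n = len(blocks)
    return [blocks[i % n].setdefault(i, [0.0] * DIM) for i in ids]

def send_sparse_grad(ids, grad_w, blocks, lr=0.1):
    """Backward pass: route each row of GRAD@W' to the block owning the
    id; the block applies the optimize step (plain SGD here)."""
    n = len(blocks)
    for i, g in zip(ids, grad_w):
        row = blocks[i % n][i]
        blocks[i % n][i] = [w - lr * gj for w, gj in zip(row, g)]
```

Only the rows named by `Ids` ever cross the network in either direction, which is the point of the sparse design: the full table never materializes on any single trainer.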
## Conclusion

Let us do the "storage service does not optimize" solution first, as a

doc/fluid/design/dist_train/prefetch_parameter.md

Lines changed: 0 additions & 50 deletions
This file was deleted.
