# Design Doc: Large Model

## Abstract

We propose an approach to support large parameters.
The parameter of an embedding layer may be very large and may not fit in
a single trainer's memory. In this approach, a Trainer
prefetches sliced parameters from different Parameter Server instances
according to the input `Ids`, then runs the forward and backward passes and sends
the gradients to the Parameter Servers to execute the optimize program.

## Design

Fluid large model distributed training uses the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
a large parameter into multiple sliced parameters stored on the Parameter Servers;
the Trainer then prefetches them through the `RPC` interface.

### Split Large Parameter

<img src="src/split_parameter.png" width="400" />

**Distributed Transpiler** splits the large parameter
(weight) into several sliced parameters (weight_0, weight_1, weight_2), as shown in the
figure above.
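
The sketch below shows one way the row-wise split boundaries could be computed; the
function name `split_rows` and the concrete sizes are illustrative assumptions, not
the actual Distributed Transpiler implementation.

```python
# A minimal, illustrative row-wise split (hypothetical helper, not Fluid code).
def split_rows(total_rows, num_pservers):
    """Return [start, end) row ranges, one sliced parameter per Parameter Server."""
    rows_per_slice = (total_rows + num_pservers - 1) // num_pservers
    ranges = []
    for i in range(num_pservers):
        start = i * rows_per_slice
        end = min(start + rows_per_slice, total_rows)
        ranges.append((start, end))
    return ranges

# Example: a 100000-row embedding weight split across 3 Parameter Servers becomes
# weight_0, weight_1 and weight_2 holding the row ranges below.
print(split_rows(100000, 3))  # [(0, 33334), (33334, 66668), (66668, 100000)]
```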

### Prefetch Parameters from Parameter Servers

<img src="src/prefetch_parameters.png" width="400" />

- The `PrefetchRpc` operator sends the row indices to the corresponding Parameter Servers
  and then receives the `SelectedRows` back.
- The difference from normal Fluid distributed training is that we only prefetch the rows
  needed by the current input `Ids` instead of the whole parameter; see the sketch after
  this list.
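
A minimal sketch, assuming the row ranges from the split sketch above, of how input
`Ids` could be routed to the Parameter Servers that own the corresponding rows. The
helper `prefetch_requests` and the hard-coded ranges are illustrative assumptions; in
Fluid this routing is done by the `PrefetchRpc` operator, not user-level Python.

```python
# Illustrative routing step only; the real PrefetchRpc operator runs inside Fluid.
def prefetch_requests(ids, ranges):
    """Group input ids by the Parameter Server that owns their rows.

    `ranges` holds the [start, end) row range of each sliced parameter; the result
    maps a Parameter Server index to the local row indices it should look up.
    """
    requests = {i: [] for i in range(len(ranges))}
    for row in sorted(set(ids)):
        for i, (start, end) in enumerate(ranges):
            if start <= row < end:
                requests[i].append(row - start)  # local row index inside weight_i
                break
    return requests

# Example: with the three slices from the split sketch, ids [5, 40000, 99999] go to
# Parameter Servers 0, 1 and 2; each non-empty entry becomes one RPC whose reply
# carries only the requested rows (a SelectedRows).
ranges = [(0, 33334), (33334, 66668), (66668, 100000)]
print(prefetch_requests([5, 40000, 99999], ranges))  # {0: [5], 1: [6666], 2: [33331]}
```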

## TODO

- Async Update

  To avoid the slow-node problem, asynchronous update is important for distributed
  training; we need a design doc and an implementation for it in the future.