
| Status        | (Proposed / Accepted / Implemented / Obsolete) |
|:------------- |:---------------------------------------------- |
| **Author(s)** | Tongxuan Liu ([email protected]), Peng Tao ([email protected]), Langshi Chen ([email protected]) |
-| **Sponsor**   | i |
-| **Updated**   | 2020-04-09 |
+| **Reviewer(s)** | Ayush Dubey ([email protected]), Jeroen Bédorf ([email protected]), Derek Murray ([email protected]), Bairen Yi ([email protected]), Paul Tucker ([email protected]) |
+| **Sponsor**   | |
+| **Updated**   | 2020-04-11 |

## Objective
This RFC proposes a new FuseRecv Op which would receive multiple tensors with
@@ -54,8 +54,8 @@ be 1.5-2x faster in the parameter-server/worker setup.

## Design Proposal

-![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy")
-![Figure 2: Graph partition strategy with FuseRecv](20200409-fuse_recv/graph_partition_strategy_with_fuse_recv.png "Graph partition strategy with FuseRecv")
+![Figure 1: Current graph partition strategy](20200411-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy")
+![Figure 2: Graph partition strategy with FuseRecv](20200411-fuse_recv/graph_partition_strategy_with_fuse_recv.png "Graph partition strategy with FuseRecv")

In the original Recv/Send design, each Recv node receives only one tensor
even if there are Recv Ops that output to the same destination Op. Moreover, each
@@ -82,6 +82,7 @@ Pack the N tensors to be sent into a length-N DT_VARIANT vector.

Pros: Reuses the current RPC; avoids potentially intricate changes in the
zero-copy response buffer code.
+
Cons: Introduces memcpy overhead.

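As a rough illustration of the packing trade-off described above, here is a minimal, standalone Python sketch (all names hypothetical; this is not TensorFlow's DT_VARIANT machinery): N tensors are copied into a single length-N vector so one Send/Recv pair can carry them, and that copy step is exactly the memcpy overhead the Cons refers to.

```python
# Illustrative sketch only -- hypothetical helpers, not TensorFlow APIs.
# N tensors are packed into one length-N vector so a single RPC can carry
# them; each element is copied during packing (the memcpy overhead).

def pack(tensors):
    """Copy each tensor's payload into one length-N vector (the extra copy)."""
    return [bytes(t) for t in tensors]

def unpack(packed):
    """Recover the individual tensors on the receiving side."""
    return [list(p) for p in packed]

tensors = [[1, 2, 3], [4], [5, 6]]
packed = pack(tensors)            # one vector, sent in a single RPC
assert len(packed) == len(tensors)
assert unpack(packed) == tensors  # round-trips losslessly
```

The round trip is lossless, but every byte is touched twice (pack and unpack), which is the cost this solution accepts in exchange for reusing the existing RPC path.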
#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray)
@@ -92,6 +93,7 @@ to reuse some of the graph analysis code

Pros: Reuses the current RPC; avoids potentially intricate changes in the
zero-copy response buffer code.
+
Cons: The fused tensors could be of different types and dynamic shapes,
which this solution could not handle.

@@ -118,14 +120,15 @@ missing.

Pros: Dynamic fusion at runtime seems to get better results, and it also brings
the ability to control the priority of tensors (i.e., which Recv is more
important).
+
Cons: A potential bottleneck of this solution is the time window of the ready
set. It would differ greatly across models, so setting the value manually
would be hard. This solution is another good candidate for FuseRecv.

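The ready-set idea above can be sketched as follows. This is a toy model under stated assumptions (logical timestamps, a fixed `WINDOW` constant, hypothetical names; not the runtime's actual scheduler): Recv requests that become ready within the same window are flushed together as one fused RPC, and the window length is the hard-to-tune knob the Cons mentions.

```python
# Hypothetical illustration of runtime fusion by a ready-set time window.
# WINDOW is the model-dependent knob that is hard to set manually.

WINDOW = 5  # logical time units

def batch_by_window(arrivals):
    """arrivals: list of (logical_time, tensor_name), sorted by time.
    Returns batches; each batch would be flushed as one fused RPC."""
    batches, current, window_start = [], [], None
    for t, name in arrivals:
        if window_start is None or t - window_start >= WINDOW:
            if current:
                batches.append(current)  # flush the previous window
            current, window_start = [], t
        current.append(name)
    if current:
        batches.append(current)
    return batches

arrivals = [(0, "w1"), (2, "w2"), (3, "b1"), (9, "w3"), (11, "b2")]
# Five Recvs collapse into two fused RPCs under this window.
assert batch_by_window(arrivals) == [["w1", "w2", "b1"], ["w3", "b2"]]
```

Shrinking `WINDOW` approaches one RPC per tensor (no fusion benefit); growing it fuses more but delays the earliest ready tensors, which is the latency trade-off the text describes.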
### Performance Implications
With a wide and deep model, the number of RPC calls per step has been reduced
by 55%, and the overall training throughput has increased by 40%.
-![Figure 3: performance_result](20200409-fuse_recv/performance_result.png "Performance result")
+![Figure 3: performance_result](20200411-fuse_recv/performance_result.png "Performance result")

### Dependencies
* None
@@ -187,7 +190,7 @@ the whole graph, replace Recv ops by FuseRecv ops in the partitioned graphs acco
to its topology while iteratively searching and fusing potential Recv
operations. See Figure 4 for the formal algorithm definition.

-![Figure 4: fuse_recv_procedure](20200409-fuse_recv/fuse_recv_procedure.png "Fuse Recv Procedure")
+![Figure 4: fuse_recv_procedure](20200411-fuse_recv/fuse_recv_procedure.png "Fuse Recv Procedure")

The procedure RECVFUSE takes two input arguments: 1) the TF computation
graph g, and 2) a partitioned graph. It is worth noting that the iteration of
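The core of the rewrite sketched around Figure 4 can be illustrated with a much-simplified Python fragment. The data model here is hypothetical (plain `(src_device, tensor_name)` pairs, not TF's GraphDef), and it shows only the grouping step: within one partition, Recv ops pulling from the same source device are merged into a single FuseRecv, so the RPC count drops to one per source device.

```python
# Simplified sketch of the Recv-fusion grouping idea -- not TensorFlow's
# actual graph-partitioning code. Recvs in one partition that share a
# source device are merged into one FuseRecv carrying all their tensors.
from collections import defaultdict

def fuse_recvs(recv_ops):
    """recv_ops: list of (src_device, tensor_name) pairs for one partition.
    Returns one fused-op description per source device."""
    by_src = defaultdict(list)
    for src, tensor in recv_ops:
        by_src[src].append(tensor)
    # One FuseRecv per source device replaces all individual Recvs.
    return {src: ("FuseRecv", tensors) for src, tensors in by_src.items()}

recvs = [("ps:0", "w1"), ("ps:0", "w2"), ("ps:1", "b1")]
fused = fuse_recvs(recvs)
assert fused["ps:0"] == ("FuseRecv", ["w1", "w2"])
assert len(fused) == 2  # RPC count drops from 3 to 2
```

The real pass additionally has to walk the partitioned graphs in topological order, preserve control dependencies, and handle mixed dtypes in the fused op's signature, which is what the iterative search in Figure 4 addresses.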