#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray)

Pack the tensor contents into a single flattened buffer. This would be very
similar to the ScopedAllocator optimization that [email protected] and
[email protected] implemented for collectives, and it might be possible
to reuse some of the graph analysis code.
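To make the packing idea concrete, the following is a rough sketch (the helper
names are invented for illustration and are not part of the proposal) of
flattening several same-dtype tensors into one buffer on the send side and
splitting them back apart on the receive side:

```python
import numpy as np

# Rough sketch of the "single flattened buffer" idea (helper names invented):
# all tensors share one dtype, and their shapes travel as side metadata.
def pack(tensors):
    """Flatten and concatenate the tensors into one contiguous buffer."""
    shapes = [t.shape for t in tensors]
    flat = np.concatenate([t.reshape(-1) for t in tensors])
    return flat, shapes

def unpack(flat, shapes):
    """Split the buffer back into tensors with the original shapes."""
    tensors, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        tensors.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return tensors

a = np.ones((2, 3), dtype=np.float32)
b = np.arange(4, dtype=np.float32)
buf, shapes = pack([a, b])          # one Send instead of two
out_a, out_b = unpack(buf, shapes)  # the receiver reconstructs both tensors
assert np.array_equal(out_a, a) and np.array_equal(out_b, b)
```

The sketch assumes a single shared dtype and statically known shapes, which is
exactly the limitation noted in the Cons below.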

Pros: Reuses the current RPC and avoids potentially intricate changes in the
zero-copy response buffer code.
Cons: The fused tensors could be of different types and have dynamic shapes,
which this solution could not handle.

#### Dynamic Fusion in runtime (Paul Tucker)

Instead of adding a new FuseRecvTensor method to the Worker interface,
we add a slightly different RecvSomeTensors method. The client sends a
list of keys for which it's ready to receive values to the server and the

For example, on the client side a call to RecvTensor on the local Rendezvous
for a remote value does not necessarily result in an immediate RPC. It might
if the value is expected to be large, but it might also just add the key to
a ready set associated with the remote host. An RPC may not be sent until
the ready set reaches a certain size, or a minimum time has elapsed since the
last RPC against that host was started. When the response is received, any
missing keys go back in the ready set.

method whether to wait for more of the requested values to be ready or just
immediately send what's available now and let the client re-request anything
missing.

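As an illustration of the client-side policy described above, here is a
hypothetical sketch (all names are invented; `send_rpc` stands in for the
actual RecvSomeTensors call) of a ready set that defers the RPC until a size
or time threshold is reached and re-queues any keys the server did not return:

```python
import time

# Hypothetical client-side sketch of the ready-set policy (names invented):
# keys accumulate locally, and an RPC is issued only once the set is large
# enough or a minimum time has passed since the last RPC was started.
class ReadySet:
    def __init__(self, send_rpc, max_keys=32, min_interval_s=0.002):
        self._send_rpc = send_rpc              # stand-in for the RecvSomeTensors RPC
        self._max_keys = max_keys              # size threshold that triggers an RPC
        self._min_interval_s = min_interval_s  # minimum spacing between RPCs
        self._keys = set()
        self._last_rpc = time.monotonic()

    def add(self, key):
        """Called when a local Recv is ready for a remote value."""
        self._keys.add(key)
        self._maybe_send()

    def _maybe_send(self):
        elapsed = time.monotonic() - self._last_rpc
        if len(self._keys) >= self._max_keys or elapsed >= self._min_interval_s:
            requested = list(self._keys)
            self._keys.clear()
            self._last_rpc = time.monotonic()
            received = self._send_rpc(requested)
            # Keys the server did not return go back into the ready set.
            self._keys.update(k for k in requested if k not in received)

# Toy usage: a fake server that only returns half of the requested keys.
rs = ReadySet(send_rpc=lambda keys: set(keys[: len(keys) // 2]), max_keys=4)
for key in ["t0", "t1", "t2", "t3", "t4"]:
    rs.add(key)
```

The `max_keys` and `min_interval_s` values here are arbitrary; tuning this time
window is exactly the difficulty called out in the Cons below.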
Pros: Dynamic fusion at runtime seems to get better results, and it also
brings the ability to control the priority of tensors (i.e. which Recv is
more important).
Cons: A potential bottleneck of this solution is the time window of the
ready set. It would vary widely across models, so manually setting the value
would be hard. This solution is another good candidate for FuseRecv.

### Performance Implications

by 55%, and the overall training throughput has increased by 40%.

### Dependencies
* None

### Engineering Impact
* Engineering impact: Once the feature is (manually) enabled (in ConfigProto.GraphOptions.do_fuse_recv), test times would be longer because the FuseRecv post-partitioned optimizer would traverse and update the graph.
* Maintenance: Minimal maintenance overhead. The TensorFlow team and contributors will keep the documentation up to date. Changes should be reviewed and approved by the TensorFlow team leads.

### Platforms and Environments
* Platforms: The feature is independent of platforms.
* Execution environments (Cloud services, accelerator hardware): The first stage would support CPU & GPU devices. We will consider supporting additional devices as much as possible.

### Best Practices
* We strongly suggest enabling FuseRecv in ranking or matching models such as [W&DL](https://arxiv.org/abs/1606.07792) and [Dien](https://arxiv.org/abs/1809.03672); a configuration sketch follows below.
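For concreteness, here is a minimal sketch of how the proposed switch might be
flipped. `do_fuse_recv` is the field proposed in this RFC and does not exist in
released TensorFlow; the rest is ordinary TF1-style session configuration.

```python
import tensorflow.compat.v1 as tf

# Sketch only: `do_fuse_recv` is the GraphOptions field proposed in this RFC
# (not an existing TensorFlow option); everything else is standard ConfigProto
# plumbing for a distributed session.
config = tf.ConfigProto()
config.graph_options.do_fuse_recv = True  # proposed flag, per this RFC

with tf.Session(config=config) as sess:
    ...  # build and run the distributed model as usual
```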