rfcs/20201121-keras-model-fit-ps.md (8 additions & 9 deletions)
@@ -112,11 +112,11 @@ This section discusses the changes needed to be made in `model` API and assumes
In this design, we propose `model.fit` to take a new type, `tf.data.experimental.DatasetFactory`, instead of a dataset instance (which is [what is currently supported](https://github.com/tensorflow/tensorflow/blob/6b9e35f1c0410607bf956c0f27e5d3e1456b4899/tensorflow/python/keras/engine/training.py#L887-L889)), for the following reasons:
-* With `dataset` instances, there is complication brought by the need of replicating `dataset`s to workers.
+* With `dataset` instances, `dataset`s need to be replicated to remote workers. This is not a well-established path and is thus not yet supported by `ClusterCoordinator`.
-* With `dataset` replication, in the past we have observed more memory consumption, less flexibility for the user's dataset transformations, and suboptimal performance.
+* Even if `dataset` replication were supported, we have in the past observed higher memory consumption, less flexibility for the user's dataset transformations, and suboptimal performance.
-* When using Keras preprocessing layers (KPL), read-only resources are created at layer creation, which end up being placed at the coordinator. However, the `tf.data replicate` API does not allow the resources referenced in the dataset graph to be accessed, once serialized and deserialized, on the remote worker. This prevents the `dataset` instance path from supporting resources, and thus KPLs.
+* When using resources for datasets, the resources end up being placed at the coordinator. However, the `tf.data replicate` API does not allow the resources referenced in the dataset graph to be accessed, once serialized and deserialized, on the remote worker. This prevents the `dataset` instance path from supporting resources, and thus Keras preprocessing layers.
Please see below for the rationale of using a `DatasetFactory` type instead of a simple `callable`.
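
For illustration, here is a hedged sketch of the proposed call site. `tf.data.experimental.DatasetFactory` is the type proposed in this RFC; its exact constructor signature, and the zero-argument `dataset_fn`, are assumptions made for this sketch:

```
import tensorflow as tf

def dataset_fn():
  # Deferring dataset creation to a factory lets each remote worker build its
  # own dataset (and any resources it needs) locally, instead of receiving a
  # replicated, serialized dataset graph from the coordinator.
  x = tf.random.uniform([64, 10])
  y = tf.random.uniform([64, 1])
  return tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Assumed constructor signature for the proposed type.
model.fit(tf.data.experimental.DatasetFactory(dataset_fn),
          epochs=2, steps_per_epoch=4)
```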
@@ -196,8 +196,7 @@ To take advantage of TF2 support of parameter server training, a `ClusterCoordin
class Model(...):

  def fit(self, ...):
-    if (self.distribute_strategy.should_use_with_coordinator and
-        not self.distribute_strategy._cluster_coordinator):
+    if self.distribute_strategy.should_use_with_coordinator:
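
To illustrate the simplification, here is a hedged sketch in the RFC's pseudocode style (not the actual Keras implementation): because the `ClusterCoordinator` constructor now returns the existing per-strategy instance (see below), `fit` no longer needs to check whether a coordinator already exists.

```
class Model(...):

  def fit(self, ...):
    if self.distribute_strategy.should_use_with_coordinator:
      # The constructor returns the per-strategy single instance, so calling
      # it unconditionally cannot spawn a second set of worker threads.
      self._cluster_coordinator = ClusterCoordinator(self.distribute_strategy)
```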
@@ -483,9 +482,9 @@ We propose that `ParameterServerStrategy` has an attribute `should_use_with_coor
    self.should_use_with_coordinator = True
```

-#### `ClusterCoordinator` as a single instance to `Strategy`
+#### `ClusterCoordinator` as a single instance per `Strategy`, obtained at constructor call
-Since a `ClusterCoordinator` instance spins off worker and failure handling threads, there should only be one `ClusterCoordinator` at any given time with a `strategy` instance, and making it a singleton ensures that those threads are only created once. The singleton is accessible through a constructor call:
+Since a `ClusterCoordinator` instance spins off worker and failure-handling threads, there should be at most one `ClusterCoordinator` at any given time for a given `strategy` instance; making it a single instance per `Strategy` ensures that those threads are only created once. We propose that this single instance be returned by the constructor call:

```
class ClusterCoordinator(object):
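
  # (The diff display truncates the class body here.) What follows is a hedged
  # sketch, not the RFC's verbatim code: one way the constructor can return a
  # per-strategy single instance. `_coordinators` is a hypothetical attribute.
  _coordinators = {}  # maps each strategy to its coordinator

  def __new__(cls, strategy):
    # Reuse the existing coordinator for this strategy so that worker and
    # failure-handling threads are spun off only once; __init__ should
    # likewise guard against re-initializing an existing instance.
    if strategy not in cls._coordinators:
      cls._coordinators[strategy] = super().__new__(cls)
    return cls._coordinators[strategy]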
@@ -497,9 +496,9 @@ class ClusterCoordinator(object):
Here, we have created an attribute on `strategy` referencing the `cluster_coordinator`. This is necessary because `Model` only keeps a reference to `strategy`, and this attribute gives `Model` access to the `ClusterCoordinator` instance.
-Being a singleton is important considering there are power users who would like to `schedule` functions themselves in addition to `model.fit` usage. That is, they can instantiate one before `model.fit` does, or use one after `model.fit` has instantiated one. In either case, they should access the same `ClusterCoordinator` instance as the one `model.fit` uses.
+Obtaining the single instance at constructor call is useful because there are power users who would like to `schedule` functions themselves in addition to using `model.fit`, and naturally they would construct a `ClusterCoordinator`. They may do so before `model.fit` does, or after `model.fit` has instantiated one. In either case, they should access the same `ClusterCoordinator` instance as the one `model.fit` uses.
-Obtaining the singleton by calling the constructor of `ClusterCoordinator`, as opposed to an instance getter, provides future-compatibility if we allow multiple `ClusterCoordinator`s in the future.
+Another advantage of using the constructor of `ClusterCoordinator`, as opposed to an instance getter, is that it remains compatible with a future in which we allow multiple `ClusterCoordinator`s.
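
As a usage illustration (a hedged sketch assuming the per-strategy single-instance behavior above; `strategy` and `dataset_factory` are placeholders), a power user's coordinator and the one `model.fit` creates are the same object:

```
# A power user constructs a coordinator before calling model.fit.
coordinator = ClusterCoordinator(strategy)
model.fit(dataset_factory, steps_per_epoch=100, epochs=1)
# fit() constructed one too, but the constructor returned the same
# per-strategy instance, so both refer to one coordinator.
assert coordinator is ClusterCoordinator(strategy)
```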
#### `ClusterCoordinator`’s reference to `ParameterServerStrategy` as a `weakref`