This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit 8a08aee

Some polishing
1 parent 3a0dd12 commit 8a08aee

File tree: 1 file changed (+8 −9 lines)


rfcs/20201121-keras-model-fit-ps.md

Lines changed: 8 additions & 9 deletions
@@ -112,11 +112,11 @@ This section discusses the changes needed to be made in `model` API and assumes
 
 In this design, we propose `model.fit` to take a new type, `tf.data.experimental.DatasetFactory`, instead of a dataset instance (which is [what is currently supported](https://github.com/tensorflow/tensorflow/blob/6b9e35f1c0410607bf956c0f27e5d3e1456b4899/tensorflow/python/keras/engine/training.py#L887-L889)), for the following reasons:
 
-* With `dataset` instances, there is complication brought by the need of replicating `dataset`s to workers.
+* With `dataset` instances, there is a need to replicate `dataset`s to remote workers. This is not a well-established path and is thus not yet supported by `ClusterCoordinator`.
 
-* With `dataset` replication, in the past we have observed more memory consumption, less flexibility for user’s dataset transformation, and suboptimal performance.
+* Even if `dataset` replication were supported, we have in the past observed higher memory consumption, less flexibility in users' dataset transformations, and suboptimal performance.
 
-* When using Keras preprocessing layers (KPL), read-only resources are created at layer creation, which ends up being placed at the coordinator. However, `tf.data replicate` API does not support the resources referenced in the dataset graph to be accessed once serialized and deserialized, in the remotely worker. This prevents the `dataset` instance path from supporting resources, and thus KPLs.
+* When resources are used in datasets, they end up placed on the coordinator. However, the `tf.data` replicate API does not allow resources referenced in the dataset graph to be accessed on the remote worker once serialized and deserialized. This prevents the `dataset` instance path from supporting resources, and thus Keras preprocessing layers.
 
 Please see below for the rationale of using a `DatasetFactory` type instead of a simple `callable`.
@@ -196,8 +196,7 @@ To take advantage of TF2 support of parameter server training, a `ClusterCoordin
 class Model(...):
 
   def fit(self, ...):
-    if (self.distribute_strategy.should_use_with_coordinator and
-        not self.distribute_strategy._cluster_coordinator):
+    if self.distribute_strategy.should_use_with_coordinator:
       cluster_coordinator.ClusterCoordinator(self.distribute_strategy)
     ... # the rest of fit
@@ -483,9 +482,9 @@ We propose that `ParameterServerStrategy` has an attribute `should_use_with_coor
     self.should_use_with_coordinator = True
 ```
 
-#### `ClusterCoordinator` as a single instance to `Strategy`
+#### `ClusterCoordinator` as a single instance per `Strategy`, obtained via a constructor call
 
-Since a `ClusterCoordinator` instance spins off worker and failure handling threads, there should only be one `ClusterCoordinator` at any given time with a `strategy` instance, and making it a singleton ensures that those threads are only created once. The singleton is accessible through a constructor call:
+Since a `ClusterCoordinator` instance spins off worker and failure-handling threads, there should be at most one `ClusterCoordinator` at any given time for a given `strategy` instance; making it a single instance per `Strategy` ensures those threads are created only once. We propose that this single instance be returned by the constructor:
 
 ```
 class ClusterCoordinator(object):
@@ -497,9 +496,9 @@ class ClusterCoordinator(object):
 
 Here, we have created this attribute referencing `cluster_coordinator` from `strategy`. This is necessary because `Model` only keeps a reference to `strategy`, and this allows `Model` to access the `ClusterCoordinator` instance.
 
-Being a singleton is important considering there are power users who would like to `schedule` functions themselves in addition to `model.fit` usage. That is, they can instantiate one before `model.fit` does, or use one after `model.fit` has instantiated one. In either case, they should access the same `ClusterCoordinator` instance, as the one `model.fit` uses.
+Obtaining the single instance via a constructor call is useful because power users may want to `schedule` functions themselves in addition to using `model.fit`, and would naturally construct a `ClusterCoordinator` to do so. They may do this before `model.fit` does, or after `model.fit` has already instantiated one; in either case, they should get the same `ClusterCoordinator` instance that `model.fit` uses.
 
-Obtaining the singleton by calling the constructor of `ClusterCoordinator`, as opposed to an instance getter, provides the future-compatibility if we allow multiple `ClusterCoordinator`s in the future.
+Another advantage of using the constructor of `ClusterCoordinator`, rather than an instance getter, is forward compatibility: it leaves room to allow multiple `ClusterCoordinator`s in the future.
 
 #### `ClusterCoordinator`’s reference to `ParameterServerStrategy` as a `weakref`
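The single-instance-per-strategy constructor, together with the `weakref` back to the strategy mentioned in the heading above, can be sketched in plain Python. This is a hedged illustration under assumed names (`Strategy` stands in for `ParameterServerStrategy`, and `_cluster_coordinator` is the attribute the RFC describes); it is not the actual TensorFlow implementation.

```python
import weakref

class Strategy:
    """Stand-in for ParameterServerStrategy (assumed name, for illustration)."""
    def __init__(self):
        self._cluster_coordinator = None  # populated by ClusterCoordinator below

class ClusterCoordinator:
    """Sketch: the constructor returns the one coordinator per strategy."""
    def __new__(cls, strategy):
        # If this strategy already has a coordinator, reuse it, so that
        # worker and failure-handling threads would only ever be created once.
        if strategy._cluster_coordinator is not None:
            return strategy._cluster_coordinator
        self = super().__new__(cls)
        # Hold only a weak reference back to the strategy, avoiding a
        # strong reference cycle (strategy -> coordinator -> strategy).
        self._strategy = weakref.ref(strategy)
        strategy._cluster_coordinator = self
        return self

strategy = Strategy()
first = ClusterCoordinator(strategy)
second = ClusterCoordinator(strategy)  # same object as `first`
```

A power user calling `ClusterCoordinator(strategy)` before or after `model.fit` would, under this pattern, always receive the same instance, while the constructor-call surface leaves room to relax the one-per-strategy restriction later.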
