
Commit 389c14e

Updates
1 parent 2173b10 commit 389c14e

File tree

1 file changed: +11 −9 lines changed


rfcs/20201121-keras-model-fit-ps.md

Lines changed: 11 additions & 9 deletions
@@ -147,11 +147,11 @@ Cons:

###### Option 2: dataset factory

-For future-compatibility of the `model.fit` API where a `dataset_fn` may have a signature change, a `DatasetFactory` can come in handy which determines how the function is supposed to be used. In the case where a `ClusterCoordinator` is used, it is supposed to be invoked with no arguments. With this, the user flow will become:
+For future-compatibility of the `model.fit` API where a `dataset_fn` may have a signature change, a `DatasetFactory` can come in handy which determines how the function is supposed to be used.


```
-def dataset_fn(): return tf.data.Dataset.from_tensor_slices(...)
+def dataset_fn(input_context): return tf.data.Dataset.from_tensor_slices(...)
history = model.fit(DatasetFactory(dataset_fn), epochs=..., steps_per_epoch=..., callbacks=[...])
```

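As illustration (not part of the committed diff): a minimal sketch of how a factory wrapper along these lines could behave, assuming a hypothetical `DatasetFactory` whose `__call__` defers to the wrapped `dataset_fn`; the class body and the optional `input_context` handling are assumptions, not the RFC's final API.

```
# Illustrative sketch only; this `DatasetFactory` is a hypothetical helper,
# not the class proposed in the RFC.
import tensorflow as tf

class DatasetFactory:
  """Wraps a `dataset_fn` so the caller decides how it is invoked."""

  def __init__(self, dataset_fn):
    self._dataset_fn = dataset_fn

  def __call__(self, input_context=None):
    # With a `ClusterCoordinator`, the factory may be invoked with no
    # arguments; otherwise an `InputContext` could be passed through.
    if input_context is None:
      return self._dataset_fn()
    return self._dataset_fn(input_context)

def dataset_fn(input_context=None):
  return tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0]).repeat().batch(2)

factory = DatasetFactory(dataset_fn)
dataset = factory()  # `model.fit(factory, ...)` would invoke it similarly.
```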
@@ -220,10 +220,16 @@ Being a singleton is important considering there are power users who would like

##### Have an attribute in `ParameterServerStrategy` that holds the `ClusterCoordinator`

-We propose that an attribute is added to the `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. We instantiate `ClusterCoordinator` as soon as `ParameterServerStrategy` is instantiated:
+We propose that an attribute is added to `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. When a `ClusterCoordinator` is instantiated, this attribute will be set. Here, we assume that the distribution `Strategy` object can determine whether or not it is supposed to be used with a `ClusterCoordinator`. See the “Changes in tf.distribute” section below for more information.

```
+class ClusterCoordinator(...):
+  def __init__(self, strategy):
+    self.strategy = weakref.ref(strategy)
+    strategy.cluster_coordinator = self
+```

+And we instantiate the `ClusterCoordinator` as soon as `model.fit` is called for the first time. It will then be reused for the next `fit`, or on a different model.

```
class Model(...):
@@ -236,11 +242,7 @@ class Model(...):
... # the rest of fit

```
-To avoid the circular referencing between `ParameterServerStrategy` and `ClusterCoordinator` and the resulting leak, the `coordinator`’s reference to `strategy` should be a `weakref`.
-
-This option is with the assumption that there is always only one `ParameterServerStrategy` used*, and that we are not supporting the use case where the user creates an additional `ClusterCoordinator`.
-
-*This is because there's currently not yet a clean way to shut down `ClusterCoordinator`, so we can't support more than one `ClusterCoordinator`, and thus no more than one `ParameterServerStrategy`.
+To avoid the leak resulting from the circular reference between `ParameterServerStrategy` and `ClusterCoordinator`, the `coordinator`’s reference to `strategy` should be a `weakref`.


#### Keras `Model` changes
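As illustration (not part of the committed diff): a minimal sketch of the lazy creation and reuse described above, assuming the strategy can report whether it should be used with a coordinator; the attribute name `should_use_with_coordinator` and the `fit` override are hypothetical, not the RFC's final code.

```
# Illustrative sketch only; `should_use_with_coordinator` and the lazy
# creation below are assumptions about the proposal, not confirmed API.
import weakref
import tensorflow as tf

class ClusterCoordinator:
  def __init__(self, strategy):
    # A weak reference avoids the reference cycle (and leak) between the
    # strategy and the coordinator that the diff above calls out.
    self.strategy = weakref.ref(strategy)
    strategy.cluster_coordinator = self

class Model(tf.keras.Model):
  def fit(self, *args, **kwargs):
    strategy = self.distribute_strategy
    # Create the coordinator the first time `fit` is called; subsequent
    # `fit` calls (on this model or another) reuse the one stored on the
    # strategy object.
    needs_coordinator = getattr(strategy, "should_use_with_coordinator", False)
    if needs_coordinator and getattr(strategy, "cluster_coordinator", None) is None:
      ClusterCoordinator(strategy)
    return super().fit(*args, **kwargs)
```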
@@ -610,7 +612,7 @@ Pros:

Cons:
* This solution presents a challenge when workers can easily become unavailable, in which case it is not straightforward to immediately find another available worker to take over*
-* This solution is blocked on `tf.keras.models.load_model` being available on PS (if we have another cloning solution, that works too)
+* This solution is blocked on `tf.keras.models.load_model` being available on PS, if `variable_partitioner` is used. Here, model saving and loading are for cloning the model, so if there is an alternative way to clone, this solution is not blocked.
* Users who can afford to allocate a high priority on an evaluator task cannot do so with workers; workers would simply have the same, usually lower, priority (and thus more frequent function-takeovers)

*Fault tolerance, the first con, may further be addressed, possibly with another `ClusterCoordinator`, if it shares the threads with the other `ClusterCoordinator` and the library allows multiple function queues to be accessed by the threads. More details may be discussed in a separate RFC.
