rfcs/20201121-keras-model-fit-ps.md (+11 -9 lines)
@@ -147,11 +147,11 @@ Cons:

###### Option 2: dataset factory

-For future-compatibility of the `model.fit` API, where `dataset_fn` may have a signature change, a `DatasetFactory` can come in handy, which determines how the function is supposed to be used. In the case where a `ClusterCoordinator` is used, it is supposed to be invoked with no arguments. With this, the user flow becomes:
+For future-compatibility of the `model.fit` API, where `dataset_fn` may have a signature change, a `DatasetFactory` can come in handy, which determines how the function is supposed to be used.
history = model.fit(DatasetFactory(dataset_fn), epochs=..., steps_per_epoch=..., callbacks=[...])
```
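For readers skimming the diff, here is a minimal sketch of what a `DatasetFactory` wrapper along these lines could look like. Only the `DatasetFactory` name and the fact that, under a `ClusterCoordinator`, the wrapped function is invoked with no arguments come from the RFC text; the body below is an illustrative assumption, not the proposal's final code.

```
# Illustrative sketch only: the RFC names `DatasetFactory`; everything
# else here is an assumption about how such a wrapper could behave.
import tensorflow as tf


class DatasetFactory(object):
  """Wraps a `dataset_fn` and decides how it is to be invoked."""

  def __init__(self, dataset_fn):
    self._dataset_fn = dataset_fn

  def __call__(self, *args, **kwargs):
    # A future signature change of `dataset_fn` would be absorbed here,
    # without touching `model.fit` call sites.
    return self._dataset_fn(*args, **kwargs)


def dataset_fn():
  return tf.data.Dataset.range(64).batch(8)


factory = DatasetFactory(dataset_fn)
dataset = factory()  # invoked with no arguments when a coordinator is in use
```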
@@ -220,10 +220,16 @@ Being a singleton is important considering there are power users who would like

##### Have an attribute in `ParameterServerStrategy` that holds the `ClusterCoordinator`

-We propose that an attribute is added to the `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. We instantiate the `ClusterCoordinator` as soon as `ParameterServerStrategy` is instantiated:
+We propose that an attribute is added to `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. When a `ClusterCoordinator` is instantiated, this attribute is set. Here, we assume that the distribution `Strategy` object can determine whether it is supposed to be used with a `ClusterCoordinator`; see the “Changes in tf.distribute” section below for more information.

```
+class ClusterCoordinator(...):
+  def __init__(self, strategy):
+    self.strategy = weakref.ref(strategy)
+    strategy.cluster_coordinator = self
+```

+And we instantiate the `ClusterCoordinator` as soon as `model.fit` is called for the first time. It is then reused by subsequent `fit` calls, or on a different model.

```
class Model(...):
@@ -236,11 +242,7 @@ class Model(...):
    ... # the rest of fit

```
-To avoid the circular reference between `ParameterServerStrategy` and `ClusterCoordinator` and the resulting leak, the `coordinator`’s reference to `strategy` should be a `weakref`.
-
-This option comes with the assumption that there is always only one `ParameterServerStrategy` in use*, and that we are not supporting the use case where the user creates an additional `ClusterCoordinator`.
-
-*This is because there is currently no clean way to shut down a `ClusterCoordinator`, so we cannot support more than one `ClusterCoordinator`, and thus no more than one `ParameterServerStrategy`.
+To avoid the leak resulting from the circular reference between `ParameterServerStrategy` and `ClusterCoordinator`, the `coordinator`’s reference to `strategy` should be a `weakref`.


#### Keras `Model` changes
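Putting the two pieces of this hunk together, here is a runnable toy version of the ownership pattern described above: lazy creation on the first `fit` call, a strong forward reference from strategy to coordinator, and a weak back-reference to avoid the cycle. The stand-in `Strategy` class is an assumption for illustration (it stands in for `ParameterServerStrategy`); the exact guard in the RFC's final code may differ.

```
import weakref


class Strategy(object):  # stands in for ParameterServerStrategy
  def __init__(self):
    self.cluster_coordinator = None


class ClusterCoordinator(object):
  def __init__(self, strategy):
    self.strategy = weakref.ref(strategy)  # weak back-reference: no cycle
    strategy.cluster_coordinator = self    # strong forward reference


strategy = Strategy()
if strategy.cluster_coordinator is None:    # first `fit` call: create once
  ClusterCoordinator(strategy)
coordinator = strategy.cluster_coordinator  # reused by later `fit` calls
assert coordinator.strategy() is strategy   # dereference the weakref
```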
@@ -610,7 +612,7 @@ Pros:

Cons:
* This solution presents a challenge when workers can easily become unavailable, in which case it is not straightforward to immediately find another available worker to take over*
-* This solution is blocked on `tf.keras.models.load_model` being available on PS (if we have another cloning solution, that works too)
+* This solution is blocked on `tf.keras.models.load_model` being available on PS, if `variable_partitioner` is used. Here, model saving and loading are for cloning the model, so if there is an alternative way to clone, this solution is not blocked (see the sketch below).
* Users who can afford to allocate a high priority to an evaluator task cannot do so with workers; workers would simply have the same, usually lower, priority (and thus more frequent function takeovers)

*Fault tolerance, the first con, may be further addressed with possibly another `ClusterCoordinator`, if it shares the threads with the other `ClusterCoordinator` and the library allows multiple function queues to be accessed by the threads. More details may be discussed in a separate RFC.
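On the cloning point in the second con: below is a sketch of the save/load round trip that `tf.keras.models.load_model` would enable for sidecar evaluation. The save path, the toy model, and the evaluation data are illustrative assumptions; only the save/load-based cloning itself is what the bullet refers to.

```
# Illustrative only: "cloning" the model onto the evaluator via save/load.
import tensorflow as tf

# A toy model stands in for the one being trained under the strategy.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss="mse")

eval_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([32, 4]), tf.random.normal([32, 1]))).batch(8)

# On the coordinator/chief: persist the model.
model.save("/tmp/model_for_sidecar_eval")

# On the evaluator task: reconstruct ("clone") the model and evaluate.
eval_model = tf.keras.models.load_model("/tmp/model_for_sidecar_eval")
eval_model.evaluate(eval_dataset)
```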