This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit 2173b10

Resolving 'The setup of `ClusterCoordinator` with `model.fit` usage' part.
1 parent b9144a3 commit 2173b10

1 file changed: 37 additions, 33 deletions

rfcs/20201121-keras-model-fit-ps.md

@@ -197,10 +197,14 @@ For compatibility with other strategies, we propose that `dataset_fn` takes a si

*This is also in preparation for multi-replica support in the future. See [tutorial](https://www.tensorflow.org/tutorials/distribute/parameter_server_training?hl=uk#dispatch_training_steps_to_remote_workers) for more information.

-#### The setup of `ClusterCoordinator`
+#### The setup of `ClusterCoordinator` with `model.fit` usage
+
+##### Basic use case: `ClusterCoordinator` being internal

To take advantage of TF2's support for parameter server training, a `ClusterCoordinator` should be created to handle asynchronous function scheduling and joining. The preferred route should be that such an object is abstracted away from the user as an implementation detail of the `model.fit` training API, since we do not expect users to `schedule` functions themselves, or to synchronize the cluster, in the basic workflow.

+##### Advanced use case: `ClusterCoordinator` as a singleton
+
Since a `ClusterCoordinator` instance spins off worker and failure-handling threads, there should only be one `ClusterCoordinator` at any given time, and making it a singleton ensures that those threads are only created once:

```
@@ -212,41 +216,13 @@ class ClusterCoordinator(object):
    return ClusterCoordinator.instance
```

-Being a singleton is important considering there are power users who would like to `schedule` functions themselves in addition to `model.fit` usage. That is, they can instantiate one before `model.fit` does, or use one after `model.fit` has instantiate one. In either case, they should access the same `ClusterCoordinator` instance.
+Being a singleton is important considering there are power users who would like to `schedule` functions themselves in addition to `model.fit` usage. That is, they can instantiate one before `model.fit` does, or use one after `model.fit` has instantiated one. In either case, they should access the same `ClusterCoordinator` instance.

-In terms of who keeps track of the `ClusterCoordinator` for `model.fit`, and when it starts allocating threads, there are a few options. Here, we assume that the distribution `Strategy` object can determine whether or not it is supposed to be used with a `ClusterCoordinator`. See the “Changes in tf.distribute” section below for more information.
-
-
-##### Option 1: Attach the `ClusterCoordinator`’s lifecycle to `model.fit`
-
-With this option, an attribute is added to the `Model` that keeps track of the `ClusterCoordinator`, and it is instantiated when `model.fit` is called.
+##### Have an attribute in `ParameterServerStrategy` that holds the `ClusterCoordinator`

+We propose that an attribute is added to the `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. We instantiate the `ClusterCoordinator` as soon as the `ParameterServerStrategy` is instantiated:

```
-class Model(...):
-  def __init__(self):
-    self._cluster_coordinator = None
-    ...
-
-  def fit(self, ...):
-    if (self.distribute_strategy.should_use_with_coordinator() and
-        not self._cluster_coordinator):
-      self._cluster_coordinator = cluster_coordinator.ClusterCoordinator(
-          self.distribute_strategy)
-    ...  # the rest of `fit`
-    self._cluster_coordinator.shut_down()  # Shut down at the end of `fit`
-    self._cluster_coordinator = None
-
-class ClusterCoordinator(object):
-  def shut_down(self):
-    # Join the threads and terminate resources. We don't have this implemented yet.
-```
-
-
-
-##### Option 2: Have an attribute in `ParameterServerStrategy` that holds the `ClusterCoordinator`
-
-With this option, an attribute is added to the `ParameterServerStrategy` to keep track of the `ClusterCoordinator`. We start the `ClusterCoordinator` as soon as `model.fit` is called for the first time, and do not attempt to shut it down after `fit` completes. It will then be reused for the next `fit`, or on a different model.


```
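For concreteness, here is a minimal plain-Python sketch of the singleton behavior described above, so that `model.fit` and power users who `schedule` functions themselves always end up with the same coordinator (and the same worker threads). The lock, the `_instance` attribute, and the trivial `schedule` are illustrative stand-ins, not the real `tf.distribute` implementation:

```
import threading


class ClusterCoordinator(object):
  """Illustrative singleton; the real one spins off worker and failure-handling threads."""

  _instance = None
  _lock = threading.Lock()

  def __new__(cls, strategy):
    with cls._lock:
      if cls._instance is None:
        cls._instance = super(ClusterCoordinator, cls).__new__(cls)
        cls._instance._strategy = strategy
        # Worker and failure-handling threads would be started exactly once here.
      return cls._instance

  def schedule(self, fn, args=()):
    """Stand-in for asynchronously dispatching `fn` to a remote worker."""
    return fn(*args)


# Whether `model.fit` or a power user constructs the coordinator first,
# both get the same object back.
strategy = object()  # placeholder for a `ParameterServerStrategy`
assert ClusterCoordinator(strategy) is ClusterCoordinator(strategy)
```

Under the proposal above, `model.fit` would reach this shared instance through the strategy's attribute rather than constructing or shutting down one of its own.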
@@ -591,7 +567,7 @@ SidecarEvaluator(
* also accept the checkpoint files saved by `ModelCheckpoint` callback for periodic evaluation.
* accept arbitrary callbacks to be used in its internal `model.evaluate` call

-##### An sidecar evaluation thread on coordinator
+##### A sidecar evaluation thread on coordinator

A potentially more seamless and encapsulated sidecar evaluation, where the user is not required to allocate an evaluator task or run separate code, can be done with an evaluation thread on the coordinator. This thread would remotely execute an evaluation function on a worker, and wait for its result synchronously. Once the result is returned, it can write a summary, adjust learning rate, or signal to end the training. Then, it re-`schedule`s an evaluation function, and so on:

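As a rough, illustrative sketch of that loop (not code from this RFC): it assumes the coordinator's `schedule` returns a value whose `fetch()` blocks until the remotely executed `eval_fn` finishes, and that `eval_fn`, the summary writer, the accuracy threshold, and the `stop_event` used to signal the end of training are supplied by the caller; adjusting the learning rate could be handled at the same point as the stopping check:

```
import threading

import tensorflow as tf


def start_sidecar_evaluation(coordinator, eval_fn, summary_writer, stop_event,
                             target_accuracy=0.99):
  """Starts an evaluation thread on the coordinator (illustrative only)."""

  def _eval_loop():
    eval_round = 0
    while not stop_event.is_set():
      # Remotely execute an evaluation function on a worker and wait for its
      # result synchronously.
      metrics = coordinator.schedule(eval_fn).fetch()
      with summary_writer.as_default():
        tf.summary.scalar("eval/accuracy", metrics["accuracy"], step=eval_round)
      if metrics["accuracy"] >= target_accuracy:
        stop_event.set()  # Signal the training loop to end.
      eval_round += 1
      # The loop then re-`schedule`s another evaluation, and so on.

  thread = threading.Thread(target=_eval_loop, daemon=True)
  thread.start()
  return thread
```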
@@ -785,3 +761,31 @@ dataset = tf.data.Dataset.X... # Make use of `preproc_stage` for transformation
history = model.fit(dataset, epochs=..., steps_per_epoch=..., callbacks=[...])
logging.info("result: %r", history)
```
+
+
+### Attach the `ClusterCoordinator`’s lifecycle to `model.fit`
+
+With this option, an attribute is added to the `Model` that keeps track of the `ClusterCoordinator`, and it is instantiated when `model.fit` is called.
+
+
+```
+class Model(...):
+  def __init__(self):
+    self._cluster_coordinator = None
+    ...
+
+  def fit(self, ...):
+    if (self.distribute_strategy.should_use_with_coordinator() and
+        not self._cluster_coordinator):
+      self._cluster_coordinator = cluster_coordinator.ClusterCoordinator(
+          self.distribute_strategy)
+    ...  # the rest of `fit`
+    self._cluster_coordinator.shut_down()  # Shut down at the end of `fit`
+    self._cluster_coordinator = None
+
+class ClusterCoordinator(object):
+  def shut_down(self):
+    # Join the threads and terminate resources. We don't have this implemented yet.
+```
+
+At this time, we're proposing to have an attribute in `ParameterServerStrategy` that holds the `ClusterCoordinator` instead.
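To contrast the two lifecycles in runnable form, here is a small plain-Python sketch; the class bodies are stand-ins rather than the real Keras or `tf.distribute` code. Under the chosen proposal, the coordinator is owned by the strategy, so `fit` merely borrows it and never shuts it down, and the same coordinator is reused by later `fit` calls and by other models built with the same strategy:

```
class ClusterCoordinator(object):
  """Stand-in coordinator; worker threads elided."""

  def __init__(self, strategy):
    self._strategy = strategy


class ParameterServerStrategy(object):
  """The strategy holds the coordinator for its whole lifetime."""

  def __init__(self):
    self._cluster_coordinator = ClusterCoordinator(self)


class Model(object):
  def __init__(self, distribute_strategy):
    self.distribute_strategy = distribute_strategy

  def fit(self):
    # Unlike the rejected option above, `fit` neither creates nor shuts down
    # a coordinator; it uses the one attached to the strategy.
    coordinator = self.distribute_strategy._cluster_coordinator
    # ... training functions would be `schedule`d through `coordinator` here ...
    return coordinator


strategy = ParameterServerStrategy()
model_a, model_b = Model(strategy), Model(strategy)
# Reused across `fit` calls and across models sharing the strategy.
assert model_a.fit() is model_a.fit() is model_b.fit()
```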
