Being a singleton is important considering there are power users who would like to `schedule` functions themselves in addition to `model.fit` usage. That is, they can instantiate one before `model.fit` does, or use one after `model.fit` has instantiated one. In either case, they should access the same `ClusterCoordinator` instance.
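For illustration, here is a minimal sketch of the intended user-visible behavior. The strategy/coordinator construction below uses the existing `tf.distribute` API, while the part where `model.fit` reuses the same coordinator is the proposed singleton behavior rather than something the current release guarantees:

```
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

# A power user creates the coordinator up front in order to `schedule`
# functions themselves, before `model.fit` is ever called.
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

@tf.function
def custom_step():
  return tf.constant(1.0)

coordinator.schedule(custom_step)  # user-scheduled work

with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer="sgd", loss="mse")

# Proposed behavior: `model.fit` looks up (or lazily creates and caches) the
# coordinator associated with `strategy` instead of constructing a second one,
# so user-scheduled and fit-scheduled functions go through the same instance.
```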
@@ -236,9 +236,8 @@ class Model(...):
  def fit(self, ...):
    if (self.distribute_strategy.should_use_with_coordinator() and
-       not self.distribute_strategy._cluster_coordinator):
@@ -486,13 +486,12 @@ Similarly, the hyper and slot variables an `optimizer` object uses, would be cre
Initially, we aim to have `model.evaluate` and `model.predict` be carried out only on the coordinator. That is, they do not involve distribution via a `ClusterCoordinator`, and thus the evaluate function is executed on the coordinator.
-In the longer term, we seek distributed support for `model.evaluate`, where the evaluate function is scheduled onto the workers to execute. Visitation guarantee cannot be supported currently with the parameter server training API, so we can implement distributed evaluation without it, or wait until that is supported, and integrate it. Things possibly involved with distributed `model.evaluate` include:
+In the longer term, we seek distributed support for `model.evaluate`, where the evaluate function is scheduled onto the workers to execute. The current `ClusterCoordinator` API has a limitation: distributed evaluation has no visitation guarantee when workers can become unavailable. Thus, we have a couple of options:

-* support for local variables
-* support for local resources
-* efficient skipping of dataset batches or `dataset.shard` can be tf.function'ed
+1. Implement distributed `model.evaluate` without visitation guarantee, but require the user's opt-in because of the behavior change (e.g. by `model.evaluate(..., distributed_eval=True)`)
+2. Support distributed `model.evaluate` only after `ClusterCoordinator` provides a visitation guarantee mechanism

-With those, we do not expect an API change at the `model.fit` level, but if we do encounter something that results in a change, it is reasonable to add an argument `model.fit(distribute_eval=...)`.
+Note that, similar to the dataset factory change for `model.fit`, the validation dataset will also need to be a function. That is, `model.fit` will take a `validation_data_fn` instead of `validation_data`, and `model.evaluate` will take a `dataset_fn` as opposed to a `dataset` instance.
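For illustration, a rough sketch of how the proposed arguments might be used; `validation_data_fn`, `dataset_fn`, and `distributed_eval` are names proposed in the text above and are not existing Keras API, while `model`, `train_dataset_fn`, `x_eval`, and `y_eval` are assumed to be defined elsewhere:

```
import tensorflow as tf

def make_eval_dataset():
  # Built from a function, mirroring the dataset-function requirement
  # for training data under parameter server training.
  return tf.data.Dataset.from_tensor_slices((x_eval, y_eval)).batch(32)

# Proposed: validation data is supplied as a function rather than a dataset.
model.fit(train_dataset_fn, epochs=5, steps_per_epoch=100,
          validation_data_fn=make_eval_dataset, validation_steps=10)

# Proposed: `model.evaluate` takes a dataset function, and `distributed_eval=True`
# would be the explicit opt-in for worker-distributed evaluation (option 1 above).
model.evaluate(dataset_fn=make_eval_dataset, steps=10, distributed_eval=True)
```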
See below “Evaluation” section for other proposed evaluation solutions accompanying `model.fit` usage.
@@ -535,15 +534,15 @@ In addition to the existing train-evaluate solution provided by `model.fit`, we
#### Built-in, alternating evaluation in `model.fit`

-If `validation_data` argument is provided, and certain conditions are satisfied, `model.fit` also runs evaluation via `model.evaluate` API every epoch, in an train-evaluate alternating manner. As described above, at this time, only the coordinator is used for `model.evaluate` evaluation, and we plan to extend this to worker-distributed evaluation when visitation guarantee is supported.
+If the `validation_data` argument is provided, and certain conditions are satisfied, `model.fit` also runs evaluation via the `model.evaluate` API every epoch, in a train-evaluate alternating manner. As described above, at this time, only the coordinator is used for `model.evaluate` evaluation, and we plan to extend this to worker-distributed evaluation when visitation guarantee is supported. See the above "model.evaluate" section for more information.
#### Sidecar evaluation
-In addition to the built-in evaluation `model.fit` provides, sidecar evaluation is also supported with a [recommended user flow](https://www.tensorflow.org/tutorials/distribute/parameter_server_training#side-car_evaluation).
+In addition to the built-in evaluation `model.fit` provides, sidecar evaluation is also supported. Currently, we have a [recommended user flow](https://www.tensorflow.org/tutorials/distribute/parameter_server_training#side-car_evaluation) using a sidecar evaluator task for CTL users. This section discusses the proposed changes to the sidecar evaluator accompanying `model.fit` usage with parameter server training.

-##### SidecarEvaluator API
+##### A sidecar evaluator task

-We plan to propose a `SidecarEvaluator` API in a separate RFC for user’s convenience: with this, user is expected to kick start an additional task `evaluator`, in which the python program runs a `SidecarEvaluator` as follows:
+In the short term, a task that is allocated for evaluation (aka a sidecar evaluator) continues to be the recommended evaluation solution for PS training. We plan to propose a `SidecarEvaluator` API in a separate RFC for the user’s convenience: with this, the user is expected to kick-start an additional task `evaluator`, in which the Python program runs a `SidecarEvaluator` as follows:
```
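# The full example is elided from this hunk; the following is a sketch of the
# intended usage. The parameter names are assumptions (modeled on the later
# `tf.keras.utils.SidecarEvaluator` API) and may differ from the separate RFC.
SidecarEvaluator(
    model=model,                   # a model with the same structure as the one being trained
    data=eval_dataset,             # a `tf.data.Dataset` used for evaluation
    checkpoint_dir='/tmp/ckpt',    # directory the training job writes checkpoints to
    steps=None,                    # evaluate until the dataset is exhausted
    max_evaluations=None,          # keep evaluating each newly written checkpoint
).start()
```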
@@ -571,7 +570,9 @@ SidecarEvaluator(
##### A sidecar evaluation thread on coordinator
-A potentially more seamless and encapsulated sidecar evaluation, where the user is not required to allocate an evaluator task or run separate code, can be done with an evaluation thread on the coordinator. This thread would remotely execute an evaluation function on a worker, and wait for its result synchronously. Once the result is returned, it can write a summary, adjust learning rate, or signal to end the training. Then, it re-`schedule`s an evaluation function, and so on:
+A potentially more seamless and encapsulated sidecar evaluation, where the user is not required to allocate an evaluator task or run separate code (for evaluation), can be done with an evaluation thread on the coordinator. With this approach, the user does not allocate a task with type 'evaluator', because one 'worker' task (that runs a `tf.distribute.Server`) from the cluster can be used for evaluation. It can be any of the workers, but for convenience, let’s say the Nth worker is used for evaluation.
+
+The thread would be started by `model.fit` if the user opts in via an argument such as `fit(..., run_sidecar_eval_thread=True)`. The thread would remotely execute an evaluation function on this worker #N and wait for its result synchronously. Once the result is returned, it can write a summary, adjust the learning rate, or signal to end the training. After that, it re-`schedule`s an evaluation function, and so on:
```
class Model(...):

@@ -595,22 +596,26 @@ class Model(...):
    tmp_logs = self.test_function(iterator)
    ... # Callbacks, etc.

  def fit(self, ...):
    # At some point, we start a thread for sidecar eval
    t = threading.Thread(target=self._continuously_evaluate)
+   t.start()
+   ...
+   if run_sidecar_eval_thread:
+     self.should_eval = False
+     t.join()
```
+Note that with this approach, the training cluster will be limited to the remaining N-1 workers, so that training and evaluation do not block each other.
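For illustration, opting in might look like the following; the `run_sidecar_eval_thread` argument is the proposal sketched above, not an existing API, and `train_dataset_fn` is assumed to be defined elsewhere:

```
# One of the N workers is claimed by the evaluation thread; training
# functions are scheduled onto the remaining N-1 workers.
model.fit(train_dataset_fn, epochs=10, steps_per_epoch=200,
          run_sidecar_eval_thread=True)
```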
If we compare the sidecar evaluator thread solution with the sidecar evaluator task (process) solution:
608
613
609
-
Pros:
610
-
* This does not require a task to be set aside as evaluator
614
+
Pros (advantages of evaluator thread approach):
615
+
* This does not require a task to be set aside as evaluator, so 1) less work on the user, and 2) there is one fewer version of python binary
611
616
* There is easier communication between the sidecar evaluator (thread) and the coordinator main thread, which is important for many callbacks
612
617
613
-
Cons:
618
+
Cons (disadvantages of evaluator thread approach):
614
619
* This solution presents a challenge when workers can easily become unavailable, in which case it is not straightforward to immediately find another available worker to take over*
615
620
* This solution is blocked on `tf.keras.models.load_model` being available on PS, if `variable_partitioner` is used. Here, model saving and loading are for cloning the model, so if there is an alternative to clone, this solution is not blocked.
616
621
* Users who can afford to allocate a high priority on an evaluator task cannot do so with workers; workers would simply have the same, usually lower, priority (and thus more frequent function-takeovers)