This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit 01f1814

Update sidecar eval thread part.
1 parent 16cb2be commit 01f1814

1 file changed: +3 -1 lines changed

rfcs/20201121-keras-model-fit-ps.md

Lines changed: 3 additions & 1 deletion
@@ -625,10 +625,12 @@ Pros (advantages of evaluator thread approach):
 Cons (disadvantages of evaluator thread approach):
 * This solution presents a challenge when workers can easily become unavailable, in which case it is not straightforward to immediately find another available worker to take over*
 * This solution is blocked on `tf.keras.models.load_model` being available on PS, if `variable_partitioner` is used. Here, model saving and loading are for cloning the model, so if there is an alternative to clone, this solution is not blocked.
-* Users who can afford to allocate a high priority on an evaluator task cannot do so with workers; workers would simply have the same, usually lower, priority (and thus more frequent function-takeovers)
+* Users who can afford to allocate a high priority on an evaluator task cannot do so with workers; workers would simply have the same, usually lower, priority (and thus more frequent function-takeovers)*
 
 *Fault tolerance, the first con, may further be addressed with possibly another `ClusterCoordinator`, if it shares the threads with the other `ClusterCoordinator`, and the library allows multiple function queues to be accessed by the threads. More details may be discussed in a separate RFC.
 
+*Regarding priority, the third con, we can address it by having a separate job (with only one task for now), say "eval_worker", for the worker that is solely used for evaluation. It'd be a little more work where TF_CONFIG, device filter, etc. need to be changed, but it is possible. It gives us the flexibility to assign a higher job priority.
+
 ### Fault tolerance
 
 There are two goals of fault tolerance in multi-worker training:
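
For readers of this commit: the added paragraph proposes a separate single-task job, "eval_worker", and notes that `TF_CONFIG` would need to change. The following is only a sketch of what such a `TF_CONFIG` might look like, using the standard `cluster`/`task` layout. The host names, ports, and the `build_tf_config` helper are hypothetical, and whether TensorFlow accepts an "eval_worker" job name without further library changes is an assumption the RFC itself leaves open; device filters are not shown.

```python
import json
import os


def build_tf_config(task_type, task_index):
    """Return a TF_CONFIG dict for one task of a hypothetical cluster.

    The "eval_worker" job follows the name suggested in the RFC text;
    all addresses below are placeholders.
    """
    cluster = {
        "chief": ["chief0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
        # Separate job with a single task, used only for evaluation, so a
        # scheduler could give it a different priority than "worker".
        "eval_worker": ["eval0.example.com:2222"],
    }
    if task_type not in cluster:
        raise ValueError(f"unknown job name: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index {task_index} out of range for {task_type!r}")
    return {"cluster": cluster, "task": {"type": task_type, "index": task_index}}


# Each process sets TF_CONFIG for its own task before creating a strategy.
tf_config = build_tf_config("eval_worker", 0)
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

Keeping the evaluation task in its own job, rather than as an extra entry under "worker", is what would let a cluster manager assign it a higher priority independently of the training workers.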
