|
132 | 132 | "\n",
|
133 | 133 | "With custom training loops, the `tf.distribute.coordinator.ClusterCoordinator` class is the key component used for the coordinator.\n",
|
134 | 134 | "\n",
|
135 |
| - "- The `ClusterCoordinator` class needs to work in conjunction with a `tf.distribute.Strategy` object.\n", |
| 135 | + "- The `ClusterCoordinator` class needs to work in conjunction with a `tf.distribute.ParameterServerStrategy` object.\n", |
136 | 136 | "- This `tf.distribute.Strategy` object is needed to provide the information of the cluster and is used to define a training step, as demonstrated in [Custom training with tf.distribute.Strategy](custom_training.ipynb).\n",
|
137 | 137 | "- The `ClusterCoordinator` object then dispatches the execution of these training steps to remote workers.\n",
|
138 |
| - "- For parameter server training, the `ClusterCoordinator` needs to work with a `tf.distribute.ParameterServerStrategy`.\n", |
139 | 138 | "\n",
|
140 | 139 | "The most important API provided by the `ClusterCoordinator` object is `schedule`:\n",
|
141 | 140 | "\n",
|
|
883 | 882 | "source": [
|
884 | 883 | "## Evaluation\n",
|
885 | 884 | "\n",
|
886 |
| - "The two main approaches to performing evaluation with `tf.distribute.ParameterServerStrategy` training are inline evaluation and sidecar evaluation. Each has its own pros and cons as described below. The inline evaluation method is recommended if you don't have a preference." |
| 885 | + "The two main approaches to performing evaluation with `tf.distribute.ParameterServerStrategy` training are inline evaluation and sidecar evaluation. Each has its own pros and cons as described below. The inline evaluation method is recommended if you don't have a preference. For users using `Model.fit`, `Model.evaluate` uses inline (distributed) evaluation under the hood." |
887 | 886 | ]
|
888 | 887 | },
|
889 | 888 | {
|
|
985 | 984 | "id": "cKrQktZX5z7a"
|
986 | 985 | },
|
987 | 986 | "source": [
|
988 |
| - "Note: The `schedule` and `join` methods of `tf.distribute.coordinator.ClusterCoordinator` don’t support visitation guarantees or exactly-once semantics. In other words, there is no guarantee that all evaluation examples in a dataset will be evaluated exactly once; some may not be visited and some may be evaluated multiple times. The `tf.data` service API can be used to provide exactly-once visitation for evaluation when using `ParameterServerStrategy` (refer to the _Dynamic Sharding_ section of the `tf.data.experimental.service` API documentation)." |
| 987 | + "#### Enabling exactly-once evaluation\n", |
| 988 | + "<a id=\"exactly_once_evaluation\"></a>\n", |
| 989 | + "\n", |
| 990 | + "The `schedule` and `join` methods of `tf.distribute.coordinator.ClusterCoordinator` don’t support visitation guarantees or exactly-once semantics by default. In other words, in the above example there is no guarantee that all evaluation examples in a dataset will be evaluated exactly once; some may not be visited and some may be evaluated multiple times. \n", |
| 991 | + "\n", |
| 992 | + "Exactly-once evaluation may be preferred to reduce the variance of evaluation across epochs, and improve model selection done via early stopping, hyperparameter tuning, or other methods. There are different ways to enable exactly-once evaluation:\n", |
| 993 | + "\n", |
| 994 | + "- With a `Model.fit/.evaluate` workflow, it can be enabled by adding an argument to `Model.compile`. Refer to docs for the `pss_evaluation_shards` argument.\n", |
| 995 | + "- The `tf.data` service API can be used to provide exactly-once visitation for evaluation when using `ParameterServerStrategy` (refer to the _Dynamic Sharding_ section of the `tf.data.experimental.service` API documentation).\n", |
| 996 | + "- [Sidecar evaluation](#sidecar_evaluation) provides exactly-once evaluation by default, since the evaluation happens on a single machine. However this can be much slower than performing evaluation distributed across many workers.\n", |
| 997 | + "\n", |
| 998 | + "The first option, using `Model.compile`, is the suggested solution for most users. \n", |
| 999 | + "\n", |
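| | + "As an illustration, here is a minimal sketch of the `Model.compile` option. It assumes `strategy` is the `tf.distribute.ParameterServerStrategy` created earlier and that `train_dataset` and `eval_dataset` are defined as in the `Model.fit` sections above; the tiny model is a placeholder:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "with strategy.scope():\n", |
| | + "  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])\n", |
| | + "  model.compile(\n", |
| | + "      optimizer=\"sgd\",\n", |
| | + "      loss=\"mse\",\n", |
| | + "      # An integer shard count, or 'auto' to derive one from the cluster.\n", |
| | + "      # Any value other than the default of 0 requests exactly-once\n", |
| | + "      # visitation of the evaluation dataset.\n", |
| | + "      pss_evaluation_shards=\"auto\",\n", |
| | + "  )\n", |
| | + "\n", |
| | + "model.fit(train_dataset, epochs=5, steps_per_epoch=10,\n", |
| | + "          validation_data=eval_dataset)\n", |
| | + "```\n", |
| | + "\n", |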
| 1000 | + "Exactly-once evaluation has some limitations:\n", |
| 1001 | + "\n", |
| 1002 | + "- It is not supported to write a custom distributed evaluation loop with an exactly-once visitation guarantee. File a GitHub issue if you need support for this.\n", |
| 1003 | + "- It cannot automatically handle computation of metrics that use the `Layer.add_metric` API. These should be excluded from evaluation, or reworked into `Metric` objects." |
989 | 1004 | ]
|
990 | 1005 | },
|
991 | 1006 | {
|
|
997 | 1012 | "### Sidecar evaluation\n",
|
998 | 1013 | "<a id=\"sidecar_evaluation\"></a>\n",
|
999 | 1014 | "\n",
|
1000 |
| - "Another method for defining and running an evaluation loop in `tf.distribute.ParameterServerStrategy` training is called _sidecar evaluation_, in which you create a dedicated evaluator task that repeatedly reads checkpoints and runs evaluation on the latest checkpoint (refer to [this guide](../../guide/checkpoint.ipynb) for more details on checkpointing). The chief and worker tasks do not spend any time on evaluation, so for a fixed number of iterations the overall training time should be shorter than using other evaluation methods. However, it requires an additional evaluator task and periodic checkpointing to trigger evaluation." |
| 1015 | + "Another method for defining and running an evaluation loop in `tf.distribute.ParameterServerStrategy` training is called _sidecar evaluation_, in which you create a dedicated evaluator task that repeatedly reads checkpoints and runs evaluation on the latest checkpoint (refer to [this guide](../../guide/checkpoint.ipynb) for more details on checkpointing). The coordinator and worker tasks do not spend any time on evaluation, so for a fixed number of iterations the overall training time should be shorter than using other evaluation methods. However, it requires an additional evaluator task and periodic checkpointing to trigger evaluation." |
1001 | 1016 | ]
|
1002 | 1017 | },
|
1003 | 1018 | {
|
|
1348 | 1363 | "source": [
|
1349 | 1364 | "### Custom training loop specifics\n",
|
1350 | 1365 | "\n",
|
1351 |
| - "- `ClusterCoordinator.schedule` doesn't support visitation guarantees for a dataset.\n", |
| 1366 | + "- `ClusterCoordinator.schedule` doesn't support visitation guarantees for a dataset in general, although a visitation guarantee for evaluation is possible through `Model.fit/.evaluate`. See [Enabling exactly-once evaluation](#exactly_once_evaluation).\n", |
1352 | 1367 | "- When `ClusterCoordinator.create_per_worker_dataset` is used with a callable as input, the whole dataset must be created inside the function passed to it.\n",
|
1353 | 1368 | "- `tf.data.Options` is ignored in a dataset created by `ClusterCoordinator.create_per_worker_dataset`."
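| | + "\n", |
| | + "For example, here is a minimal sketch of the callable pattern from the second point, assuming `coordinator` is the `tf.distribute.coordinator.ClusterCoordinator` created earlier; the dataset itself is a placeholder:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "def per_worker_dataset_fn():\n", |
| | + "  # The whole dataset, including any datasets it is derived from,\n", |
| | + "  # must be created inside this function.\n", |
| | + "  dataset = tf.data.Dataset.range(1000).map(\n", |
| | + "      lambda x: (tf.cast(x, tf.float32), tf.cast(x % 2, tf.float32)))\n", |
| | + "  return dataset.repeat().batch(8)\n", |
| | + "\n", |
| | + "per_worker_dataset = coordinator.create_per_worker_dataset(\n", |
| | + "    per_worker_dataset_fn)\n", |
| | + "per_worker_iterator = iter(per_worker_dataset)\n", |
| | + "```" |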
|
1354 | 1369 | ]
|
|
1357 | 1372 | "metadata": {
|
1358 | 1373 | "accelerator": "GPU",
|
1359 | 1374 | "colab": {
|
1360 |
| - "collapsed_sections": [], |
1361 | 1375 | "name": "parameter_server_training.ipynb",
|
1362 | 1376 | "provenance": [],
|
1363 | 1377 | "toc_visible": true
|
|