|
194 | 194 | "id": "fLW6D2TzvC-4"
|
195 | 195 | },
|
196 | 196 | "source": [
|
197 |
| - "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:" |
| 197 | + "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:" |
198 | 198 | ]
|
199 | 199 | },
|
200 | 200 | {
|
|
439 | 439 | "\n",
|
440 | 440 | "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
|
441 | 441 | "\n",
|
442 |
| - "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
| 442 | + "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
443 | 443 | ]
|
444 | 444 | },
|
445 | 445 | {
|
|
882 | 882 | "\n",
|
883 | 883 | "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
|
884 | 884 | "\n",
|
885 |
| - "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback." |
| 885 | + "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team is introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds support for single-worker training for a consistent experience, and has removed the fault tolerance functionality from the existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback." |
886 | 886 | ]
|
887 | 887 | },
|
888 | 888 | {
|
|
1129 | 1129 | "\n",
|
1130 | 1130 | "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
|
1131 | 1131 | "\n",
|
1132 |
| - "Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n", |
1133 |
| - "Below are two examples for both multi-worker training and single worker training." |
| 1132 | + "Currently, the `BackupAndRestore` callback supports single-worker training with no strategy or with `MirroredStrategy`, and multi-worker training with `MultiWorkerMirroredStrategy`.\n", |
| 1133 | + "\n", |
| 1134 | + "Below are two examples, one for multi-worker training and one for single-worker training:" |
1134 | 1135 | ]
|
1135 | 1136 | },
|
1136 | 1137 | {
|
|
1141 | 1142 | },
|
1142 | 1143 | "outputs": [],
|
1143 | 1144 | "source": [
|
1144 |
| - "# Multi-worker training with MultiWorkerMirroredStrategy\n", |
1145 |
| - "# and the BackupAndRestore callback.\n", |
| 1145 | + "# Multi-worker training with `MultiWorkerMirroredStrategy`\n", |
| 1146 | + "# and the `BackupAndRestore` callback.\n", |
1146 | 1147 | "\n",
|
1147 |
| - "callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n", |
| 1148 | + "callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir='/tmp/backup')]\n", |
1148 | 1149 | "with strategy.scope():\n",
|
1149 | 1150 | " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
|
1150 | 1151 | "multi_worker_model.fit(multi_worker_dataset,\n",
|
|
1183 | 1184 | "colab": {
|
1184 | 1185 | "collapsed_sections": [],
|
1185 | 1186 | "name": "multi_worker_with_keras.ipynb",
|
1186 |
| - "toc_visible": true |
| 1187 | + "provenance": [] |
1187 | 1188 | },
|
1188 | 1189 | "kernelspec": {
|
1189 | 1190 | "display_name": "Python 3",
|
|