Commit 789a89c

authored
Update Multi-worker training with Keras
1 parent 2d9aa29 commit 789a89c

1 file changed

+10
-9
lines changed

site/en/tutorials/distribute/multi_worker_with_keras.ipynb

Lines changed: 10 additions & 9 deletions
@@ -194,7 +194,7 @@
 "id": "fLW6D2TzvC-4"
 },
 "source": [
-"Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:"
+"Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:"
 ]
 },
 {
@@ -439,7 +439,7 @@
 "\n",
 "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
 "\n",
-"`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy."
+"`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy."
 ]
 },
 {
@@ -882,7 +882,7 @@
 "\n",
 "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
 "\n",
-"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
+"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single-worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
 ]
 },
 {
@@ -1129,8 +1129,9 @@
 "\n",
 "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
 "\n",
-"Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n",
-"Below are two examples for both multi-worker training and single worker training."
+"Currently, the `BackupAndRestore` callback supports single-worker training with no strategy—`MirroredStrategy`—and multi-worker training with `MultiWorkerMirroredStrategy`.\n",
+"\n",
+"Below are two examples for both multi-worker training and single-worker training:"
 ]
 },
 {
@@ -1141,10 +1142,10 @@
 },
 "outputs": [],
 "source": [
-"# Multi-worker training with MultiWorkerMirroredStrategy\n",
-"# and the BackupAndRestore callback.\n",
+"# Multi-worker training with `MultiWorkerMirroredStrategy`\n",
+"# and the `BackupAndRestore` callback.\n",
 "\n",
-"callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
+"callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir='/tmp/backup')]\n",
 "with strategy.scope():\n",
 " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
 "multi_worker_model.fit(multi_worker_dataset,\n",
@@ -1183,7 +1184,7 @@
 "colab": {
 "collapsed_sections": [],
 "name": "multi_worker_with_keras.ipynb",
-"toc_visible": true
+"provenance": []
 },
 "kernelspec": {
 "display_name": "Python 3",
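
Note on the `BackupAndRestore` text touched in the hunk at line 1129: the save-and-restore behavior it describes can be illustrated with a small, dependency-free sketch. This is not the Keras implementation and does not use `CheckpointManager`. The function `train_with_backup`, the `state.json` file name, and the `fail_at_epoch` parameter are hypothetical names chosen for illustration. The sketch assumes progress is recorded once per epoch and that the backup is discarded once training completes.

```python
import json
import os
import tempfile


def train_with_backup(backup_dir, total_epochs, fail_at_epoch=None):
    """Toy training loop that resumes from the last backed-up epoch.

    Hypothetical helper for illustration only; not a TensorFlow API.
    """
    os.makedirs(backup_dir, exist_ok=True)
    state_path = os.path.join(backup_dir, "state.json")

    # Restore: if a backup exists, resume after the last completed epoch.
    start_epoch = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start_epoch = json.load(f)["completed_epochs"]

    epochs_run = []
    for epoch in range(start_epoch, total_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise RuntimeError(f"simulated worker failure at epoch {epoch}")
        epochs_run.append(epoch)  # stand-in for one epoch of real training
        # Back up: record progress at the end of every epoch.
        with open(state_path, "w") as f:
            json.dump({"completed_epochs": epoch + 1}, f)

    # Training finished, so the backup is no longer needed.
    if os.path.exists(state_path):
        os.remove(state_path)
    return epochs_run


# The first run fails partway through; the second run picks up where it stopped.
backup_dir = tempfile.mkdtemp()
try:
    train_with_backup(backup_dir, total_epochs=5, fail_at_epoch=3)
except RuntimeError:
    pass
resumed = train_with_backup(backup_dir, total_epochs=5)
```

Because the sketch keeps its state under a fixed file name inside `backup_dir`, writing unrelated files with the same name into that directory would corrupt the resume point, which mirrors the tutorial's advice not to reuse `backup_dir` for other checkpoints.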
