Update Multi-worker training with Keras

8bitmp3 · web-flow · commit 789a89c0a1ff · 2022-02-03T01:11:42.000-08:00
diff --git a/site/en/tutorials/distribute/multi_worker_with_keras.ipynb b/site/en/tutorials/distribute/multi_worker_with_keras.ipynb
@@ -194,7 +194,7 @@
         "id": "fLW6D2TzvC-4"
       },
       "source": [
-        "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:"
+        "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:"
       ]
     },
     {
@@ -439,7 +439,7 @@
         "\n",
         "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
         "\n",
-        "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync.  The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy."
+        "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy."
       ]
     },
     {
@@ -882,7 +882,7 @@
         "\n",
         "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
         "\n",
-        "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
+        "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team is introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds support for single-worker training for a consistent experience, and has removed the fault tolerance functionality from the existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback."
       ]
     },
     {
@@ -1129,8 +1129,9 @@
         "\n",
         "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
         "\n",
-        "Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n",
-        "Below are two examples for both multi-worker training and single worker training."
+        "Currently, the `BackupAndRestore` callback supports single-worker training with no strategy or with `MirroredStrategy`, and multi-worker training with `MultiWorkerMirroredStrategy`.\n",
+        "\n",
+        "Below are two examples, one for multi-worker training and one for single-worker training:"
       ]
     },
     {
@@ -1141,10 +1142,10 @@
       },
       "outputs": [],
       "source": [
-        "# Multi-worker training with MultiWorkerMirroredStrategy\n",
-        "# and the BackupAndRestore callback.\n",
+        "# Multi-worker training with `MultiWorkerMirroredStrategy`\n",
+        "# and the `BackupAndRestore` callback.\n",
         "\n",
-        "callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
+        "callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir='/tmp/backup')]\n",
         "with strategy.scope():\n",
         "  multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n",
         "multi_worker_model.fit(multi_worker_dataset,\n",
@@ -1183,7 +1184,7 @@
     "colab": {
       "collapsed_sections": [],
       "name": "multi_worker_with_keras.ipynb",
-      "toc_visible": true
+      "provenance": []
     },
     "kernelspec": {
       "display_name": "Python 3",