|
194 | 194 | "id": "fLW6D2TzvC-4"
|
195 | 195 | },
|
196 | 196 | "source": [
|
197 |
| - "Next, create an `mnist.py` file with a simple model and dataset setup. This Python file will be used by the worker-processes in this tutorial:" |
| 197 | + "Next, create an `mnist_setup.py` file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:" |
198 | 198 | ]
|
199 | 199 | },
|
200 | 200 | {
|
|
205 | 205 | },
|
206 | 206 | "outputs": [],
|
207 | 207 | "source": [
|
208 |
| - "%%writefile mnist.py\n", |
| 208 | + "%%writefile mnist_setup.py\n", |
209 | 209 | "\n",
|
210 | 210 | "import os\n",
|
211 | 211 | "import tensorflow as tf\n",
|
|
256 | 256 | },
|
257 | 257 | "outputs": [],
|
258 | 258 | "source": [
|
259 |
| - "import mnist\n", |
| 259 | + "import mnist_setup\n", |
260 | 260 | "\n",
|
261 | 261 | "batch_size = 64\n",
|
262 |
| - "single_worker_dataset = mnist.mnist_dataset(batch_size)\n", |
263 |
| - "single_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 262 | + "single_worker_dataset = mnist_setup.mnist_dataset(batch_size)\n", |
| 263 | + "single_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
264 | 264 | "single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)"
|
265 | 265 | ]
|
266 | 266 | },
|
|
439 | 439 | "\n",
|
440 | 440 | "This tutorial demonstrates how to perform synchronous multi-worker training using an instance of `tf.distribute.MultiWorkerMirroredStrategy`.\n",
|
441 | 441 | "\n",
|
442 |
| - "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The [`tf.distribute.Strategy` guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
| 442 | + "`MultiWorkerMirroredStrategy` creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The `tf.distribute.Strategy` [guide](../../guide/distributed_training.ipynb) has more details about this strategy." |
443 | 443 | ]
|
444 | 444 | },
|
445 | 445 | {
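For reference alongside the paragraph above, here is a minimal sketch (editorial, not part of this diff) of how the strategy is typically instantiated and inspected; it mirrors the cells shown further down in the notebook:

import tensorflow as tf

# Creating the strategy sets up the collective-communication machinery.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Each worker holds a replica of the model's variables; this reports how many
# replicas take part in the synchronous gradient aggregation.
print('Replicas in sync:', strategy.num_replicas_in_sync)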
|
|
459 | 459 | "id": "N0iv7SyyAohc"
|
460 | 460 | },
|
461 | 461 | "source": [
|
462 |
| - "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy()` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training." |
| 462 | + "Note: `TF_CONFIG` is parsed and TensorFlow's GRPC servers are started at the time `MultiWorkerMirroredStrategy` is called, so the `TF_CONFIG` environment variable must be set before a `tf.distribute.Strategy` instance is created. Since `TF_CONFIG` is not set yet, the above strategy is effectively single-worker training." |
463 | 463 | ]
|
464 | 464 | },
|
465 | 465 | {
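A small sketch of the ordering this note describes (editorial, not part of this diff): set `TF_CONFIG` first, then create the strategy. The two-worker cluster and ports below are illustrative only.

import json
import os
import tensorflow as tf

# TF_CONFIG must be in the environment *before* the strategy is constructed,
# because the strategy parses it and starts the gRPC servers at that point.
tf_config = {
    'cluster': {'worker': ['localhost:12345', 'localhost:23456']},
    'task': {'type': 'worker', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)

strategy = tf.distribute.MultiWorkerMirroredStrategy()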
|
|
468 | 468 | "id": "FMy2VM4Akzpr"
|
469 | 469 | },
|
470 | 470 | "source": [
|
471 |
| - "`MultiWorkerMirroredStrategy` provides multiple implementations via the [`CommunicationOptions`](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/CommunicationOptions) parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster." |
| 471 | + "`MultiWorkerMirroredStrategy` provides multiple implementations via the `tf.distribute.experimental.CommunicationOptions` parameter: 1) `RING` implements ring-based collectives using gRPC as the cross-host communication layer; 2) `NCCL` uses the [NVIDIA Collective Communication Library](https://developer.nvidia.com/nccl) to implement collectives; and 3) `AUTO` defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster." |
472 | 472 | ]
|
473 | 473 | },
|
474 | 474 | {
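As a sketch of choosing an implementation explicitly (editorial, not part of this diff; API names as documented for `tf.distribute.experimental.CommunicationOptions`, and whether `NCCL` is the right choice depends on the GPUs and interconnect, as the paragraph notes):

import tensorflow as tf

# Request NCCL-based collectives; RING or AUTO may suit other clusters better.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options)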
|
|
492 | 492 | "source": [
|
493 | 493 | "with strategy.scope():\n",
|
494 | 494 | " # Model building/compiling need to be within `strategy.scope()`.\n",
|
495 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()" |
| 495 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()" |
496 | 496 | ]
|
497 | 497 | },
|
498 | 498 | {
|
|
512 | 512 | "source": [
|
513 | 513 | "To actually run with `MultiWorkerMirroredStrategy` you'll need to run worker processes and pass a `TF_CONFIG` to them.\n",
|
514 | 514 | "\n",
|
515 |
| - "Like the `mnist.py` file written earlier, here is the `main.py` that each of the workers will run:" |
| 515 | + "Like the `mnist_setup.py` file written earlier, here is the `main.py` that each of the workers will run:" |
516 | 516 | ]
|
517 | 517 | },
|
518 | 518 | {
|
|
529 | 529 | "import json\n",
|
530 | 530 | "\n",
|
531 | 531 | "import tensorflow as tf\n",
|
532 |
| - "import mnist\n", |
| 532 | + "import mnist_setup\n", |
533 | 533 | "\n",
|
534 | 534 | "per_worker_batch_size = 64\n",
|
535 | 535 | "tf_config = json.loads(os.environ['TF_CONFIG'])\n",
|
|
538 | 538 | "strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
|
539 | 539 | "\n",
|
540 | 540 | "global_batch_size = per_worker_batch_size * num_workers\n",
|
541 |
| - "multi_worker_dataset = mnist.mnist_dataset(global_batch_size)\n", |
| 541 | + "multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)\n", |
542 | 542 | "\n",
|
543 | 543 | "with strategy.scope():\n",
|
544 | 544 | " # Model building/compiling need to be within `strategy.scope()`.\n",
|
545 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 545 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
546 | 546 | "\n",
|
547 | 547 | "\n",
|
548 | 548 | "multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)"
|
|
820 | 820 | "options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF\n",
|
821 | 821 | "\n",
|
822 | 822 | "global_batch_size = 64\n",
|
823 |
| - "multi_worker_dataset = mnist.mnist_dataset(batch_size=64)\n", |
| 823 | + "multi_worker_dataset = mnist_setup.mnist_dataset(batch_size=64)\n", |
824 | 824 | "dataset_no_auto_shard = multi_worker_dataset.with_options(options)"
|
825 | 825 | ]
|
826 | 826 | },
|
|
882 | 882 | "\n",
|
883 | 883 | "When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
|
884 | 884 | "\n",
|
885 |
| - "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, to also add the support to single worker training for a consistent experience, and removed fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new callback." |
| 885 | + "Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds the support to single-worker training for a consistent experience, and removed the fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback." |
886 | 886 | ]
|
887 | 887 | },
|
888 | 888 | {
|
|
1129 | 1129 | "\n",
|
1130 | 1130 | "The `BackupAndRestore` callback uses the `CheckpointManager` to save and restore the training state, which generates a file called checkpoint that tracks existing checkpoints together with the latest one. For this reason, `backup_dir` should not be re-used to store other checkpoints in order to avoid name collision.\n",
|
1131 | 1131 | "\n",
|
1132 |
| - "Currently, the `BackupAndRestore` callback supports single worker with no strategy, MirroredStrategy, and multi-worker with MultiWorkerMirroredStrategy.\n", |
1133 |
| - "Below are two examples for both multi-worker training and single worker training." |
| 1132 | + "Currently, the `BackupAndRestore` callback supports single-worker training with no strategy—`MirroredStrategy`—and multi-worker training with `MultiWorkerMirroredStrategy`.\n", |
| 1133 | + "\n", |
| 1134 | + "Below are two examples for both multi-worker training and single-worker training:" |
1134 | 1135 | ]
|
1135 | 1136 | },
|
1136 | 1137 | {
|
|
1141 | 1142 | },
|
1142 | 1143 | "outputs": [],
|
1143 | 1144 | "source": [
|
1144 |
| - "# Multi-worker training with MultiWorkerMirroredStrategy\n", |
1145 |
| - "# and the BackupAndRestore callback.\n", |
| 1145 | + "# Multi-worker training with `MultiWorkerMirroredStrategy`\n", |
| 1146 | + "# and the `BackupAndRestore` callback.\n", |
1146 | 1147 | "\n",
|
1147 | 1148 | "callbacks = [tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')]\n",
|
1148 | 1149 | "with strategy.scope():\n",
|
1149 |
| - " multi_worker_model = mnist.build_and_compile_cnn_model()\n", |
| 1150 | + " multi_worker_model = mnist_setup.build_and_compile_cnn_model()\n", |
1150 | 1151 | "multi_worker_model.fit(multi_worker_dataset,\n",
|
1151 | 1152 | " epochs=3,\n",
|
1152 | 1153 | " steps_per_epoch=70,\n",
|
|