
Commit 44a3c7b

hubingalli authored and copybara-github committed
Review and update public doc "Custom training with tf.distribute.Strategy"
PiperOrigin-RevId: 440216715
1 parent 4ab7c5e commit 44a3c7b

File tree

1 file changed (+29 / -37 lines)


site/en/tutorials/distribute/custom_training.ipynb

Lines changed: 29 additions & 37 deletions
@@ -68,9 +68,9 @@
  "id": "FbVhjPpzn6BM"
 },
 "source": [
- "This tutorial demonstrates how to use [`tf.distribute.Strategy`](https://www.tensorflow.org/guide/distributed_training) with custom training loops. We will train a simple CNN model on the fashion MNIST dataset. The fashion MNIST dataset contains 60000 train images of size 28 x 28 and 10000 test images of size 28 x 28.\n",
+ "This tutorial demonstrates how to use `tf.distribute.Strategy` — a TensorFlow API that provides an abstraction for [distributing your training](../../guide/distributed_training.ipynb) across multiple processing units (GPUs, multiple machines, or TPUs) — with custom training loops. In this example, you will train a simple convolutional neural network on the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) containing 70,000 images of size 28 x 28.\n",
  "\n",
- "We are using custom training loops to train our model because they give us flexibility and a greater control on training. Moreover, it is easier to debug the model and the training loop."
+ "[Custom training loops](../customization/custom_training_walkthrough.ipynb) provide flexibility and greater control over training. They also make it easier to debug the model and the training loop."
 ]
},
{
@@ -97,7 +97,7 @@
 "id": "MM6W__qraV55"
},
"source": [
- "## Download the fashion MNIST dataset"
+ "## Download the Fashion MNIST dataset"
]
},
{
@@ -112,14 +112,14 @@
"\n",
"(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()\n",
"\n",
- "# Adding a dimension to the array -> new shape == (28, 28, 1)\n",
- "# We are doing this because the first layer in our model is a convolutional\n",
+ "# Add a dimension to the array -> new shape == (28, 28, 1)\n",
+ "# This is done because the first layer in our model is a convolutional\n",
"# layer and it requires a 4D input (batch_size, height, width, channels).\n",
"# batch_size dimension will be added later on.\n",
"train_images = train_images[..., None]\n",
"test_images = test_images[..., None]\n",
"\n",
- "# Getting the images in [0, 1] range.\n",
+ "# Scale the images to the [0, 1] range.\n",
"train_images = train_images / np.float32(255)\n",
"test_images = test_images / np.float32(255)"
]
@@ -141,13 +141,13 @@
"source": [
"How does `tf.distribute.MirroredStrategy` strategy work?\n",
"\n",
- "* All the variables and the model graph is replicated on the replicas.\n",
+ "* All the variables and the model graph are replicated across the replicas.\n",
"* Input is evenly distributed across the replicas.\n",
"* Each replica calculates the loss and gradients for the input it received.\n",
"* The gradients are synced across all the replicas by summing them.\n",
"* After the sync, the same update is made to the copies of the variables on each replica.\n",
"\n",
- "Note: You can put all the code below inside a single scope. We are dividing it into several code cells for illustration purposes.\n"
+ "Note: You can put all the code below inside a single scope. This example divides it into several code cells for illustration purposes.\n"
]
},
{
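To make the first bullet concrete, here is a small standalone sketch (not part of the notebook) showing that a variable created inside the strategy's scope gets one copy per replica; device auto-detection is assumed:

```python
import tensorflow as tf

# Hypothetical standalone snippet; devices are auto-detected.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

with strategy.scope():
  # Variables created inside the scope are replicated on every replica,
  # and the same update is applied to each copy after every sync.
  v = tf.Variable(1.0)

# On a multi-GPU machine this prints a MirroredVariable with one copy per GPU;
# on a single device it still runs, with a single replica.
print(v)
```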
@@ -158,8 +158,8 @@
},
"outputs": [],
"source": [
- "# If the list of devices is not specified in the\n",
- "# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.\n",
+ "# If the list of devices is not specified in the\n",
+ "# `tf.distribute.MirroredStrategy` constructor, they will be auto-detected.\n",
"strategy = tf.distribute.MirroredStrategy()"
]
},
@@ -171,7 +171,7 @@
},
"outputs": [],
"source": [
- "print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))"
+ "print('Number of devices: {}'.format(strategy.num_replicas_in_sync))"
]
},
{
@@ -183,15 +183,6 @@
"## Setup input pipeline"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "0Qb6nDgxiN_n"
- },
- "source": [
- "Export the graph and the variables to the platform-agnostic SavedModel format. After your model is saved, you can load it with or without the scope."
- ]
- },
{
"cell_type": "code",
"execution_count": null,
@@ -240,7 +231,7 @@
"source": [
"## Create the model\n",
"\n",
- "Create a model using `tf.keras.Sequential`. You can also use the Model Subclassing API to do this."
+ "Create a model using `tf.keras.Sequential`. You can also use the [Model Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) or the [functional API](https://www.tensorflow.org/guide/keras/functional) to do this."
]
},
{
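As a point of comparison, a hedged sketch of a similar small CNN written with the functional API could look like the following; the layer sizes are illustrative and not necessarily identical to the tutorial's model:

```python
import tensorflow as tf

def create_functional_model():
  # Illustrative layer sizes; adjust to match the tutorial's Sequential model.
  inputs = tf.keras.Input(shape=(28, 28, 1))
  x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inputs)
  x = tf.keras.layers.MaxPooling2D()(x)
  x = tf.keras.layers.Conv2D(64, 3, activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D()(x)
  x = tf.keras.layers.Flatten()(x)
  x = tf.keras.layers.Dense(64, activation='relu')(x)
  outputs = tf.keras.layers.Dense(10)(x)
  return tf.keras.Model(inputs=inputs, outputs=outputs)

model = create_functional_model()
model.summary()
```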
@@ -286,14 +277,14 @@
"source": [
"## Define the loss function\n",
"\n",
- "Normally, on a single machine with 1 GPU/CPU, loss is divided by the number of examples in the batch of input.\n",
+ "Normally, on a single machine with a single GPU/CPU, the loss is divided by the number of examples in the batch of input.\n",
"\n",
"*So, how should the loss be calculated when using a `tf.distribute.Strategy`?*\n",
"\n",
"* For an example, let's say you have 4 GPU's and a batch size of 64. One batch of input is distributed\n",
"across the replicas (4 GPUs), each replica getting an input of size 16.\n",
"\n",
- "* The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (BATCH_SIZE_PER_REPLICA = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64)."
+ "* The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (`BATCH_SIZE_PER_REPLICA` = 16), the loss should be divided by the `GLOBAL_BATCH_SIZE` (64)."
]
},
{
@@ -315,10 +306,10 @@
"source": [
"*How to do this in TensorFlow?*\n",
"\n",
- "* If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the GLOBAL_BATCH_SIZE: \n",
+ "* If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the `GLOBAL_BATCH_SIZE`: \n",
"`scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE)`\n",
"or you can use `tf.nn.compute_average_loss` which takes the per example loss,\n",
- "optional sample weights, and GLOBAL_BATCH_SIZE as arguments and returns the scaled loss.\n",
+ "optional sample weights, and `GLOBAL_BATCH_SIZE` as arguments and returns the scaled loss.\n",
"\n",
"* If you are using regularization losses in your model then you need to scale\n",
"the loss value by number of replicas. You can do this by using the `tf.nn.scale_regularization_loss` function.\n",
@@ -351,7 +342,7 @@
"outputs": [],
"source": [
"with strategy.scope():\n",
- " # Set reduction to `none` so we can do the reduction afterwards and divide by\n",
+ " # Set reduction to `NONE` so you can do the reduction afterwards and divide by\n",
" # global batch size.\n",
" loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n",
" from_logits=True,\n",
@@ -484,9 +475,9 @@
"\n",
" template = (\"Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, \"\n",
" \"Test Accuracy: {}\")\n",
- " print (template.format(epoch+1, train_loss,\n",
- " train_accuracy.result()*100, test_loss.result(),\n",
- " test_accuracy.result()*100))\n",
+ " print(template.format(epoch + 1, train_loss,\n",
+ " train_accuracy.result() * 100, test_loss.result(),\n",
+ " test_accuracy.result() * 100))\n",
"\n",
" test_loss.reset_states()\n",
" train_accuracy.reset_states()\n",
@@ -501,7 +492,7 @@
"source": [
"Things to note in the example above:\n",
"\n",
- "* We are iterating over the `train_dist_dataset` and `test_dist_dataset` using a `for x in ...` construct.\n",
+ "* Iterate over the `train_dist_dataset` and `test_dist_dataset` using a `for x in ...` construct.\n",
"* The scaled loss is the return value of the `distributed_train_step`. This value is aggregated across replicas using the `tf.distribute.Strategy.reduce` call and then across batches by summing the return value of the `tf.distribute.Strategy.reduce` calls.\n",
"* `tf.keras.Metrics` should be updated inside `train_step` and `test_step` that gets executed by `tf.distribute.Strategy.run`.\n",
"*`tf.distribute.Strategy.run` returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can do `tf.distribute.Strategy.reduce` to get an aggregated value. You can also do `tf.distribute.Strategy.experimental_local_results` to get the list of values contained in the result, one per local replica.\n"
@@ -570,8 +561,8 @@
"for images, labels in test_dataset:\n",
" eval_step(images, labels)\n",
"\n",
- "print ('Accuracy after restoring the saved model without strategy: {}'.format(\n",
- " eval_accuracy.result()*100))"
+ "print('Accuracy after restoring the saved model without strategy: {}'.format(\n",
+ " eval_accuracy.result() * 100))"
]
},
{
@@ -606,7 +597,7 @@
" average_train_loss = total_loss / num_batches\n",
"\n",
" template = (\"Epoch {}, Loss: {}, Accuracy: {}\")\n",
- " print (template.format(epoch+1, average_train_loss, train_accuracy.result()*100))\n",
+ " print(template.format(epoch + 1, average_train_loss, train_accuracy.result() * 100))\n",
" train_accuracy.reset_states()"
]
},
@@ -617,7 +608,7 @@
},
"source": [
"### Iterating inside a tf.function\n",
- "You can also iterate over the entire input `train_dist_dataset` inside a tf.function using the `for x in ...` construct or by creating iterators like we did above. The example below demonstrates wrapping one epoch of training in a tf.function and iterating over `train_dist_dataset` inside the function."
+ "You can also iterate over the entire input `train_dist_dataset` inside a `tf.function` using the `for x in ...` construct or by creating iterators like you did above. The example below demonstrates wrapping one epoch of training with a `@tf.function` decorator and iterating over `train_dist_dataset` inside the function."
]
},
{
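For context when reading the next hunk, here is a hedged sketch of the kind of epoch-level function being described, assuming `strategy`, `train_step`, and `train_dist_dataset` are defined as earlier in the tutorial:

```python
@tf.function
def distributed_train_epoch(dataset):
  total_loss = 0.0
  num_batches = 0
  # The `for x in ...` loop over the distributed dataset runs inside the tf.function.
  for x in dataset:
    per_replica_losses = strategy.run(train_step, args=(x,))
    total_loss += strategy.reduce(tf.distribute.ReduceOp.SUM,
                                  per_replica_losses, axis=None)
    num_batches += 1
  return total_loss / tf.cast(num_batches, dtype=tf.float32)

# train_loss = distributed_train_epoch(train_dist_dataset)
```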
@@ -643,7 +634,7 @@
" train_loss = distributed_train_epoch(train_dist_dataset)\n",
"\n",
" template = (\"Epoch {}, Loss: {}, Accuracy: {}\")\n",
- " print (template.format(epoch+1, train_loss, train_accuracy.result()*100))\n",
+ " print(template.format(epoch + 1, train_loss, train_accuracy.result() * 100))\n",
"\n",
" train_accuracy.reset_states()"
]
@@ -658,7 +649,7 @@
"\n",
"Note: As a general rule, you should use `tf.keras.Metrics` to track per-sample values and avoid values that have been aggregated within a replica.\n",
"\n",
- "We do *not* recommend using `tf.metrics.Mean` to track the training loss across different replicas, because of the loss scaling computation that is carried out.\n",
+ "Because of the loss scaling computation that is carried out, it's not recommended to use `tf.metrics.Mean` to track the training loss across different replicas.\n",
"\n",
"For example, if you run a training job with the following characteristics:\n",
"* Two replicas\n",
@@ -699,7 +690,8 @@
"## Next steps\n",
"\n",
"* Try out the new `tf.distribute.Strategy` API on your models.\n",
- "* Visit the [Performance section](../../guide/function.ipynb) in the guide to learn more about other strategies and [tools](../../guide/profiler.md) you can use to optimize the performance of your TensorFlow models."
+ "* Visit the [Better performance with tf.function](../../guide/function.ipynb) and [TensorFlow Profiler](../../guide/profiler.md) guides to learn more about tools to optimize the performance of your TensorFlow models.\n",
+ "* The [Distributed training in TensorFlow](../../guide/distributed_training.ipynb) guide provides an overview of the available distribution strategies."
]
}
],
