76 | 76 | "\n",
77 | 77 | "You will use the `tf.keras` APIs to build the model and `Model.fit` for training it. (To learn about distributed training with a custom training loop and the `MirroredStrategy`, check out [this tutorial](custom_training.ipynb).)\n",
78 | 78 | "\n",
79 |    | - "`MirroredStrategy` trains your model on multiple GPUs on a single machine. For _synchronous training on many GPUs on multiple workers_, use the `tf.distribute.MultiWorkerMirroredStrategy` [with the Keras Model.fit](multi_worker_with_keras.ipynb) or [a custom training loop](multi_worker_with_ctl.ipynb). For other options, refer to the [Distributed training guide](../../guide/distributed_training.ipynb).\n",
   | 79 | + "`MirroredStrategy` trains your model on multiple GPUs on a single machine. For _synchronous training on many GPUs on multiple workers_, use the `tf.distribute.MultiWorkerMirroredStrategy` with the [Keras Model.fit](multi_worker_with_keras.ipynb) or [a custom training loop](multi_worker_with_ctl.ipynb). For other options, refer to the [Distributed training guide](../../guide/distributed_training.ipynb).\n",
80 | 80 | "\n",
81 | 81 | "To learn about various other strategies, there is the [Distributed training with TensorFlow](../../guide/distributed_training.ipynb) guide."
82 | 82 | ]

289 | 289 | "id": "1BnQYQTpB3YA"
290 | 290 | },
291 | 291 | "source": [
292 |     | - "Create and compile the Keras model in the context of `Strategy.scope`:"
    | 292 | + "Within the context of `Strategy.scope`, create and compile the model using the Keras API:"
293 | 293 | ]
294 | 294 | },
295 | 295 | {

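The cell edited here creates and compiles the model inside `strategy.scope()`; a minimal sketch of that pattern (the layer sizes and optimizer below are illustrative stand-ins, not taken from this diff):

```python
import tensorflow as tf

# Create the distribution strategy. With a single device available,
# MirroredStrategy still works and simply uses that one device.
strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across the replicas.
with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy'])
```

Only variable creation and `compile` need to happen under the scope; the later `fit` call is written as usual.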
329 | 329 | "id": "YOXO5nvvK3US"
330 | 330 | },
331 | 331 | "source": [
332 |     | - "Define the following `tf.keras.callbacks`:\n",
    | 332 | + "Define the following [Keras Callbacks](https://www.tensorflow.org/guide/keras/train_and_evaluate):\n",
333 | 333 | "\n",
334 | 334 | "- `tf.keras.callbacks.TensorBoard`: writes a log for TensorBoard, which allows you to visualize the graphs.\n",
335 | 335 | "- `tf.keras.callbacks.ModelCheckpoint`: saves the model at a certain frequency, such as after every epoch.\n",
    | 336 | + "- `tf.keras.callbacks.BackupAndRestore`: provides the fault tolerance functionality by backing up the model and the current epoch number. Learn more in the _Fault tolerance_ section of the [Multi-worker training with Keras](multi_worker_with_keras.ipynb) tutorial.\n",
336 | 337 | "- `tf.keras.callbacks.LearningRateScheduler`: schedules the learning rate to change after, for example, every epoch/batch.\n",
337 | 338 | "\n",
338 |     | - "For illustrative purposes, add a custom callback called `PrintLR` to display the *learning rate* in the notebook."
    | 339 | + "For illustrative purposes, add a [custom callback](https://www.tensorflow.org/guide/keras/custom_callback) called `PrintLR` to display the *learning rate* in the notebook.\n",
    | 340 | + "\n",
    | 341 | + "**Note:** Use the `BackupAndRestore` callback instead of `ModelCheckpoint` as the main mechanism to restore the training state upon a restart from a job failure. Since `BackupAndRestore` only supports eager mode, consider using `ModelCheckpoint` in graph mode."
339 | 342 | ]
340 | 343 | },
341 | 344 | {

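The `LearningRateScheduler` entry above takes a schedule function; a hypothetical step-decay schedule (the epoch boundaries and rates here are illustrative, not from this diff) would look like:

```python
# A step-decay schedule: LearningRateScheduler calls this once per epoch,
# passing the epoch index, and uses the returned value as the new rate.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch < 7:
    return 1e-4
  else:
    return 1e-5
```

It would then be registered as `tf.keras.callbacks.LearningRateScheduler(decay)` alongside the other callbacks.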
382 | 385 | "# Define a callback for printing the learning rate at the end of each epoch.\n",
383 | 386 | "class PrintLR(tf.keras.callbacks.Callback):\n",
384 | 387 | "  def on_epoch_end(self, epoch, logs=None):\n",
385 |     | - "    print('\\nLearning rate for epoch {} is {}'.format(epoch + 1,\n",
386 |     | - "                                                      model.optimizer.lr.numpy()))"
    | 388 | + "    print('\\nLearning rate for epoch {} is {}'.format(\n",
    | 389 | + "        epoch + 1, model.optimizer.lr.numpy()))"
387 | 390 | ]
388 | 391 | },
389 | 392 | {

419 | 422 | "id": "6EophnOAB3YD"
420 | 423 | },
421 | 424 | "source": [
422 |     | - "Now, train the model in the usual way by calling `Model.fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not."
    | 425 | + "Now, train the model in the usual way by calling Keras `Model.fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not."
423 | 426 | ]
424 | 427 | },
425 | 428 | {

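The cell this hunk touches calls `Model.fit` as usual; a self-contained sketch with synthetic data (the model and data below are stand-ins for the tutorial's, chosen only to make the snippet runnable):

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer='sgd', loss='mse')

# Synthetic stand-in for the dataset created earlier in the tutorial.
x = np.random.rand(64, 4).astype('float32')
y = np.random.rand(64, 1).astype('float32')

# The fit call is written identically with or without a strategy;
# the strategy transparently shards each batch across replicas.
history = model.fit(x, y, epochs=2, batch_size=16, verbose=0)
```

`history.history['loss']` then holds one entry per epoch, exactly as in single-device training.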
535 | 538 | "id": "Xa87y_A0vRma"
536 | 539 | },
537 | 540 | "source": [
538 |     | - "Export the graph and the variables to the platform-agnostic SavedModel format using `Model.save`. After your model is saved, you can load it with or without the `Strategy.scope`."
    | 541 | + "Export the graph and the variables to the platform-agnostic SavedModel format using Keras `Model.save`. After your model is saved, you can load it with or without the `Strategy.scope`."
539 | 542 | ]
540 | 543 | },
541 | 544 | {

626 | 629 | "\n",
627 | 630 | "More examples that use different distribution strategies with the Keras `Model.fit` API:\n",
628 | 631 | "\n",
629 |     | - "1. The [Solve GLUE tasks using BERT on TPU](https://www.tensorflow.org/text/tutorials/bert_glue) tutorial uses `tf.distribute.MirroredStrategy` for training on GPUs and `tf.distribute.TPUStrategy`—on TPUs.\n",
    | 632 | + "1. The [Solve GLUE tasks using BERT on TPU](https://www.tensorflow.org/text/tutorials/bert_glue) tutorial uses `tf.distribute.MirroredStrategy` for training on GPUs and `tf.distribute.TPUStrategy` on TPUs.\n",
630 | 633 | "1. The [Save and load a model using a distribution strategy](save_and_load.ipynb) tutorial demonstrates how to use the SavedModel APIs with `tf.distribute.Strategy`.\n",
631 | 634 | "1. The [official TensorFlow models](https://github.com/tensorflow/models/tree/master/official) can be configured to run multiple distribution strategies.\n",
632 | 635 | "\n",
