|
306 | 306 | "source": [
|
307 | 307 | "*How to do this in TensorFlow?*\n",
|
308 | 308 | "\n",
|
| 309 | + "* Loss reduction and scaling is done automatically in Keras `Model.compile` and `Model.fit`\n", |
| 310 | + "\n", |
309 | 311 | "* If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the `GLOBAL_BATCH_SIZE`: \n",
|
310 | 312 | "`scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE)`\n",
|
311 | 313 | "or you can use `tf.nn.compute_average_loss` which takes the per example loss,\n",
|
312 | 314 | "optional sample weights, and `GLOBAL_BATCH_SIZE` as arguments and returns the scaled loss.\n",
|
313 | 315 | "\n",
|
314 | | - "* If you are using regularization losses in your model then you need to scale\n", |
315 | | - "the loss value by the number of replicas. You can do this by using the `tf.nn.scale_regularization_loss` function.\n", |
316 | | - "\n", |
317 | | - "* Using `tf.reduce_mean` is not recommended. Doing so divides the loss by actual per replica batch size which may vary step to step.\n", |
| 316 | + "* If you're writing a custom training loop for a model with a non-empty list of `Model.losses` (e.g., weight regularizers), you should sum them up and divide the sum by the number of replicas. You can do this by using the `tf.nn.scale_regularization_loss` function.\n", |
318 | 317 | "\n",
|
319 | | - "* This reduction and scaling is done automatically in Keras `Model.compile` and `Model.fit`\n", |
| 318 | + "* Be careful about batches that are shorter than the `GLOBAL_BATCH_SIZE`, if your training data allows them: Dividing the prediction loss by `GLOBAL_BATCH_SIZE` (instead of using `tf.reduce_mean` over the actual batch size) avoids overweighting examples from short batches. However, this does not apply to regularization losses.\n", |
320 | 319 | "\n",
|
321 | 320 | "* If using `tf.keras.losses` classes (as in the example below), the loss reduction needs to be explicitly specified to be one of `NONE` or `SUM`. `AUTO` and `SUM_OVER_BATCH_SIZE` are disallowed when used with `tf.distribute.Strategy`. `AUTO` is disallowed because the user should explicitly think about what reduction they want to make sure it is correct in the distributed case. `SUM_OVER_BATCH_SIZE` is disallowed because currently it would only divide by per replica batch size, and leave the dividing by number of replicas to the user, which might be easy to miss. So, instead, you need to do the reduction yourself explicitly.\n",
|
322 | 321 | "* If `labels` is multi-dimensional, then average the `per_example_loss` across the number of elements in each sample. For example, if the shape of `predictions` is `(batch_size, H, W, n_classes)` and `labels` is `(batch_size, H, W)`, you will need to update `per_example_loss` like: `per_example_loss /= tf.cast(tf.reduce_prod(tf.shape(labels)[1:]), tf.float32)`\n",
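To make the scaling rules above concrete, here is a minimal self-contained sketch (the `GLOBAL_BATCH_SIZE` of 4 and the loss values are made up purely for illustration). It shows that `tf.nn.compute_average_loss` with an explicit `global_batch_size` matches the manual sum-and-divide, and why `tf.reduce_mean` would overweight a batch that comes up short:

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 4  # illustrative value only

# Pretend the last batch on this replica is short: only 3 examples.
per_example_loss = tf.constant([1.0, 2.0, 3.0])

# Manual scaling: sum, then divide by the *global* batch size.
manual = tf.reduce_sum(per_example_loss) / GLOBAL_BATCH_SIZE              # 1.5

# The helper performs the same division by the global batch size.
helper = tf.nn.compute_average_loss(per_example_loss,
                                    global_batch_size=GLOBAL_BATCH_SIZE)  # 1.5

# `tf.reduce_mean` divides by the actual batch size (3 here), so examples in
# this short batch would count more than examples from full batches.
naive = tf.reduce_mean(per_example_loss)                                  # 2.0

print(manual.numpy(), helper.numpy(), naive.numpy())
```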
|
|
347 | 346 | " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n",
|
348 | 347 | " from_logits=True,\n",
|
349 | 348 | " reduction=tf.keras.losses.Reduction.NONE)\n",
|
350 | | - "  def compute_loss(labels, predictions):\n", |
| 349 | + " def compute_loss(labels, predictions, model_losses):\n", |
351 | 350 | " per_example_loss = loss_object(labels, predictions)\n",
|
352 | | - "    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)" |
| 351 | + " loss = tf.nn.compute_average_loss(per_example_loss,\n", |
| 352 | + " global_batch_size=GLOBAL_BATCH_SIZE)\n", |
| 353 | + " if model_losses:\n", |
| 354 | + " loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))\n", |
| 355 | + " return loss" |
353 | 356 | ]
|
354 | 357 | },
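For the multi-dimensional `labels` case in the last bullet above, here is a hedged sketch of how the `compute_loss` from the cell above could be adapted. It reuses the `loss_object` and `GLOBAL_BATCH_SIZE` defined in that cell; the `(batch_size, H, W)` shapes are assumptions made only for this illustration:

```python
def compute_loss(labels, predictions, model_losses):
  # Assumed shapes for this sketch: predictions (batch_size, H, W, n_classes),
  # labels (batch_size, H, W); per_example_loss then has shape (batch_size, H, W).
  per_example_loss = loss_object(labels, predictions)
  # Divide by the number of elements per sample (H * W), as the bullet above
  # suggests, so each example contributes its mean loss before global scaling.
  per_example_loss /= tf.cast(tf.reduce_prod(tf.shape(labels)[1:]), tf.float32)
  loss = tf.nn.compute_average_loss(per_example_loss,
                                    global_batch_size=GLOBAL_BATCH_SIZE)
  if model_losses:
    loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
  return loss
```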
|
355 | 358 | {
|
|
419 | 422 | "\n",
|
420 | 423 | " with tf.GradientTape() as tape:\n",
|
421 | 424 | " predictions = model(images, training=True)\n",
|
422 | | - "    loss = compute_loss(labels, predictions)\n", |
| 425 | + " loss = compute_loss(labels, predictions, model.losses)\n", |
423 | 426 | "\n",
|
424 | 427 | " gradients = tape.gradient(loss, model.trainable_variables)\n",
|
425 | 428 | " optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n",
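Each replica's `train_step` already divides its loss by `GLOBAL_BATCH_SIZE`, so the per-replica results only need to be summed, not averaged, across replicas. Below is a sketch of the usual driver pattern; the `strategy` object (a `tf.distribute.Strategy` created earlier in the tutorial) and the assumption that `train_step` returns the scaled loss are not part of the diff shown here:

```python
@tf.function
def distributed_train_step(dataset_inputs):
  # Run one training step on every replica in parallel.
  per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
  # SUM, not MEAN: each per-replica loss was already divided by
  # GLOBAL_BATCH_SIZE inside compute_loss, so summing completes the average.
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                         axis=None)
```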
|
|