Commit 0d6bb26

Yuefeng Zhou authored and copybara-github committed
Update multi-worker Keras tutorial: add more details around sharding and steps_per_epoch; add a section for evaluation.
PiperOrigin-RevId: 314266301
1 parent 9af3e7c commit 0d6bb26

File tree

1 file changed (+27, -2)


site/en/tutorials/distribute/multi_worker_with_keras.ipynb

Lines changed: 27 additions & 2 deletions
@@ -301,7 +301,7 @@
301301
"\n",
302302
"Note: In this Colab, the following code can run with the expected result, but this is effectively single-worker training since `TF_CONFIG` is not set. Once you set `TF_CONFIG` in your own example, you should expect a speed-up from training on multiple machines.\n",
303303
"\n",
304-
"Note: Always pass in `steps_per_epoch` argument to `model.fit()` since `MultiWorkerMirroredStrategy` does not support last partial batch handling. When using `steps_per_epoch`, `model.fit()` does not create a new iterator from the input every epoch, but continues from wherever the last epoch ended. Hence, make sure to call `.repeat()` on the dataset so it has an adequate number of examples for N epochs."
304+
"Note: Always pass the `steps_per_epoch` argument to `model.fit()` since `MultiWorkerMirroredStrategy` does not support last partial batch handling. When `steps_per_epoch` is used, `model.fit()` does not create a new iterator from the input every epoch, but continues from wherever the last epoch ended. Hence, make sure to call `.repeat()` on the dataset so it has an adequate number of examples for N epochs. If your dataset is not a repeated dataset, `steps_per_epoch` should be set based on the amount of training data on each worker so that all workers perform the same number of steps of training or evaluation, which is required by allreduce. In particular, if the sharding is not balanced, `steps_per_epoch` should be set to the size of the smallest shard divided by the per-worker batch size."
305305
]
306306
},
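As a plain-Python sketch of the rule in the note above (the shard sizes and batch size here are made-up numbers for illustration, not values from the tutorial):

```python
# Hypothetical per-worker shard sizes after file-level sharding
# (illustrative numbers only).
shard_sizes = [1000, 800, 950]
per_worker_batch_size = 64

# Allreduce requires every worker to run the same number of steps,
# so pick steps_per_epoch from the smallest shard.
steps_per_epoch = min(shard_sizes) // per_worker_batch_size
print(steps_per_epoch)  # 800 // 64 = 12
```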
307307
{
@@ -340,7 +340,7 @@
340340
},
341341
"source": [
342342
"### Dataset sharding and batch size\n",
343-
"In multi-worker training, sharding data into multiple parts is needed to ensure convergence and performance. However, note that in above code snippet, the datasets are directly sent to `model.fit()` without needing to shard; this is because `tf.distribute.Strategy` API takes care of the dataset sharding automatically in multi-worker trainings.\n",
343+
"In multi-worker training, sharding data into multiple parts is needed to ensure convergence and performance. However, note that in the above code snippet, the datasets are passed directly to `model.fit()` without needing to be sharded; this is because the `tf.distribute.Strategy` API takes care of dataset sharding automatically in multi-worker training. It shards the dataset at the file level, which may create skewed shards. In the extreme case where there is only one file, only the first shard (i.e. worker) will get training or evaluation data, and as a result all workers will get errors.\n",
344344
"\n",
345345
"If you prefer manual sharding for your training, automatic sharding can be turned off via the `tf.data.experimental.DistributeOptions` API. Concretely,"
346346
]
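For intuition on what manual sharding means here: `tf.data.Dataset.shard(num_shards, index)` keeps every `num_shards`-th element starting at `index`. A minimal plain-Python sketch of that element-level behavior (not the actual TensorFlow implementation):

```python
def shard(elements, num_shards, index):
    """Mimic tf.data.Dataset.shard: keep every num_shards-th
    element of `elements`, starting at position `index`."""
    return elements[index::num_shards]

data = list(range(10))
print(shard(data, num_shards=2, index=0))  # [0, 2, 4, 6, 8]
print(shard(data, num_shards=2, index=1))  # [1, 3, 5, 7, 9]
```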
@@ -370,6 +370,31 @@
370370
"Another thing to notice is the batch size for the `datasets`. In the code snippet above, we use `global_batch_size = per_worker_batch_size * num_workers`, which is `num_workers` times as large as it was in the single-worker case. This is because the effective per-worker batch size is the global batch size (the parameter passed to `tf.data.Dataset.batch()`) divided by the number of workers; with this change, the per-worker batch size stays the same as before."
371371
]
372372
},
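The batch-size arithmetic described above can be sketched in plain Python (the numbers are illustrative):

```python
per_worker_batch_size = 64  # batch size used in single-worker training
num_workers = 4             # illustrative cluster size

# Scale up the global batch size so each worker keeps its original batch size.
global_batch_size = per_worker_batch_size * num_workers

# The effective per-worker batch size is the global batch size
# (the value passed to tf.data.Dataset.batch()) divided by num_workers.
effective_per_worker_batch_size = global_batch_size // num_workers
print(global_batch_size, effective_per_worker_batch_size)  # 256 64
```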
373+
{
374+
"cell_type": "markdown",
375+
"metadata": {
376+
"colab_type": "text",
377+
"id": "gmqvlh5LhAoU"
378+
},
379+
"source": [
380+
"### Evaluation\n",
381+
"\n",
382+
"If you pass `validation_data` into `model.fit`, it will alternate between training and evaluation for each epoch. Evaluation with `validation_data` is distributed across the same set of workers, and the evaluation results are aggregated and available to all workers. As with training, the validation dataset is automatically sharded at the file level. You need to set a global batch size in the validation dataset and set `validation_steps`. A repeated dataset is also recommended for evaluation.\n",
383+
"\n",
384+
"Alternatively, you can create another task that periodically reads checkpoints and runs the evaluation. This is what Estimator does, but it is not a recommended way to perform evaluation, so its details are omitted."
385+
]
386+
},
387+
{
388+
"cell_type": "markdown",
389+
"metadata": {
390+
"colab_type": "text",
391+
"id": "CsQsRfBxKMcw"
392+
},
393+
"source": [
394+
"### Prediction\n",
395+
"Currently `model.predict` doesn't work with `MultiWorkerMirroredStrategy`."
396+
]
397+
},
373398
{
374399
"cell_type": "markdown",
375400
"metadata": {
