
Commit f292e1c

w-xinyi authored and copybara-github committed
Update documentation for data preprocessing and lookup table with tf.distribute.
PiperOrigin-RevId: 430111286
1 parent 465c891 commit f292e1c


site/en/tutorials/distribute/input.ipynb

Lines changed: 240 additions & 0 deletions
@@ -626,6 +626,245 @@
 "metadata": {
 "id": "-OAa6svUzuWm"
 },
+"source": [
+"## Data Preprocessing"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "pSMrs3kJQexW"
+},
+"source": [
+"So far, we have discussed how to distribute a `tf.data.Dataset`. Yet before the data is ready for the model, there is the crucial step of preprocessing it, e.g., cleansing, transforming, and augmenting. Two handy sets of tools for this are:\n",
+"\n",
+"* [Keras preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers): a set of Keras layers that allow developers to build Keras-native input processing pipelines. Some Keras preprocessing layers contain non-trainable states, which can be set on initialization or [\"adapted\"](https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method) (see the short `adapt` sketch after this cell). When distributing stateful preprocessing layers, the states should be replicated to all workers. To use these layers, you can either make them part of the model or apply them to the datasets.\n",
+"\n",
+"* [TensorFlow Transform (tf.Transform)](https://www.tensorflow.org/tfx/transform/get_started): a library for TensorFlow that allows you to define both instance-level and full-pass data transformations through data preprocessing pipelines. TensorFlow Transform has two phases. The first is the Analyze phase, where the raw training data is analyzed in a full-pass process to compute the statistics needed for the transformations, and the transformation logic is generated as instance-level operations. The second is the Transform phase, where the raw training data is transformed in an instance-level process.\n",
+"\n"
+]
+},
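As a quick illustration of the "adapted" states mentioned above, here is a minimal sketch, with made-up sample data, of a `Normalization` layer computing its non-trainable mean and variance before training:

```
import numpy as np
import tensorflow as tf

# Made-up sample data; in practice this would be (a subset of) your training data.
sample_data = np.array([[1.0], [2.0], [3.0]], dtype="float32")

# Normalization "adapts" its non-trainable state (mean and variance) from the data.
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(sample_data)

print(norm_layer(sample_data))  # Roughly zero-mean, unit-variance values.
```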
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "Pd4aUCFdVlZ1"
+},
+"source": [
+"### Keras preprocessing layers vs. TensorFlow Transform\n",
+"\n",
+"Both TensorFlow Transform and Keras preprocessing layers provide a way to split out preprocessing during training and bundle preprocessing with a model during inference, reducing training/serving skew.\n",
+"\n",
+"TensorFlow Transform, deeply integrated with [TFX](https://www.tensorflow.org/tfx), provides a scalable map-reduce solution to analyzing and transforming datasets of any size in a job separate from the training pipeline. If you need to run an analysis on a dataset that cannot fit on a single machine, TensorFlow Transform should be your first choice.\n",
+"\n",
+"Keras preprocessing layers are more geared towards preprocessing applied during training, after reading data from disk. They fit seamlessly with model development in the Keras library. They support analysis of a smaller dataset via [`adapt`](https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method) and support use cases like image data augmentation, where each pass over the input dataset will yield different examples for training.\n",
+"\n",
+"The two libraries can also be mixed, where TensorFlow Transform is used for analysis and static transformations of input data, and Keras preprocessing layers are used for train-time transformations (e.g., one-hot encoding or data augmentation); see the sketch after this cell.\n",
+"\n"
+]
+},
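A minimal sketch of mixing the two libraries, assuming a placeholder `working_dir` that holds the output of an earlier tf.Transform Analyze phase and a placeholder `raw_dataset` of (features, label) pairs:

```
import tensorflow as tf
import tensorflow_transform as tft

# Assumptions: working_dir holds the output of a prior tf.Transform Analyze
# phase, and raw_dataset yields (features_dict, label) pairs.
tf_transform_output = tft.TFTransformOutput(working_dir)
tft_layer = tf_transform_output.transform_features_layer()

# Static, full-pass transformations from tf.Transform applied in the tf.data pipeline.
dataset = raw_dataset.map(lambda features, label: (tft_layer(features), label))

# Train-time transformations (here, image augmentation) applied as Keras layers.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])
```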
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "MReKhhZpHUpj"
+},
+"source": [
+"### Best Practice with tf.distribute\n",
+"\n",
+"Working with both tools involves initializing the transformation logic to apply to data, which might create TensorFlow resources. These resources or states should be replicated to all workers to save inter-worker or worker-coordinator communication. To do so, we recommend you create Keras preprocessing layers, `tft.TFTransformOutput.transform_features_layer`, or `tft.TransformFeaturesLayer` under `tf.distribute.Strategy.scope()`, just like you would for any other Keras layers.\n",
+"\n",
+"We will demonstrate examples with the high-level Keras `Model.fit` API and with a custom training loop separately."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "rwEGMWuoX7kJ"
+},
+"source": [
+"#### Extra notes for users of Keras preprocessing layers\n",
+"\n",
+"**Preprocessing layers and large vocabularies**\n",
+"\n",
+"When dealing with large vocabularies (over one gigabyte) in a multi-worker setting (i.e., `tf.distribute.MultiWorkerMirroredStrategy`, `tf.distribute.experimental.ParameterServerStrategy`, `tf.distribute.TPUStrategy`), we recommend saving the vocabulary to a static file accessible from all workers (e.g., with Cloud Storage). This will reduce the time spent replicating the vocabulary to all workers during training.\n",
+"\n",
+"**Preprocessing in data pipeline vs. in model**\n",
+"\n",
+"While Keras preprocessing layers can be applied either as part of the model or directly to a `tf.data.Dataset`, each option comes with its own advantages (both placements are sketched after this cell):\n",
+"\n",
+"* Applying the layers in the model makes your model portable, and it helps reduce training/serving skew. ([more details](https://www.tensorflow.org/guide/keras/preprocessing_layers#benefits_of_doing_preprocessing_inside_the_model_at_inference_time))\n",
+"* Applying the layers in the `tf.data` pipeline allows prefetching or offloading to the CPU, which generally gives better performance when using accelerators.\n",
+"\n",
+"When running on TPU, users should almost always place preprocessing layers in the `tf.data` pipeline, as not all layers support TPU, and string ops do not execute on TPU. (The two exceptions are `Normalization` and `Rescaling`, which run fine on TPU and are commonly used as the first layer in an image model.)"
+]
+},
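A minimal sketch of both placements, assuming a hypothetical vocabulary file `VOCAB_FILE_PATH` readable from all workers (per the large-vocabulary note above) and a `dataset` of (string feature, label) pairs:

```
import tensorflow as tf

# Assumptions: VOCAB_FILE_PATH is a vocabulary file accessible to all workers
# (e.g., on Cloud Storage); dataset yields (string_feature, label) pairs.
lookup = tf.keras.layers.StringLookup(vocabulary=VOCAB_FILE_PATH)

# Option 1: apply the layer in the tf.data pipeline (prefetching/CPU offloading).
dataset_in_pipeline = dataset.map(lambda x, y: (lookup(x), y))

# Option 2: apply the layer inside the model (portability, less training/serving skew).
model_with_lookup = tf.keras.Sequential([
    lookup,
    tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=16),
    tf.keras.layers.Dense(1),
])
```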
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "hNCYZ9L-BD2R"
+},
+"source": [
+"### Model.fit"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "NhRB2Xe8B6bX"
+},
+"source": [
+"Users of Keras `Model.fit` do not need to distribute data with `tf.distribute.Strategy.experimental_distribute_dataset` or `tf.distribute.Strategy.distribute_datasets_from_function` themselves. Check out the [Working with Preprocessing Layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide and the [Distributed Training with Keras](https://www.tensorflow.org/tutorials/distribute/keras) guide for details. A shortened example may look like the following:\n",
+"\n",
+"```\n",
+"strategy = tf.distribute.MirroredStrategy()\n",
+"with strategy.scope():\n",
+"  # Create the layer(s) under scope.\n",
+"  integer_preprocessing_layer = tf.keras.layers.IntegerLookup(vocabulary=FILE_PATH)\n",
+"  model = ...\n",
+"  model.compile(...)\n",
+"dataset = dataset.map(lambda x, y: (integer_preprocessing_layer(x), y))\n",
+"model.fit(dataset)\n",
+"```\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "3zL2vzJ-G0yg"
+},
+"source": [
+"Users of `tf.distribute.experimental.ParameterServerStrategy` with the `Model.fit` API need to use a `tf.keras.utils.experimental.DatasetCreator` as the input. (See the [Parameter Server Training](https://www.tensorflow.org/tutorials/distribute/parameter_server_training#parameter_server_training_with_modelfit_api) guide for more details.)\n",
+"\n",
+"```\n",
+"strategy = tf.distribute.experimental.ParameterServerStrategy(\n",
+"    cluster_resolver,\n",
+"    variable_partitioner=variable_partitioner)\n",
+"\n",
+"with strategy.scope():\n",
+"  preprocessing_layer = tf.keras.layers.StringLookup(vocabulary=FILE_PATH)\n",
+"  model = ...\n",
+"  model.compile(...)\n",
+"\n",
+"def dataset_fn(input_context):\n",
+"  ...\n",
+"  dataset = dataset.map(preprocessing_layer)\n",
+"  ...\n",
+"  return dataset\n",
+"\n",
+"dataset_creator = tf.keras.utils.experimental.DatasetCreator(dataset_fn)\n",
+"model.fit(dataset_creator, epochs=5, steps_per_epoch=20, callbacks=callbacks)\n",
+"```"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "imZLQUOYBJyW"
+},
+"source": [
+"### Custom Training Loop"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "r2PX1QH_OwU3"
+},
+"source": [
+"When writing a [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training), you will distribute your data with either the `tf.distribute.Strategy.experimental_distribute_dataset` API or the `tf.distribute.Strategy.distribute_datasets_from_function` API. If you distribute your dataset with `tf.distribute.Strategy.experimental_distribute_dataset` and apply these preprocessing APIs in your data pipeline, the resources are automatically co-located with the data pipeline, avoiding remote resource access. Thus we will only demonstrate examples with `tf.distribute.Strategy.distribute_datasets_from_function`, in which case it is crucial to place the initialization of these APIs under `strategy.scope()` for efficiency:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"id": "wJS1UmcWQeab"
+},
+"outputs": [],
+"source": [
+"strategy = tf.distribute.MirroredStrategy()\n",
+"vocab = [\"a\", \"b\", \"c\", \"d\", \"f\"]\n",
+"\n",
+"with strategy.scope():\n",
+"  # Create the layer(s) under scope.\n",
+"  layer = tf.keras.layers.StringLookup(vocabulary=vocab)\n",
+"\n",
+"def dataset_fn(input_context):\n",
+"  # A tf.data.Dataset of raw strings.\n",
+"  dataset = tf.data.Dataset.from_tensor_slices([\"a\", \"c\", \"e\"]).repeat()\n",
+"\n",
+"  # Customize your batching, sharding, prefetching, etc.\n",
+"  global_batch_size = 4\n",
+"  batch_size = input_context.get_per_replica_batch_size(global_batch_size)\n",
+"  dataset = dataset.batch(batch_size)\n",
+"  dataset = dataset.shard(\n",
+"      input_context.num_input_pipelines,\n",
+"      input_context.input_pipeline_id)\n",
+"\n",
+"  # Apply the preprocessing layer(s) to the tf.data.Dataset.\n",
+"  def preprocess_with_kpl(input):\n",
+"    return layer(input)\n",
+"\n",
+"  processed_ds = dataset.map(preprocess_with_kpl)\n",
+"  return processed_ds\n",
+"\n",
+"distributed_dataset = strategy.distribute_datasets_from_function(dataset_fn)\n",
+"\n",
+"# Print out a few example batches.\n",
+"distributed_dataset_iterator = iter(distributed_dataset)\n",
+"for _ in range(3):\n",
+"  print(next(distributed_dataset_iterator))"
+]
+},
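To tie this into an actual loop, a minimal sketch of consuming the distributed dataset with `tf.distribute.Strategy.run`, assuming a `model` and an `optimizer` created under the same `strategy.scope()` (the loss below is a stand-in for a real objective):

```
@tf.function
def train_step(batch):
  with tf.GradientTape() as tape:
    predictions = model(batch, training=True)
    loss = tf.reduce_mean(predictions)  # Placeholder loss for illustration only.
  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

for batch in distributed_dataset:
  # strategy.run replicates train_step, feeding each replica its local batch.
  per_replica_losses = strategy.run(train_step, args=(batch,))
```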
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "PVl1cblWQy8b"
+},
+"source": [
+"Note that if you are training with `tf.distribute.experimental.ParameterServerStrategy`, you will also call `tf.distribute.experimental.coordinator.ClusterCoordinator.create_per_worker_dataset`:\n",
+"\n",
+"```\n",
+"@tf.function\n",
+"def per_worker_dataset_fn():\n",
+"  return strategy.distribute_datasets_from_function(dataset_fn)\n",
+"\n",
+"per_worker_dataset = coordinator.create_per_worker_dataset(per_worker_dataset_fn)\n",
+"per_worker_iterator = iter(per_worker_dataset)\n",
+"```\n"
+]
+},
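As a sketch of how that per-worker iterator is then consumed, assuming a `coordinator` (`tf.distribute.experimental.coordinator.ClusterCoordinator`), a `step_fn`, and a `num_steps` defined elsewhere:

```
# Assumptions: coordinator is a ClusterCoordinator built from the strategy,
# step_fn is a tf.function that takes an iterator, and num_steps is set elsewhere.
for _ in range(num_steps):
  # The coordinator dispatches step_fn to remote workers; each call pulls the
  # next batch from the per-worker iterator.
  coordinator.schedule(step_fn, args=(per_worker_iterator,))

# Block until all scheduled functions have finished.
coordinator.join()
```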
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "Ol7SmPID1dAt"
+},
+"source": [
+"For TensorFlow Transform, as mentioned above, the Analyze stage is done separately from training and is thus omitted here. See the [tutorial](https://www.tensorflow.org/tfx/tutorials/transform/census) for a detailed how-to. Usually, this stage includes creating a `tf.Transform` preprocessing function and transforming the data in an [Apache Beam](https://beam.apache.org/) pipeline with this preprocessing function. At the end of the Analyze stage, the output can be exported as a TensorFlow graph, which you can use for both training and serving. Our example covers only the training pipeline part:\n",
+"\n",
+"```\n",
+"with strategy.scope():\n",
+"  # working_dir contains the tf.Transform output.\n",
+"  tf_transform_output = tft.TFTransformOutput(working_dir)\n",
+"  # Load from working_dir to create a Keras layer that applies the tf.Transform output to data.\n",
+"  tft_layer = tf_transform_output.transform_features_layer()\n",
+"  ...\n",
+"\n",
+"def dataset_fn(input_context):\n",
+"  ...\n",
+"  dataset = dataset.map(tft_layer, num_parallel_calls=tf.data.AUTOTUNE)\n",
+"  ...\n",
+"  return dataset\n",
+"\n",
+"distributed_dataset = strategy.distribute_datasets_from_function(dataset_fn)\n",
+"```"
+]
+},
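For context, the omitted Analyze stage revolves around a `tf.Transform` preprocessing function; a minimal, hypothetical sketch (the feature names are placeholders) could look like:

```
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  return {
      # Full-pass statistics (mean/variance over the whole dataset), applied per instance.
      "dense_scaled": tft.scale_to_z_score(inputs["dense_feature"]),
      # Full-pass vocabulary computation, then instance-level integer mapping.
      "vocab_ids": tft.compute_and_apply_vocabulary(inputs["vocab_feature"]),
  }
```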
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "3_IQxRXxQWof"
+},
 "source": [
 "## Partial Batches"
 ]
@@ -827,6 +1066,7 @@
 "colab": {
 "collapsed_sections": [],
 "name": "input.ipynb",
+"provenance": [],
 "toc_visible": true
 },
 "kernelspec": {
