|
463 | 463 | "embedding(string_lookup_layer(['small', 'medium', 'large']))"
|
464 | 464 | ]
|
465 | 465 | },
|
| 466 | + { |
| 467 | + "cell_type": "markdown", |
| 468 | + "metadata": { |
| 469 | + "id": "UwqvADV6HRdC" |
| 470 | + }, |
| 471 | + "source": [ |
| 472 | + "## Summing weighted categorical data\n", |
| 473 | + "\n", |
| 474 | + "In some cases, you need to deal with categorical data where each occurrence of a category comes with an associated weight. In feature columns, this is handled with `tf.feature_column.weighted_categorical_column`. When paired with an `indicator_column`, this has the effect of summing weights per category." |
| 475 | + ] |
| 476 | + }, |
| 477 | + { |
| 478 | + "cell_type": "code", |
| 479 | + "execution_count": null, |
| 480 | + "metadata": { |
| 481 | + "id": "02HqjPLMRxWn" |
| 482 | + }, |
| 483 | + "outputs": [], |
| 484 | + "source": [ |
| 485 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 486 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 487 | + "\n", |
| 488 | + "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", |
| 489 | + " 'ids', num_buckets=20)\n", |
| 490 | + "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", |
| 491 | + " categorical_col, 'weights')\n", |
| 492 | + "indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)\n", |
| 493 | + "call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})" |
| 494 | + ] |
| 495 | + }, |
| 496 | + { |
| 497 | + "cell_type": "markdown", |
| 498 | + "metadata": { |
| 499 | + "id": "98jaq7Q3S9aG" |
| 500 | + }, |
| 501 | + "source": [ |
| 502 | + "In Keras, this can be done by passing a `count_weights` input to `tf.keras.layers.CategoryEncoding` with `output_mode='count'`." |
| 503 | + ] |
| 504 | + }, |
| 505 | + { |
| 506 | + "cell_type": "code", |
| 507 | + "execution_count": null, |
| 508 | + "metadata": { |
| 509 | + "id": "JsoYUUgRS7hu" |
| 510 | + }, |
| 511 | + "outputs": [], |
| 512 | + "source": [ |
| 513 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 514 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 515 | + "\n", |
| 516 | + "# Using sparse output is more efficient when `num_tokens` is large.\n", |
| 517 | + "count_layer = tf.keras.layers.CategoryEncoding(\n", |
| 518 | + " num_tokens=20, output_mode='count', sparse=True)\n", |
| 519 | + "tf.sparse.to_dense(count_layer(ids, count_weights=weights))" |
| 520 | + ] |
| 521 | + }, |
| 522 | + { |
| 523 | + "cell_type": "markdown", |
| 524 | + "metadata": { |
| 525 | + "id": "gBJxb6y2GasI" |
| 526 | + }, |
| 527 | + "source": [ |
| 528 | + "## Embedding weighted categorical data\n", |
| 529 | + "\n", |
| 530 | + "You might alternatively want to embed weighted categorical inputs. In feature columns, the `embedding_column` contains a `combiner` argument. If any sample\n", |
| 531 | + "contains multiple entries for a category, they will be combined according to the argument setting (by default `'mean'`)." |
| 532 | + ] |
| 533 | + }, |
| 534 | + { |
| 535 | + "cell_type": "code", |
| 536 | + "execution_count": null, |
| 537 | + "metadata": { |
| 538 | + "id": "AjOt1wgmT5mM" |
| 539 | + }, |
| 540 | + "outputs": [], |
| 541 | + "source": [ |
| 542 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 543 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 544 | + "\n", |
| 545 | + "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", |
| 546 | + " 'ids', num_buckets=20)\n", |
| 547 | + "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", |
| 548 | + " categorical_col, 'weights')\n", |
| 549 | + "embedding_col = tf1.feature_column.embedding_column(\n", |
| 550 | + " weighted_categorical_col, 4, combiner='mean')\n", |
| 551 | + "call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})" |
| 552 | + ] |
| 553 | + }, |
| 554 | + { |
| 555 | + "cell_type": "markdown", |
| 556 | + "metadata": { |
| 557 | + "id": "fd6eluARXndC" |
| 558 | + }, |
| 559 | + "source": [ |
| 560 | + "In Keras, there is no `combiner` option to `tf.keras.layers.Embedding`, but you can achieve the same effect with `tf.keras.layers.Dense`. The `embedding_column` above is simply linearly combining embedding vectors according to category weight. Though not obvious at first, it is exactly equivalent to representing your categorical inputs as a sparse weight vector of size `(num_tokens)`, and multiplying them by a `Dense` kernel of shape `(embedding_size, num_tokens)`." |
| 561 | + ] |
| 562 | + }, |
| 563 | + { |
| 564 | + "cell_type": "code", |
| 565 | + "execution_count": null, |
| 566 | + "metadata": { |
| 567 | + "id": "Y-vZvPyiYilE" |
| 568 | + }, |
| 569 | + "outputs": [], |
| 570 | + "source": [ |
| 571 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 572 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 573 | + "\n", |
| 574 | + "# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line\n", |
| 575 | + "# would be equivalent to an `embedding_column` with `combiner='sum'`.\n", |
| 576 | + "weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)\n", |
| 577 | + "\n", |
| 578 | + "count_layer = tf.keras.layers.CategoryEncoding(\n", |
| 579 | + " num_tokens=20, output_mode='count', sparse=True)\n", |
| 580 | + "embedding_layer = tf.keras.layers.Dense(4, use_bias=False)\n", |
| 581 | + "embedding_layer(count_layer(ids, count_weights=weights))" |
| 582 | + ] |
| 583 | + }, |
466 | 584 | {
|
467 | 585 | "cell_type": "markdown",
|
468 | 586 | "metadata": {
|
|
843 | 961 | "\n",
|
844 | 962 | "\\* `output_mode` can be passed to `layers.CategoryEncoding`, `layers.StringLookup`, `layers.IntegerLookup`, and `layers.TextVectorization`.\n",
|
845 | 963 | "\n",
|
846 |
| - "† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not one-to-one replacement for categorical sequence handling in TF1, but may offer a convinient replacement for ad-hoc text preprocessing." |
| 964 | + "† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TF1, but may offer a convenient replacement for ad-hoc text preprocessing.\n", |
| 965 | + "\n", |
| 966 | + "Note: Linear estimators, such as `tf.estimator.LinearClassifier`, can handle direct categorical input (integer indices) without an `embedding_column` or `indicator_column`. However, integer indices cannot be passed directly to `tf.keras.layers.Dense` or `tf.keras.experimental.LinearModel`. These inputs should first be encoded with `tf.keras.layers.CategoryEncoding` with `output_mode='count'` (and `sparse=True` if the category sizes are large) before calling into `Dense` or `LinearModel`." |
847 | 967 | ]
|
848 | 968 | },
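The note in the cell above can be sketched as follows. This is a minimal illustration, not part of the notebook's cells; the specific ids and `num_tokens` value are made up for the example:

```python
import tensorflow as tf

# Hypothetical integer category indices for a batch of one example.
ids = tf.constant([[3, 7, 3]])

# Count-encode the indices so they can feed a linear layer.
# Pass sparse=True when the number of categories is large.
count_layer = tf.keras.layers.CategoryEncoding(num_tokens=10, output_mode='count')
counts = count_layer(ids)  # shape (1, 10); index 3 is counted twice

# A Dense layer with a single unit acts as a linear model over the counts.
linear_layer = tf.keras.layers.Dense(1)
output = linear_layer(counts)  # shape (1, 1)
```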
|
849 | 969 | {
|
|
863 | 983 | "colab": {
|
864 | 984 | "collapsed_sections": [],
|
865 | 985 | "name": "migrating_feature_columns.ipynb",
|
866 |
| - "provenance": [], |
867 | 986 | "toc_visible": true
|
868 | 987 | },
|
869 | 988 | "kernelspec": {
|
|