
Commit bff734b

Add a feature_column migration section for weighted categorical inputs
PiperOrigin-RevId: 405712895
1 parent 87afc30 commit bff734b


site/en/guide/migrate/migrating_feature_columns.ipynb

Lines changed: 121 additions & 2 deletions
@@ -463,6 +463,124 @@
 "embedding(string_lookup_layer(['small', 'medium', 'large']))"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "UwqvADV6HRdC"
+},
+"source": [
+"## Summing weighted categorical data\n",
+"\n",
+"In some cases, you need to deal with categorical data where each occurrence of a category comes with an associated weight. In feature columns, this is handled with `tf.feature_column.weighted_categorical_column`. When paired with an `indicator_column`, this has the effect of summing weights per category."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"id": "02HqjPLMRxWn"
+},
+"outputs": [],
+"source": [
+"ids = tf.constant([[5, 11, 5, 17, 17]])\n",
+"weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
+"\n",
+"categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
+"    'ids', num_buckets=20)\n",
+"weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n",
+"    categorical_col, 'weights')\n",
+"indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)\n",
+"call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "98jaq7Q3S9aG"
+},
+"source": [
+"In Keras, this can be done by passing a `count_weights` input to `tf.keras.layers.CategoryEncoding` with `output_mode='count'`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"id": "JsoYUUgRS7hu"
+},
+"outputs": [],
+"source": [
+"ids = tf.constant([[5, 11, 5, 17, 17]])\n",
+"weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
+"\n",
+"# Using sparse output is more efficient when `num_tokens` is large.\n",
+"count_layer = tf.keras.layers.CategoryEncoding(\n",
+"    num_tokens=20, output_mode='count', sparse=True)\n",
+"tf.sparse.to_dense(count_layer(ids, count_weights=weights))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "gBJxb6y2GasI"
+},
+"source": [
+"## Embedding weighted categorical data\n",
+"\n",
+"You might alternatively want to embed weighted categorical inputs. In feature columns, the `embedding_column` contains a `combiner` argument. If any sample\n",
+"contains multiple entries for a category, they will be combined according to the argument setting (by default `'mean'`)."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"id": "AjOt1wgmT5mM"
+},
+"outputs": [],
+"source": [
+"ids = tf.constant([[5, 11, 5, 17, 17]])\n",
+"weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
+"\n",
+"categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
+"    'ids', num_buckets=20)\n",
+"weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n",
+"    categorical_col, 'weights')\n",
+"embedding_col = tf1.feature_column.embedding_column(\n",
+"    weighted_categorical_col, 4, combiner='mean')\n",
+"call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"id": "fd6eluARXndC"
+},
+"source": [
+"In Keras, there is no `combiner` option to `tf.keras.layers.Embedding`, but you can achieve the same effect with `tf.keras.layers.Dense`. The `embedding_column` above simply combines embedding vectors linearly according to category weight. Though not obvious at first, it is exactly equivalent to representing your categorical inputs as a sparse weight vector of size `(num_tokens)` and multiplying them by a `Dense` kernel of shape `(num_tokens, embedding_size)`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"id": "Y-vZvPyiYilE"
+},
+"outputs": [],
+"source": [
+"ids = tf.constant([[5, 11, 5, 17, 17]])\n",
+"weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
+"\n",
+"# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line\n",
+"# would be equivalent to an `embedding_column` with `combiner='sum'`.\n",
+"weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)\n",
+"\n",
+"count_layer = tf.keras.layers.CategoryEncoding(\n",
+"    num_tokens=20, output_mode='count', sparse=True)\n",
+"embedding_layer = tf.keras.layers.Dense(4, use_bias=False)\n",
+"embedding_layer(count_layer(ids, count_weights=weights))"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {
@@ -843,7 +961,9 @@
 "\n",
 "\* `output_mode` can be passed to `layers.CategoryEncoding`, `layers.StringLookup`, `layers.IntegerLookup`, and `layers.TextVectorization`.\n",
 "\n",
-"† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TF1, but may offer a convenient replacement for ad-hoc text preprocessing."
+"† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TF1, but may offer a convenient replacement for ad-hoc text preprocessing.\n",
+"\n",
+"Note: Linear estimators, such as `tf.estimator.LinearClassifier`, can handle direct categorical input (integer indices) without an `embedding_column` or `indicator_column`. However, integer indices cannot be passed directly to `tf.keras.layers.Dense` or `tf.keras.experimental.LinearModel`. These inputs should first be encoded with `tf.keras.layers.CategoryEncoding` with `output_mode='count'` (and `sparse=True` if the number of categories is large) before calling into `Dense` or `LinearModel`."
 ]
 },
 {
@@ -863,7 +983,6 @@
 "colab": {
 "collapsed_sections": [],
 "name": "migrating_feature_columns.ipynb",
-"provenance": [],
 "toc_visible": true
 },
 "kernelspec": {
