|
463 | 463 | "embedding(string_lookup_layer(['small', 'medium', 'large']))"
|
464 | 464 | ]
|
465 | 465 | },
|
| 466 | + { |
| 467 | + "cell_type": "markdown", |
| 468 | + "metadata": { |
| 469 | + "id": "UwqvADV6HRdC" |
| 470 | + }, |
| 471 | + "source": [ |
| 472 | + "## Summing weighted categorical data\n", |
| 473 | + "\n", |
| 474 | + "In some cases, you need to deal with categorical data where each occurrence of a category comes with an associated weight. In feature columns, this is handled with `tf.feature_column.weighted_categorical_column`. When paired with an `indicator_column`, this has the effect of summing weights per category." |
| 475 | + ] |
| 476 | + }, |
| 477 | + { |
| 478 | + "cell_type": "code", |
| 479 | + "execution_count": null, |
| 480 | + "metadata": { |
| 481 | + "id": "02HqjPLMRxWn" |
| 482 | + }, |
| 483 | + "outputs": [], |
| 484 | + "source": [ |
| 485 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 486 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 487 | + "\n", |
| 488 | + "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", |
| 489 | + " 'ids', num_buckets=20)\n", |
| 490 | + "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", |
| 491 | + " categorical_col, 'weights')\n", |
| 492 | + "indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)\n", |
| 493 | + "call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})" |
| 494 | + ] |
| 495 | + }, |
| 496 | + { |
| 497 | + "cell_type": "markdown", |
| 498 | + "metadata": { |
| 499 | + "id": "98jaq7Q3S9aG" |
| 500 | + }, |
| 501 | + "source": [ |
| 502 | + "In Keras, this can be done by passing a `count_weights` input to `tf.keras.layers.CategoryEncoding` with `output_mode='count'`." |
| 503 | + ] |
| 504 | + }, |
| 505 | + { |
| 506 | + "cell_type": "code", |
| 507 | + "execution_count": null, |
| 508 | + "metadata": { |
| 509 | + "id": "JsoYUUgRS7hu" |
| 510 | + }, |
| 511 | + "outputs": [], |
| 512 | + "source": [ |
| 513 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 514 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 515 | + "\n", |
| 516 | + "# Using sparse output is more efficient when `num_tokens` is large.\n", |
| 517 | + "count_layer = tf.keras.layers.CategoryEncoding(\n", |
| 518 | + " num_tokens=20, output_mode='count', sparse=True)\n", |
| 519 | + "tf.sparse.to_dense(count_layer(ids, count_weights=weights))" |
| 520 | + ] |
| 521 | + }, |
| 522 | + { |
| 523 | + "cell_type": "markdown", |
| 524 | + "metadata": { |
| 525 | + "id": "gBJxb6y2GasI" |
| 526 | + }, |
| 527 | + "source": [ |
| 528 | + "## Embedding weighted categorical data\n", |
| 529 | + "\n", |
| 530 | + "You might alternatively want to embed weighted categorical inputs. In feature columns, the `embedding_column` contains a `combiner` argument. If any sample\n", |
| 531 | + "contains multiple entries for a category, they will be combined according to the argument setting (by default `'mean'`)." |
| 532 | + ] |
| 533 | + }, |
| 534 | + { |
| 535 | + "cell_type": "code", |
| 536 | + "execution_count": null, |
| 537 | + "metadata": { |
| 538 | + "id": "AjOt1wgmT5mM" |
| 539 | + }, |
| 540 | + "outputs": [], |
| 541 | + "source": [ |
| 542 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 543 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 544 | + "\n", |
| 545 | + "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", |
| 546 | + " 'ids', num_buckets=20)\n", |
| 547 | + "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", |
| 548 | + " categorical_col, 'weights')\n", |
| 549 | + "embedding_col = tf1.feature_column.embedding_column(\n", |
| 550 | + " weighted_categorical_col, 4, combiner='mean')\n", |
| 551 | + "call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})" |
| 552 | + ] |
| 553 | + }, |
| 554 | + { |
| 555 | + "cell_type": "markdown", |
| 556 | + "metadata": { |
| 557 | + "id": "fd6eluARXndC" |
| 558 | + }, |
| 559 | + "source": [ |
| 560 | + "In Keras, there is no `combiner` option to `tf.keras.layers.Embedding`, but you can achieve the same effect with `tf.keras.layers.Dense`. The `embedding_column` above is simply linearly combining embedding vectors according to category weight. Though not obvious at first, it is exactly equivalent to representing your categorical inputs as a sparse weight vector of size `(num_tokens)`, and multiplying them by a `Dense` kernel of shape `(embedding_size, num_tokens)`." |
| 561 | + ] |
| 562 | + }, |
| 563 | + { |
| 564 | + "cell_type": "code", |
| 565 | + "execution_count": null, |
| 566 | + "metadata": { |
| 567 | + "id": "Y-vZvPyiYilE" |
| 568 | + }, |
| 569 | + "outputs": [], |
| 570 | + "source": [ |
| 571 | + "ids = tf.constant([[5, 11, 5, 17, 17]])\n", |
| 572 | + "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", |
| 573 | + "\n", |
| 574 | + "# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line\n", |
| 575 | + "# would be equivalent to an `embedding_column` with `combiner='sum'`.\n", |
| 576 | + "weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)\n", |
| 577 | + "\n", |
| 578 | + "count_layer = tf.keras.layers.CategoryEncoding(\n", |
| 579 | + " num_tokens=20, output_mode='count', sparse=True)\n", |
| 580 | + "embedding_layer = tf.keras.layers.Dense(4, use_bias=False)\n", |
| 581 | + "embedding_layer(count_layer(ids, count_weights=weights))" |
| 582 | + ] |
| 583 | + }, |
466 | 584 | {
|
467 | 585 | "cell_type": "markdown",
|
468 | 586 | "metadata": {
|
|
843 | 961 | "\n",
|
844 | 962 | "\\* `output_mode` can be passed to `layers.CategoryEncoding`, `layers.StringLookup`, `layers.IntegerLookup`, and `layers.TextVectorization`.\n",
|
845 | 963 | "\n",
|
846 |
| - "† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not one-to-one replacement for categorical sequence handling in TF1, but may offer a convinient replacement for ad-hoc text preprocessing." |
| 964 | + "† `layers.TextVectorization` can handle freeform text input directly (e.g. entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TF1, but may offer a convenient replacement for ad-hoc text preprocessing.\n", |
| 965 | + "\n", |
| 966 | + "Note: Linear estimators, such as `tf.estimator.LinearClassifier`, can handle direct categorical input (integer indices) without an `embedding_column` or `indicator_column`. However, integer indices cannot be passed directly to `tf.keras.layers.Dense` or `tf.keras.experimental.LinearModel`. These inputs should first be encoded with `tf.keras.layers.CategoryEncoding` with `output_mode='count'` (and `sparse=True` if the category sizes are large) before calling into `Dense` or `LinearModel`." |
847 | 967 | ]
|
848 | 968 | },
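The note in the cell above can be sketched as follows. This is a minimal illustration, not part of the notebook's cells; the specific ids and `num_tokens` value are made up for the example:

```python
import tensorflow as tf

# Hypothetical integer category indices for a batch of one example.
ids = tf.constant([[3, 7, 3]])

# Count-encode the indices so they can feed a linear layer.
# Pass sparse=True when the number of categories is large.
count_layer = tf.keras.layers.CategoryEncoding(num_tokens=10, output_mode='count')
counts = count_layer(ids)  # shape (1, 10); index 3 is counted twice

# A Dense layer with a single unit acts as a linear model over the counts.
linear_layer = tf.keras.layers.Dense(1)
output = linear_layer(counts)  # shape (1, 1)
```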
|
849 | 969 | {
|
|
863 | 983 | "colab": {
|
864 | 984 | "collapsed_sections": [],
|
865 | 985 | "name": "migrating_feature_columns.ipynb",
|
866 |
| - "provenance": [], |
867 | 986 | "toc_visible": true
|
868 | 987 | },
|
869 | 988 | "kernelspec": {
|
|