Commit 397d3b8

update layer_string_lookup()
1 parent bf1ae99 commit 397d3b8

5 files changed

+114
-89
lines changed

NEWS.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 - `layer_hashing()` gains `output_mode` and `sparse` arguments.
 - `layer_integer_lookup()` gains `vocabulary_dtype` and `idf_weights` arguments.
 - `layer_normalization()` gains an `invert` argument.
-
+- `layer_string_lookup()` gains an `idf_weights` argument.
 
 - Fixed issue where `input_shape` supplied to custom layers defined with `new_layer_class()`
 would result in an error (#1338)
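
The new `idf_weights` argument is documented in this commit as weights that "will be multiplied by per sample term counts" when `output_mode` is `"tf_idf"`. The following base-R sketch illustrates only that arithmetic. The helper name `sketch_tf_idf()` is hypothetical and not part of the package, and the sketch deliberately ignores the OOV slot that the real layer places at the start of the vocabulary.

```r
# Hypothetical helper, not part of keras: per-sample term counts multiplied by
# the supplied IDF weights. Out-of-vocabulary tokens are simply dropped here,
# whereas the real layer reserves OOV indices for them.
sketch_tf_idf <- function(tokens, vocabulary, idf_weights) {
  # One weight per vocabulary term, as the argument documentation requires.
  stopifnot(length(idf_weights) == length(vocabulary))
  idx <- match(tokens, vocabulary)   # 1-based position, NA if not in vocabulary
  idx <- idx[!is.na(idx)]            # drop OOV tokens in this simplified sketch
  counts <- tabulate(idx, nbins = length(vocabulary))
  counts * idf_weights
}

sample_tokens <- c("a", "b", "a", "z")
result <- sketch_tf_idf(sample_tokens,
                        vocabulary  = c("a", "b", "c"),
                        idf_weights = c(0.5, 2, 1))
print(result)
```

With the layer itself, the same weights would be passed as `idf_weights` alongside `vocabulary` and `output_mode = "tf_idf"`.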

R/layers-preprocessing.R

Lines changed: 55 additions & 40 deletions
@@ -977,14 +977,16 @@ function(object,
 #'
 #' @details
 #' This layer translates a set of arbitrary strings into integer output via a
-#' table-based vocabulary lookup.
+#' table-based vocabulary lookup. This layer will perform no splitting or
+#' transformation of input strings. For a layer than can split and tokenize
+#' natural language, see the `layer_text_vectorization()` layer.
 #'
 #' The vocabulary for the layer must be either supplied on construction or
 #' learned via `adapt()`. During `adapt()`, the layer will analyze a data set,
-#' determine the frequency of individual strings tokens, and create a vocabulary
-#' from them. If the vocabulary is capped in size, the most frequent tokens will
-#' be used to create the vocabulary and all others will be treated as
-#' out-of-vocabulary (OOV).
+#' determine the frequency of individual strings tokens, and create a
+#' vocabulary from them. If the vocabulary is capped in size, the most frequent
+#' tokens will be used to create the vocabulary and all others will be treated
+#' as out-of-vocabulary (OOV).
 #'
 #' There are two possible output modes for the layer.
 #' When `output_mode` is `"int"`,
@@ -996,60 +998,68 @@ function(object,
 #' The vocabulary can optionally contain a mask token as well as an OOV token
 #' (which can optionally occupy multiple indices in the vocabulary, as set
 #' by `num_oov_indices`).
-#' The position of these tokens in the vocabulary is fixed. When `output_mode` is
-#' `"int"`, the vocabulary will begin with the mask token (if set), followed by
-#' OOV indices, followed by the rest of the vocabulary. When `output_mode` is
-#' `"multi_hot"`, `"count"`, or `"tf_idf"` the vocabulary will begin with OOV
-#' indices and instances of the mask token will be dropped.
+#' The position of these tokens in the vocabulary is fixed. When `output_mode`
+#' is `"int"`, the vocabulary will begin with the mask token (if set), followed
+#' by OOV indices, followed by the rest of the vocabulary. When `output_mode`
+#' is `"multi_hot"`, `"count"`, or `"tf_idf"` the vocabulary will begin with
+#' OOV indices and instances of the mask token will be dropped.
 #'
-#' @inheritParams layer_dense
+#' For an overview and full list of preprocessing layers, see the preprocessing
+#' [guide](https://www.tensorflow.org/guide/keras/preprocessing_layers).
 #'
-#' @param max_tokens The maximum size of the vocabulary for this layer. If `NULL`,
-#' there is no cap on the size of the vocabulary. Note that this size
-#' includes the OOV and mask tokens. Default to `NULL.`
+#' @param max_tokens Maximum size of the vocabulary for this layer. This should
+#' only be specified when adapting the vocabulary or when setting
+#' `pad_to_max_tokens = TRUE`. If NULL, there is no cap on the size of the
+#' vocabulary. Note that this size includes the OOV and mask tokens.
+#' Defaults to NULL.
 #'
 #' @param num_oov_indices The number of out-of-vocabulary tokens to use. If this
-#' value is more than 1, OOV inputs are hashed to determine their OOV value.
-#' If this value is 0, OOV inputs will cause an error when calling the layer.
-#' Defaults to 1.
+#' value is more than 1, OOV inputs are hashed to determine their OOV
+#' value. If this value is 0, OOV inputs will cause an error when calling
+#' the layer. Defaults to 1.
 #'
 #' @param mask_token A token that represents masked inputs. When `output_mode` is
 #' `"int"`, the token is included in vocabulary and mapped to index 0. In
 #' other output modes, the token will not appear in the vocabulary and
-#' instances of the mask token in the input will be dropped. If set to `NULL`,
-#' no mask term will be added. Defaults to `NULL`.
+#' instances of the mask token in the input will be dropped. If set to
+#' NULL, no mask term will be added. Defaults to `NULL`.
 #'
-#' @param oov_token Only used when `invert` is TRUE. The token to return for OOV
+#' @param oov_token Only used when `invert` is `TRUE`. The token to return for OOV
 #' indices. Defaults to `"[UNK]"`.
 #'
-#' @param vocabulary Optional. Either an array of strings or a string path to a text
-#' file. If passing an array, can pass a list, list, 1D numpy array, or 1D
-#' tensor containing the string vocabulary terms. If passing a file path, the
-#' file should contain one line per term in the vocabulary. If this argument
-#' is set, there is no need to `adapt` the layer.
+#' @param vocabulary Optional. Either an array of strings or a string path to a
+#' text file. If passing an array, can pass a character vector or
+#' or 1D tensor containing the string vocabulary terms. If passing a file
+#' path, the file should contain one line per term in the vocabulary. If
+#' this argument is set, there is no need to `adapt()` the layer.
 #'
-#' @param encoding String encoding. Default of `NULL` is equivalent to `"utf-8"`.
+#' @param idf_weights Only valid when `output_mode` is `"tf_idf"`.
+#' An array, or 1D tensor or the same length as the vocabulary,
+#' containing the floating point inverse document frequency weights, which
+#' will be multiplied by per sample term counts for the final `tf_idf`
+#' weight. If the `vocabulary` argument is set, and `output_mode` is
+#' `"tf_idf"`, this argument must be supplied.
 #'
-#' @param invert Only valid when `output_mode` is `"int"`. If TRUE, this layer will
+#' @param invert Only valid when `output_mode` is `"int"`. If `TRUE`, this layer will
 #' map indices to vocabulary items instead of mapping vocabulary items to
 #' indices. Default to `FALSE`.
 #'
-#' @param output_mode Specification for the output of the layer. Defaults to `"int"`.
-#' Values can be `"int"`, `"one_hot"`, `"multi_hot"`, `"count"`, or
-#' `"tf_idf"` configuring the layer as follows:
+#' @param output_mode Specification for the output of the layer. Defaults to
+#' `"int"`. Values can be `"int"`, `"one_hot"`, `"multi_hot"`, `"count"`,
+#' or `"tf_idf"` configuring the layer as follows:
 #' - `"int"`: Return the raw integer indices of the input tokens.
 #' - `"one_hot"`: Encodes each individual element in the input into an
 #' array the same size as the vocabulary, containing a 1 at the element
-#' index. If the last dimension is size 1, will encode on that dimension.
-#' If the last dimension is not size 1, will append a new dimension for
-#' the encoded output.
+#' index. If the last dimension is size 1, will encode on that
+#' dimension. If the last dimension is not size 1, will append a new
+#' dimension for the encoded output.
 #' - `"multi_hot"`: Encodes each sample in the input into a single array
 #' the same size as the vocabulary, containing a 1 for each vocabulary
 #' term present in the sample. Treats the last dimension as the sample
 #' dimension, if input shape is (..., sample_length), output shape will
 #' be (..., num_tokens).
-#' - `"count"`: As `"multi_hot"`, but the int array contains a count of the
-#' number of times the token at that index appeared in the sample.
+#' - `"count"`: As `"multi_hot"`, but the int array contains a count of
+#' the number of times the token at that index appeared in the sample.
 #' - `"tf_idf"`: As `"multi_hot"`, but the TF-IDF algorithm is applied to
 #' find the value in each token slot.
 #' For `"int"` output, any shape of input and output is supported. For all
@@ -1059,12 +1069,16 @@ function(object,
 #' `"count"`, or `"tf_idf"`. If TRUE, the output will have its feature axis
 #' padded to `max_tokens` even if the number of unique tokens in the
 #' vocabulary is less than max_tokens, resulting in a tensor of shape
-#' `[batch_size, max_tokens]` regardless of vocabulary size. Defaults to `FALSE`.
+#' [batch_size, max_tokens] regardless of vocabulary size. Defaults to
+#' FALSE.
 #'
 #' @param sparse Boolean. Only applicable when `output_mode` is `"multi_hot"`,
-#' `"count"`, or `"tf_idf"`. If TRUE, returns a `SparseTensor` instead of a
+#' `"count"`, or `"tf_idf"`. If `TRUE`, returns a `SparseTensor` instead of a
 #' dense `Tensor`. Defaults to `FALSE`.
 #'
+#' @param encoding Optional. The text encoding to use to interpret the input
+#' strings. Defaults to `"utf-8"`.
+#'
 #' @param ... standard layer arguments.
 #'
 #' @family categorical features preprocessing layers
@@ -1081,11 +1095,12 @@ function(object,
 max_tokens = NULL,
 num_oov_indices = 1L,
 mask_token = NULL,
-oov_token = '[UNK]',
+oov_token = "[UNK]",
 vocabulary = NULL,
-encoding = NULL,
+idf_weights = NULL,
+encoding = "utf-8",
 invert = FALSE,
-output_mode = 'int',
+output_mode = "int",
 sparse = FALSE,
 pad_to_max_tokens = FALSE,
 ...)
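
The documentation in this diff fixes the index layout for `output_mode = "int"`: the mask token (if set) comes first at index 0, then the OOV indices, then the rest of the vocabulary. A minimal base-R sketch of that layout follows, assuming a single OOV index (the default `num_oov_indices = 1L`) so that no hashing is needed. `sketch_int_lookup()` is a hypothetical helper for illustration, not the layer's implementation.

```r
# Hypothetical helper, not part of keras: reproduces the documented "int"
# index layout with a plain table lookup. Indices are 0-based, as in the
# layer's output. Assumes exactly one OOV index and no NA inputs.
sketch_int_lookup <- function(x, vocabulary, mask_token = NULL) {
  # The mask token, when set, occupies index 0 and shifts everything by one.
  offset <- if (is.null(mask_token)) 0L else 1L
  oov_index <- offset                 # single OOV slot follows the mask slot
  idx <- match(x, vocabulary)         # 1-based position, NA if not in vocabulary
  out <- ifelse(is.na(idx), oov_index, idx + offset)
  if (!is.null(mask_token)) out[x == mask_token] <- 0L
  as.integer(out)
}

no_mask   <- sketch_int_lookup(c("a", "c", "d", "z", "b"),
                               vocabulary = c("a", "b", "c", "d"))
with_mask <- sketch_int_lookup(c("", "a", "z"),
                               vocabulary = c("a", "b"),
                               mask_token = "")
print(no_mask)
print(with_mask)
```

Without a mask token, the unknown string `"z"` maps to index 0 and vocabulary terms start at 1. With a mask token, the mask takes index 0, OOV moves to 1, and vocabulary terms start at 2.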

man/layer_normalization.Rd

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered by default.

man/layer_string_lookup.Rd

Lines changed: 54 additions & 46 deletions
Generated file; diff not rendered by default.

tools/make-layer-wrapper.R

Lines changed: 3 additions & 1 deletion
@@ -168,4 +168,6 @@ print.r_py_wrapper2 <- function(x, ...) {
 # new_layer_wrapper(keras$layers$GaussianDropout) |> print()
 # new_layer_wrapper(keras$layers$GaussianNoise) |> print()
 # new_layer_wrapper(keras$layers$IntegerLookup) |> print()
-new_layer_wrapper(keras$layers$Normalization) |> print()
+# new_layer_wrapper(keras$layers$Normalization) |> print()
+new_layer_wrapper(keras$layers$StringLookup) |> print()
+