@@ -977,14 +977,16 @@ function(object,
977977# '
978978# ' @details
979979# ' This layer translates a set of arbitrary strings into integer output via a
980- # ' table-based vocabulary lookup.
980+ # ' table-based vocabulary lookup. This layer will perform no splitting or
981+ # ' transformation of input strings. For a layer that can split and tokenize
982+ # ' natural language, see the `layer_text_vectorization()` layer.
981983# '
982984# ' The vocabulary for the layer must be either supplied on construction or
983985# ' learned via `adapt()`. During `adapt()`, the layer will analyze a data set,
984- # ' determine the frequency of individual strings tokens, and create a vocabulary
985- # ' from them. If the vocabulary is capped in size, the most frequent tokens will
986- # ' be used to create the vocabulary and all others will be treated as
987- # ' out-of-vocabulary (OOV).
986+ # ' determine the frequency of individual string tokens, and create a
987+ # ' vocabulary from them. If the vocabulary is capped in size, the most frequent
988+ # ' tokens will be used to create the vocabulary and all others will be treated
989+ # ' as out-of-vocabulary (OOV).
988990# '
989991# ' There are two possible output modes for the layer.
990992# ' When `output_mode` is `"int"`,
@@ -996,60 +998,68 @@ function(object,
996998# ' The vocabulary can optionally contain a mask token as well as an OOV token
997999# ' (which can optionally occupy multiple indices in the vocabulary, as set
9981000# ' by `num_oov_indices`).
999- # ' The position of these tokens in the vocabulary is fixed. When `output_mode` is
1000- # ' `"int"`, the vocabulary will begin with the mask token (if set), followed by
1001- # ' OOV indices, followed by the rest of the vocabulary. When `output_mode` is
1002- # ' `"multi_hot"`, `"count"`, or `"tf_idf"` the vocabulary will begin with OOV
1003- # ' indices and instances of the mask token will be dropped.
1001+ # ' The position of these tokens in the vocabulary is fixed. When `output_mode`
1002+ # ' is `"int"`, the vocabulary will begin with the mask token (if set), followed
1003+ # ' by OOV indices, followed by the rest of the vocabulary. When `output_mode`
1004+ # ' is `"multi_hot"`, `"count"`, or `"tf_idf"` the vocabulary will begin with
1005+ # ' OOV indices and instances of the mask token will be dropped.
10041006# '
1005- # ' @inheritParams layer_dense
1007+ # ' For an overview and full list of preprocessing layers, see the preprocessing
1008+ # ' [guide](https://www.tensorflow.org/guide/keras/preprocessing_layers).
10061009# '
1007- # ' @param max_tokens The maximum size of the vocabulary for this layer. If `NULL`,
1008- # ' there is no cap on the size of the vocabulary. Note that this size
1009- # ' includes the OOV and mask tokens. Default to `NULL.`
1010+ # ' @param max_tokens Maximum size of the vocabulary for this layer. This should
1011+ # ' only be specified when adapting the vocabulary or when setting
1012+ # ' `pad_to_max_tokens = TRUE`. If `NULL`, there is no cap on the size of the
1013+ # ' vocabulary. Note that this size includes the OOV and mask tokens.
1014+ # ' Defaults to `NULL`.
10101015# '
10111016# ' @param num_oov_indices The number of out-of-vocabulary tokens to use. If this
1012- # ' value is more than 1, OOV inputs are hashed to determine their OOV value.
1013- # ' If this value is 0, OOV inputs will cause an error when calling the layer.
1014- # ' Defaults to 1.
1017+ # ' value is more than 1, OOV inputs are hashed to determine their OOV
1018+ # ' value. If this value is 0, OOV inputs will cause an error when calling
1019+ # ' the layer. Defaults to 1.
10151020# '
10161021# ' @param mask_token A token that represents masked inputs. When `output_mode` is
10171022# ' `"int"`, the token is included in vocabulary and mapped to index 0. In
10181023# ' other output modes, the token will not appear in the vocabulary and
1019- # ' instances of the mask token in the input will be dropped. If set to `NULL`,
1020- # ' no mask term will be added. Defaults to `NULL`.
1024+ # ' instances of the mask token in the input will be dropped. If set to
1025+ # ' `NULL`, no mask term will be added. Defaults to `NULL`.
10211026# '
1022- # ' @param oov_token Only used when `invert` is TRUE. The token to return for OOV
1027+ # ' @param oov_token Only used when `invert` is `TRUE`. The token to return for OOV
10231028# ' indices. Defaults to `"[UNK]"`.
10241029# '
1025- # ' @param vocabulary Optional. Either an array of strings or a string path to a text
1026- # ' file. If passing an array, can pass a list, list, 1D numpy array, or 1D
1027- # ' tensor containing the string vocabulary terms. If passing a file path, the
1028- # ' file should contain one line per term in the vocabulary. If this argument
1029- # ' is set, there is no need to `adapt` the layer.
1030+ # ' @param vocabulary Optional. Either an array of strings or a string path to a
1031+ # ' text file. If passing an array, you can pass a character vector or
1032+ # ' 1D tensor containing the string vocabulary terms. If passing a file
1033+ # ' path, the file should contain one line per term in the vocabulary. If
1034+ # ' this argument is set, there is no need to `adapt()` the layer.
10301035# '
1031- # ' @param encoding String encoding. Default of `NULL` is equivalent to `"utf-8"`.
1036+ # ' @param idf_weights Only valid when `output_mode` is `"tf_idf"`.
1037+ # ' An array or 1D tensor of the same length as the vocabulary,
1038+ # ' containing the floating point inverse document frequency weights, which
1039+ # ' will be multiplied by per-sample term counts for the final `tf_idf`
1040+ # ' weight. If the `vocabulary` argument is set, and `output_mode` is
1041+ # ' `"tf_idf"`, this argument must be supplied.
10321042# '
1033- # ' @param invert Only valid when `output_mode` is `"int"`. If TRUE, this layer will
1043+ # ' @param invert Only valid when `output_mode` is `"int"`. If `TRUE`, this layer will
10341044# ' map indices to vocabulary items instead of mapping vocabulary items to
10351045# ' indices. Defaults to `FALSE`.
10361046# '
1037- # ' @param output_mode Specification for the output of the layer. Defaults to `"int"`.
1038- # ' Values can be `"int"`, `"one_hot"`, `"multi_hot"`, `"count"`, or
1039- # ' `"tf_idf"` configuring the layer as follows:
1047+ # ' @param output_mode Specification for the output of the layer. Defaults to
1048+ # ' `"int"`. Values can be `"int"`, `"one_hot"`, `"multi_hot"`, `"count"`,
1049+ # ' or `"tf_idf"` configuring the layer as follows:
10401050# ' - `"int"`: Return the raw integer indices of the input tokens.
10411051# ' - `"one_hot"`: Encodes each individual element in the input into an
10421052# ' array the same size as the vocabulary, containing a 1 at the element
1043- # ' index. If the last dimension is size 1, will encode on that dimension.
1044- # ' If the last dimension is not size 1, will append a new dimension for
1045- # ' the encoded output.
1053+ # ' index. If the last dimension is size 1, will encode on that
1054+ # ' dimension. If the last dimension is not size 1, will append a new
1055+ # ' dimension for the encoded output.
10461056# ' - `"multi_hot"`: Encodes each sample in the input into a single array
10471057# ' the same size as the vocabulary, containing a 1 for each vocabulary
10481058# ' term present in the sample. Treats the last dimension as the sample
10491059# ' dimension, if input shape is (..., sample_length), output shape will
10501060# ' be (..., num_tokens).
1051- # ' - `"count"`: As `"multi_hot"`, but the int array contains a count of the
1052- # ' number of times the token at that index appeared in the sample.
1061+ # ' - `"count"`: As `"multi_hot"`, but the int array contains a count of
1062+ # ' the number of times the token at that index appeared in the sample.
10531063# ' - `"tf_idf"`: As `"multi_hot"`, but the TF-IDF algorithm is applied to
10541064# ' find the value in each token slot.
10551065# ' For `"int"` output, any shape of input and output is supported. For all
@@ -1059,12 +1069,16 @@ function(object,
10591069# ' `"count"`, or `"tf_idf"`. If TRUE, the output will have its feature axis
10601070# ' padded to `max_tokens` even if the number of unique tokens in the
10611071# ' vocabulary is less than max_tokens, resulting in a tensor of shape
1062- # ' `[batch_size, max_tokens]` regardless of vocabulary size. Defaults to `FALSE`.
1072+ # ' [batch_size, max_tokens] regardless of vocabulary size. Defaults to
1073+ # ' FALSE.
10631074# '
10641075# ' @param sparse Boolean. Only applicable when `output_mode` is `"multi_hot"`,
1065- # ' `"count"`, or `"tf_idf"`. If TRUE, returns a `SparseTensor` instead of a
1076+ # ' `"count"`, or `"tf_idf"`. If `TRUE`, returns a `SparseTensor` instead of a
10661077# ' dense `Tensor`. Defaults to `FALSE`.
10671078# '
1079+ # ' @param encoding Optional. The text encoding to use to interpret the input
1080+ # ' strings. Defaults to `"utf-8"`.
1081+ # '
10681082# ' @param ... standard layer arguments.
10691083# '
10701084# ' @family categorical features preprocessing layers
@@ -1081,11 +1095,12 @@ function(object,
10811095 max_tokens = NULL,
10821096 num_oov_indices = 1L,
10831097 mask_token = NULL,
1084- oov_token = '[UNK]',
1098+ oov_token = "[UNK]",
10851099 vocabulary = NULL,
1086- encoding = NULL,
1100+ idf_weights = NULL,
1101+ encoding = "utf-8",
10871102 invert = FALSE,
1088- output_mode = 'int',
1103+ output_mode = "int",
10891104 sparse = FALSE,
10901105 pad_to_max_tokens = FALSE,
10911106 ...)