README.md
@@ -57,7 +57,7 @@ def chunkerify(
`max_token_chars` is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None`, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary, as determined by the `token_byte_values` or `get_vocab` methods.
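One way such a cap can speed things up is sketched below. This is a hedged illustration, not semchunk's actual implementation: `MAX_TOKEN_CHARS` and `exceeds_chunk_size` are hypothetical names. The key observation is that if no token spans more than `MAX_TOKEN_CHARS` characters, a text of length `L` must tokenize to at least `ceil(L / MAX_TOKEN_CHARS)` tokens, so very long inputs can be flagged as over `chunk_size` without running the token counter at all.

```python
import math

MAX_TOKEN_CHARS = 10  # assumed length of the longest token in the vocabulary

def exceeds_chunk_size(text: str, chunk_size: int, count_tokens) -> bool:
    # Lower bound: every token covers at most MAX_TOKEN_CHARS characters,
    # so the text contains at least ceil(len(text) / MAX_TOKEN_CHARS) tokens.
    if math.ceil(len(text) / MAX_TOKEN_CHARS) > chunk_size:
        return True  # provably over the limit; skip tokenization entirely
    # Otherwise, fall back to actually counting tokens.
    return count_tokens(text) > chunk_size
```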
-`memoize` flags whether to memoise the token counter. It defaults to `True`.
+`memoize` flags whether to memoize the token counter. It defaults to `True`.
This function returns a callable that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
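That contract can be illustrated with a simplified sketch. This is not semchunk's implementation: `make_chunker` is a hypothetical name, and the greedy word-by-word splitting strategy is an assumption made purely to show the single-text versus sequence-of-texts return shapes.

```python
def make_chunker(count_tokens, chunk_size: int):
    """Build a chunker from a token counter and a maximum chunk size."""
    def chunk_one(text: str) -> list[str]:
        # Greedily pack whitespace-split words into chunks of at most
        # chunk_size tokens, dropping the whitespace used to split.
        chunks: list[str] = []
        current: list[str] = []
        for word in text.split():
            candidate = " ".join(current + [word])
            if current and count_tokens(candidate) > chunk_size:
                chunks.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        if current:
            chunks.append(" ".join(current))
        return chunks

    def chunker(text_or_texts):
        # A single text yields a list of chunks; a sequence of texts
        # yields a list of lists of chunks, one inner list per text.
        if isinstance(text_or_texts, str):
            return chunk_one(text_or_texts)
        return [chunk_one(text) for text in text_or_texts]

    return chunker
```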
@@ -79,7 +79,7 @@ def chunk(
`token_counter` is a callable that takes a string and returns the number of tokens in it.
-`memoize` flags whether to memoise the token counter. It defaults to `True`.
+`memoize` flags whether to memoize the token counter. It defaults to `True`.
This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed.
-# Memoise the `chunk` function, preserving its signature and docstring.
+# Memoize the `chunk` function, preserving its signature and docstring.
chunk = wraps(chunk)(cache(chunk))
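The pattern on that line can be seen in isolation with a self-contained sketch using only the standard library: `cache` memoizes repeated calls, and `wraps` copies the original function's signature and docstring onto the cached wrapper.

```python
from functools import cache, wraps

def chunk(text: str) -> list[str]:
    """Split `text` into whitespace-delimited pieces."""
    return text.split()

# Memoize `chunk` while preserving its signature and docstring:
# `cache(chunk)` adds memoization, and `wraps(chunk)(...)` copies
# `chunk`'s metadata (__name__, __doc__, etc.) onto the wrapper.
chunk = wraps(chunk)(cache(chunk))
```

Because the wrapper is cached, calling it twice with the same argument returns the very same object rather than recomputing it.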
def chunkerify(
@@ -155,7 +155,7 @@ def chunkerify(
tokenizer_or_token_counter (str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int]): Either: the name of a `tiktoken` or `transformers` tokenizer (with priority given to the former); a tokenizer that possesses an `encode` attribute (e.g., a `tiktoken`, `transformers` or `tokenizers` tokenizer); or a token counter that returns the number of tokens in an input.
chunk_size (int, optional): The maximum number of tokens a chunk may contain. Defaults to `None`, in which case it will, if possible, be set to the value of the tokenizer's `model_max_length` attribute (less the number of tokens returned by attempting to tokenize an empty string); otherwise, a `ValueError` will be raised.
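The default described there amounts to a small calculation, sketched below under stated assumptions: `default_chunk_size` is a hypothetical helper, and the empty-string tokenization stands in for whatever special tokens (e.g. BOS/EOS) the tokenizer adds to every encoding.

```python
def default_chunk_size(model_max_length: int, encode) -> int:
    # Tokenizing an empty string reveals the per-call overhead of
    # special tokens, which must be subtracted from the model's limit.
    overhead = len(encode(""))
    return model_max_length - overhead
```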
max_token_chars (int, optional): The maximum number of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None`, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary, as determined by the `token_byte_values` or `get_vocab` methods.
-memoize (bool, optional): Whether to memoise the token counter. Defaults to `True`.
+memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
Returns:
Callable[[str | Sequence[str]], list[str] | list[list[str]]]: A function that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts."""