CHANGELOG.md (5 additions, 0 deletions)
@@ -1,6 +1,10 @@
## Changelog 🔄

All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [3.1.0] - 2025-02-16

### Added

- Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
## [3.0.4] - 2025-02-14
### Fixed
- Fixed bug where attempting to chunk only whitespace characters would raise `ValueError: not enough values to unpack (expected 2, got 0)` ([ScrapeGraphAI/Scrapegraph-ai#893](https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/893)).
@@ -125,6 +129,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
### Added
- Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
`memoize` flags whether to memoize the token counter. It defaults to `True`.
`cache_maxsize` is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
This function returns a chunker that takes either a single text or a sequence of texts and returns, depending on whether multiple texts have been provided, a list or list of lists of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, and, if the optional `offsets` argument to the chunker is `True`, a list or lists of tuples of the form `(start, end)` where `start` is the index of the first character of a chunk in a text and `end` is the index of the character succeeding the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
The resulting chunker can be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
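The interaction between `memoize` and `cache_maxsize` can be pictured with a plain `functools.lru_cache` wrapper. This is a hypothetical sketch of the caching behaviour described above, not semchunk's actual internals; `memoize_token_counter` and the whitespace-splitting counter are stand-ins for illustration:

```python
from functools import lru_cache

def memoize_token_counter(token_counter, cache_maxsize=None):
    # lru_cache(maxsize=None) gives an unbounded cache, mirroring the
    # default; an integer bounds the number of cached text-token pairs.
    return lru_cache(maxsize=cache_maxsize)(token_counter)

# Stand-in token counter that counts whitespace-separated words,
# memoized with at most 128 cached text-token count pairs.
counter = memoize_token_counter(lambda text: len(text.split()), cache_maxsize=128)
```

Repeated calls with the same text then hit the cache instead of re-counting, which is what makes memoization worthwhile for expensive tokenizers.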
@@ -103,6 +106,7 @@ def chunk(
memoize: bool=True,
offsets: bool=False,
overlap: float|int|None=None,
cache_maxsize: int|None=None,
) -> list[str]
```
@@ -120,6 +124,8 @@ def chunk(
`overlap` specifies the proportion of the chunk size, or, if >= 1, the number of tokens, by which chunks should overlap. It defaults to `None`, in which case no overlapping occurs.

`cache_maxsize` is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.

This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, and, if `offsets` is `True`, a list of tuples of the form `(start, end)` where `start` is the index of the first character of the chunk in the original text and `end` is the index of the character after the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
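The offsets invariant can be demonstrated with a minimal word-level chunker. This is an illustrative toy under the stated invariant, not semchunk's algorithm (semchunk splits hierarchically, not just on whitespace); `toy_chunk` is a hypothetical name:

```python
import re

def toy_chunk(text, chunk_size, token_counter, offsets=False):
    # Greedily group whitespace-separated words into chunks of at most
    # chunk_size tokens, tracking character offsets into the original text.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    spans, current = [], []
    for start, end in words:
        # If adding this word would exceed chunk_size tokens, close the chunk.
        if current and token_counter(text[current[0][0]:end]) > chunk_size:
            spans.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        spans.append((current[0][0], current[-1][1]))
    # Each chunk is exactly the slice named by its (start, end) offsets,
    # so chunks[i] == text[spans[i][0]:spans[i][1]] holds by construction.
    chunks = [text[s:e] for s, e in spans]
    return (chunks, spans) if offsets else chunks
```

The point of the sketch is that splitting whitespace falls *between* spans, so slicing the original text by each `(start, end)` pair reproduces the chunk verbatim.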
memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
offsets (bool, optional): Whether to return the start and end offsets of each chunk. Defaults to `False`.
overlap (float | int | None, optional): The proportion of the chunk size, or, if >= 1, the number of tokens, by which chunks should overlap. Defaults to `None`, in which case no overlapping occurs.
cache_maxsize (int | None, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.

Returns:
list[str] | tuple[list[str], list[tuple[int, int]]]: A list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, and, if `offsets` is `True`, a list of tuples of the form `(start, end)` where `start` is the index of the first character of the chunk in the original text and `end` is the index of the character after the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`."""
@@ -162,7 +164,7 @@ def chunk(
# If this is the first call, memoize the token counter if memoization is enabled and reduce the effective chunk size if overlapping chunks.
# Make relative overlaps absolute and floor both relative and absolute overlaps to prevent ever having an overlap >= chunk_size.
@@ -377,6 +379,7 @@ def chunkerify(
chunk_size: int|None=None,
max_token_chars: int|None=None,
memoize: bool=True,
cache_maxsize: int|None=None,
) -> Chunker:
"""Construct a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
@@ -385,6 +388,7 @@ def chunkerify(
chunk_size (int, optional): The maximum number of tokens a chunk may contain. Defaults to `None`, in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (less the number of tokens returned by tokenizing an empty string) if possible; otherwise, a `ValueError` will be raised.
max_token_chars (int, optional): The maximum number of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None`, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
cache_maxsize (int, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.

Returns:
Callable[[str | Sequence[str], bool, bool, bool, int | float | None], list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]]: A chunker that takes either a single text or a sequence of texts and returns, depending on whether multiple texts have been provided, a list or list of lists of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, and, if the optional `offsets` argument to the chunker is `True`, a list or lists of tuples of the form `(start, end)` where `start` is the index of the first character of a chunk in a text and `end` is the index of the character succeeding the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
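The single-text versus sequence-of-texts dispatch in this return type can be sketched as a factory. This is a hypothetical illustration of the described behaviour, not semchunk's code; `make_chunker` and `chunk_fn` are stand-in names:

```python
from typing import Callable, Sequence, Union

def make_chunker(chunk_fn: Callable[[str], list[str]]):
    # A single string in yields a list of chunks; a sequence of strings
    # yields a list of lists of chunks, one inner list per input text.
    def chunker(text_or_texts: Union[str, Sequence[str]]):
        if isinstance(text_or_texts, str):
            return chunk_fn(text_or_texts)
        return [chunk_fn(t) for t in text_or_texts]
    return chunker
```

Checking `isinstance(..., str)` first matters because a `str` is itself a `Sequence`, so testing for a sequence first would wrongly chunk a single text character by character.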