
Commit 7a7b000

Introduced new cache_maxsize argument.
1 parent 542e508 commit 7a7b000

3 files changed (+18 lines, −3 lines)

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.1.0] - 2025-02-16
+### Added
+- Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
+
 ## [3.0.4] - 2025-02-14
 ### Fixed
 - Fixed bug where attempting to chunk only whitespace characters would raise `ValueError: not enough values to unpack (expected 2, got 0)` ([ScrapeGraphAI/Scrapegraph-ai#893](https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/893)).
@@ -125,6 +129,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[3.1.0]: https://github.com/isaacus-dev/semchunk/compare/v3.0.4...v3.1.0
 [3.0.4]: https://github.com/isaacus-dev/semchunk/compare/v3.0.3...v3.0.4
 [3.0.3]: https://github.com/isaacus-dev/semchunk/compare/v3.0.2...v3.0.3
 [3.0.2]: https://github.com/isaacus-dev/semchunk/compare/v3.0.1...v3.0.2
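
To see the new argument in use, here is a minimal sketch against `chunkerify()` as documented in the README changes below. The whitespace word counter is a hypothetical stand-in for a real tokenizer or token counter:

```python
import semchunk

# Hypothetical stand-in for a real tokenizer: counts whitespace-delimited words.
def word_counter(text: str) -> int:
    return len(text.split())

# Cap the memoization cache at 1,024 text-token count pairs rather than
# letting it grow without limit (the default, cache_maxsize=None).
chunker = semchunk.chunkerify(word_counter, chunk_size=128, cache_maxsize=1024)

chunks = chunker("The quick brown fox jumps over the lazy dog.")
```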

README.md

Lines changed: 6 additions & 0 deletions
@@ -71,6 +71,7 @@ def chunkerify(
     chunk_size: int = None,
     max_token_chars: int = None,
     memoize: bool = True,
+    cache_maxsize: int | None = None,
 ) -> Callable[[str | Sequence[str], bool, bool, bool, int | float | None], list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]]:
 ```
 
@@ -84,6 +85,8 @@ def chunkerify(
 
 `memoize` flags whether to memoize the token counter. It defaults to `True`.
 
+`cache_maxsize` is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
+
 This function returns a chunker that takes either a single text or a sequence of texts and returns, depending on whether multiple texts have been provided, a list or list of lists of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, and, if the optional `offsets` argument to the chunker is `True`, a list or lists of tuples of the form `(start, end)` where `start` is the index of the first character of a chunk in a text and `end` is the index of the character succeeding the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
 
 The resulting chunker can be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
@@ -103,6 +106,7 @@ def chunk(
     memoize: bool = True,
     offsets: bool = False,
     overlap: float | int | None = None,
+    cache_maxsize: int | None = None,
 ) -> list[str]
 ```
 
@@ -120,6 +124,8 @@ def chunk(
 
 `overlap` specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. It defaults to `None`, in which case no overlapping occurs.
 
+`cache_maxsize` is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
+
 This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, and, if `offsets` is `True`, a list of tuples of the form `(start, end)` where `start` is the index of the first character of the chunk in the original text and `end` is the index of the character after the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
 
 ## How It Works 🔍
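
The same bound is available when calling `chunk()` directly. A sketch under the signature shown above (`word_counter` is again a stand-in token counter), including a check of the documented offsets contract:

```python
import semchunk

def word_counter(text: str) -> int:
    return len(text.split())

text = "Chunking splits long documents into pieces that each fit a token budget."

# cache_maxsize only takes effect when memoize=True (the default).
chunks, offsets = semchunk.chunk(
    text,
    chunk_size=8,
    token_counter=word_counter,
    offsets=True,
    cache_maxsize=256,
)

# Documented contract: every chunk can be sliced back out of the original text.
for piece, (start, end) in zip(chunks, offsets):
    assert piece == text[start:end]
```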

src/semchunk/semchunk.py

Lines changed: 7 additions & 3 deletions
@@ -5,9 +5,9 @@
 import inspect
 
 from typing import Callable, Sequence, TYPE_CHECKING
-from functools import cache
 from itertools import accumulate
 from contextlib import suppress
+from functools import lru_cache
 
 import mpire
 
@@ -139,6 +139,7 @@ def chunk(
     memoize: bool = True,
     offsets: bool = False,
     overlap: float | int | None = None,
+    cache_maxsize: int | None = None,
     _recursion_depth: int = 0,
     _start: int = 0,
 ) -> list[str] | tuple[list[str], list[tuple[int, int]]]:
@@ -151,6 +152,7 @@ def chunk(
         memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
         offsets (bool, optional): Whether to return the start and end offsets of each chunk. Defaults to `False`.
         overlap (float | int | None, optional): The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. Defaults to `None`, in which case no overlapping occurs.
+        cache_maxsize (int | None, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
 
     Returns:
         list[str] | tuple[list[str], list[tuple[int, int]]]: A list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, and, if `offsets` is `True`, a list of tuples of the form `(start, end)` where `start` is the index of the first character of the chunk in the original text and `end` is the index of the character after the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`."""
@@ -162,7 +164,7 @@ def chunk(
     # If this is the first call, memoize the token counter if memoization is enabled and reduce the effective chunk size if overlapping chunks.
     if is_first_call := not _recursion_depth:
         if memoize:
-            token_counter = _memoized_token_counters.setdefault(token_counter, cache(token_counter))
+            token_counter = _memoized_token_counters.setdefault(token_counter, lru_cache(cache_maxsize)(token_counter))
 
         if overlap:
             # Make relative overlaps absolute and floor both relative and absolute overlaps to prevent ever having an overlap >= chunk_size.
@@ -377,6 +379,7 @@ def chunkerify(
     chunk_size: int | None = None,
     max_token_chars: int | None = None,
     memoize: bool = True,
+    cache_maxsize: int | None = None,
 ) -> Chunker:
     """Construct a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
 
@@ -385,6 +388,7 @@ def chunkerify(
         chunk_size (int, optional): The maximum number of tokens a chunk may contain. Defaults to `None`, in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (reduced by the number of tokens returned by attempting to tokenize an empty string) if possible; otherwise, a `ValueError` will be raised.
         max_token_chars (int, optional): The maximum number of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None`, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary, as determined by the `token_byte_values` or `get_vocab` methods.
         memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
+        cache_maxsize (int, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
 
     Returns:
         Callable[[str | Sequence[str], bool, bool, bool, int | float | None], list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]]: A chunker that takes either a single text or a sequence of texts and returns, depending on whether multiple texts have been provided, a list or list of lists of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, and, if the optional `offsets` argument to the chunker is `True`, a list or lists of tuples of the form `(start, end)` where `start` is the index of the first character of a chunk in a text and `end` is the index of the character succeeding the last character of the chunk such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
@@ -486,7 +490,7 @@ def faster_token_counter(text: str) -> int:
 
     # Memoize the token counter if necessary.
     if memoize:
-        token_counter = _memoized_token_counters.setdefault(token_counter, cache(token_counter))
+        token_counter = _memoized_token_counters.setdefault(token_counter, lru_cache(cache_maxsize)(token_counter))
 
     # Construct and return the chunker.
     return Chunker(chunk_size=chunk_size, token_counter=token_counter)
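
The core of the change is the swap from `functools.cache` to `functools.lru_cache`: `cache(f)` is equivalent to `lru_cache(maxsize=None)(f)`, so the default `cache_maxsize=None` preserves the old unbounded behaviour, while any integer turns the cache into a bounded LRU. A standalone sketch of the pattern, outside semchunk:

```python
from functools import lru_cache

def token_counter(text: str) -> int:
    print(f"counting: {text!r}")  # printed only on cache misses
    return len(text.split())

# lru_cache(None) would be unbounded, matching the old cache(token_counter);
# an integer maxsize evicts the least recently used entries beyond that bound.
bounded = lru_cache(2)(token_counter)

bounded("a b c")  # miss: counted and cached
bounded("a b c")  # hit: served from the cache
bounded("d e")    # miss
bounded("f g h")  # miss: evicts "a b c", the least recently used entry
print(bounded.cache_info())  # CacheInfo(hits=1, misses=3, maxsize=2, currsize=2)
```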
