Commit d97d006

Ceased memoizing chunk() (but not token counters).
1 parent 8ed33e3 commit d97d006

3 files changed: +7 -5 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.1.0] - 2024-06-20
+### Fixed
+- Ceased memoizing `chunk()` (but not token counters) due to the fact that cached outputs of memoized functions are shallow rather than deep copies of original outputs, meaning that if one were to chunk a text and then chunk that same text again and then modify one of the chunks outputted by the first call, the chunks outputted by the second call would also be modified. This behaviour is not expected and therefore undesirable. The memoization of token counters is not impacted as they output immutable objects, namely, integers.
+
 ## [2.0.0] - 2024-06-19
 ### Added
 - Added support for multiprocessing through the `processes` argument passable to chunkers constructed by `chunkerify()`.
@@ -71,6 +75,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
 [2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
 [1.0.1]: https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1
 [1.0.0]: https://github.com/umarbutler/semchunk/compare/v0.3.2...v1.0.0
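The aliasing problem described in the 2.1.0 changelog entry can be reproduced with a minimal sketch. `split_words` below is a hypothetical stand-in for a memoized chunker, not `semchunk`'s actual `chunk()`. With `functools.cache`, repeated calls with the same argument hand back the very same list object, so a mutation through one result is visible through the other.

```python
from functools import cache


@cache
def split_words(text: str) -> list[str]:
    """Hypothetical stand-in for a memoized chunker: returns a mutable list."""
    return text.split()


first = split_words("alpha beta gamma")
second = split_words("alpha beta gamma")

# Both names refer to the single list object stored in the cache.
shares_object = first is second

# Mutating the first result also changes what the second call returned.
first.append("delta")
second_after_mutation = list(second)
```

Here `shares_object` is `True` and `second_after_mutation` contains the appended `"delta"`, which is the unexpected behaviour the entry refers to.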

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "2.0.0"
+version = "2.1.0"
 authors = [
     {name="Umar Butler", email="[email protected]"},
 ]

src/semchunk/semchunk.py

Lines changed: 1 addition & 4 deletions
@@ -5,7 +5,7 @@
 
 from bisect import bisect_left
 from typing import Callable, Sequence, TYPE_CHECKING
-from functools import cache, wraps
+from functools import cache
 from itertools import accumulate
 from contextlib import suppress
 
@@ -151,9 +151,6 @@ def chunk(
 
     return chunks
 
-# Memoize the `chunk` function, preserving its signature and docstring.
-chunk = wraps(chunk)(cache(chunk))
-
 def chunkerify(
     tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer \
     | tokenizers.Tokenizer | Callable[[str], int],
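The diff keeps the `cache` import, which is consistent with the changelog's statement that token counters remain memoized. A minimal sketch of why that is safe, assuming a hypothetical whitespace-based `count_tokens` (not one of `semchunk`'s real counters): the cached value is an `int`, which cannot be mutated by the caller.

```python
from functools import cache


def count_tokens(text: str) -> int:
    """Hypothetical token counter: counts whitespace-separated words."""
    return len(text.split())


# Wrap the counter so repeated inputs are looked up rather than recounted.
cached_count_tokens = cache(count_tokens)

n_first = cached_count_tokens("the quick brown fox")
n_second = cached_count_tokens("the quick brown fox")

# `cache_info()` reports one miss (first call) and one hit (second call).
info = cached_count_tokens.cache_info()
```

Both calls return `4`, and the second is served from the cache without any risk of one caller corrupting another's result.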
