
Commit 0a21f12 (parent: f819f12)

Added the `memoize` argument to `chunk()`, which memoizes token counters by default to significantly improve performance.

File tree

3 files changed: +23 −3 lines changed

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -1,6 +1,13 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2023-11-07
+### Added
+- Added the `memoize` argument to `chunk()`, which memoizes token counters by default to significantly improve performance.
+
+### Changed
+- Improved chunking performance.
+
 ## [0.1.2] - 2023-11-07
 ### Fixed
 - Fixed links in the README.
@@ -18,6 +25,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[0.2.0]: https://github.com/umarbutler/semchunk/compare/v0.1.2...v0.2.0
 [0.1.2]: https://github.com/umarbutler/semchunk/compare/v0.1.1...v0.1.2
 [0.1.1]: https://github.com/umarbutler/semchunk/compare/v0.1.0...v0.1.1
 [0.1.0]: https://github.com/umarbutler/semchunk/releases/tag/v0.1.0

README.md

Lines changed: 3 additions & 0 deletions
@@ -30,6 +30,7 @@ def chunk(
     text: str,
     chunk_size: int,
     token_counter: callable,
+    memoize: bool=True
 ) -> list[str]
 ```
 
@@ -41,6 +42,8 @@ def chunk(
 
 `token_counter` is a callable that takes a string and returns the number of tokens in it.
 
+`memoize` flags whether to memoise the token counter. It defaults to `True`.
+
 This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed.
 
 ## How It Works 🔍
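The README above defines `token_counter` as any callable that maps a string to a token count. As a minimal sketch (not semchunk's own counter), a naive whitespace tokeniser is enough to satisfy that contract:

```python
# A stand-in token counter of the kind chunk() expects: any callable
# that takes a string and returns the number of tokens in it. A real
# counter backed by a tokeniser library would be swapped in here.
def word_token_counter(text: str) -> int:
    return len(text.split())

assert word_token_counter("The quick brown fox") == 4
```

With semchunk installed, such a counter would then be passed as, e.g., `chunk(text, chunk_size=4, token_counter=word_token_counter)`.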

src/semchunk/semchunk.py

Lines changed: 12 additions & 3 deletions
@@ -1,6 +1,10 @@
 import re
+from functools import cache
 
-NON_WHITESPACE_SEMANTIC_SPLITTERS = (
+_memoised_token_counters = {}
+"""A map of token counters to their memoised versions."""
+
+_NON_WHITESPACE_SEMANTIC_SPLITTERS = (
     '.', '?', '!', '*', # Sentence terminators.
     ';', ',', '(', ')', '[', ']', "“", "”", '‘', '’', "'", '"', '`', # Clause separators.
     ':', '—', '…', # Sentence interrupters.
@@ -29,7 +33,7 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
 
     else:
         # Identify the most desirable semantically meaningful non-whitespace splitter present in the text.
-        for splitter in NON_WHITESPACE_SEMANTIC_SPLITTERS:
+        for splitter in _NON_WHITESPACE_SEMANTIC_SPLITTERS:
             if splitter in text:
                 splitter_is_whitespace = False
                 break
@@ -41,16 +45,21 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
     # Return the splitter and the split text.
     return splitter, splitter_is_whitespace, text.split(splitter)
 
-def chunk(text: str, chunk_size: int, token_counter: callable, _recursion_depth: int = 0) -> list[str]:
+def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=True, _recursion_depth: int = 0) -> list[str]:
     """Split text into semantically meaningful chunks of a specified size as determined by the provided token counter.
     
     Args:
         text (str): The text to be chunked.
         chunk_size (int): The maximum number of tokens a chunk may contain.
         token_counter (callable): A callable that takes a string and returns the number of tokens in it.
+        memoize (bool, optional): Whether to memoise the token counter. Defaults to True.
     
     Returns:
         list[str]: A list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed."""
+    
+    # If this is not a recursive call and memoization is enabled, overwrite the `token_counter` with a memoised version of itself.
+    if not _recursion_depth and memoize:
+        token_counter = _memoised_token_counters.setdefault(token_counter, cache(token_counter))
 
     # Split the text using the most semantically meaningful splitter possible.
     splitter, splitter_is_whitespace, splits = _split_text(text)
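The memoisation added above pairs a module-level dict with `functools.cache`: each distinct token counter is wrapped exactly once, and the wrapped version is reused on later calls. A standalone sketch of that pattern, with a hypothetical `slow_counter` standing in for a real token counter:

```python
from functools import cache

# Map of token counters to their memoised versions, mirroring the
# `_memoised_token_counters` dict introduced in the diff above.
_memoised_token_counters = {}

call_count = [0]  # tracks how often the underlying counter actually runs

def slow_counter(text: str) -> int:  # hypothetical stand-in for a real token counter
    call_count[0] += 1
    return len(text.split())

# setdefault stores the cache-wrapped counter the first time it is seen,
# and returns the already-stored wrapper on every later lookup.
counter = _memoised_token_counters.setdefault(slow_counter, cache(slow_counter))
counter("hello world")
counter("hello world")  # repeated input is served from the cache
assert call_count[0] == 1

# Looking the same counter up again yields the same wrapped callable.
assert _memoised_token_counters.setdefault(slow_counter, cache(slow_counter)) is counter
```

Because chunking re-counts the tokens of overlapping substrings many times while it recursively splits and merges, caching these repeated counts is where the performance gain comes from.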
