Commit 5491fdd

Memoized chunk().

1 parent: 70ee509

File tree

3 files changed: +8, -1 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions

@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+### Added
+- Memoized `chunk()`.
+
 ## [0.2.0] - 2023-11-07
 ### Added
 - Added the `memoize` argument to `chunk()`, which memoizes token counters by default to significantly improve performance.
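In practice, the memoized `chunk()` is called exactly as before. A minimal usage sketch, assuming the package's public `semchunk.chunk` entry point and a deliberately simple whitespace token counter (a stand-in for a real tokenizer such as a `tiktoken` encoder):

```python
import semchunk

# A stand-in token counter for illustration; any callable mapping a string
# to a token count works. Functions are hashable, so they remain valid
# arguments for the functools.cache now applied to chunk().
def token_counter(text: str) -> int:
    return len(text.split())

# memoize=True is the default: the token counter's results are cached, and,
# as of this commit, repeated calls to chunk() with identical arguments
# return the cached result directly.
chunks = semchunk.chunk("The quick brown fox jumps over the lazy dog.",
                        chunk_size=4, token_counter=token_counter)
```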

README.md

Lines changed: 2 additions & 0 deletions

@@ -63,6 +63,8 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
 1. All other characters.
 
+`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
+
 ## Benchmarks 📊
 On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 25.29 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 51.65 seconds to chunk the same texts into 512-token-long chunks — a difference of 77.35%.
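To make the sentence added above concrete, here is a minimal sketch of the memoization pattern involved, using `functools.cache` and a hypothetical `count_tokens` function (not semchunk's actual internals):

```python
from functools import cache

def count_tokens(text: str) -> int:
    # Stand-in for an expensive tokenizer call.
    return len(text.split())

# Wrapping the counter caches its results, so recounting the same string
# is a dictionary lookup rather than a recomputation.
memoized_count_tokens = cache(count_tokens)

memoized_count_tokens("some text")  # Computed and cached.
memoized_count_tokens("some text")  # Served from the cache.
```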

src/semchunk/semchunk.py

Lines changed: 2 additions & 1 deletion

@@ -11,7 +11,7 @@
     '/', '\\', '–', '&', '-', # Word joiners.
 )
 """A tuple of semantically meaningful non-whitespace splitters that may be used to chunk texts, ordered from most desirable to least desirable."""
-
+
 def _split_text(text: str) -> tuple[str, bool, list[str]]:
     """Split text using the most semantically meaningful splitter possible."""
 
@@ -45,6 +45,7 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
     # Return the splitter and the split text.
     return splitter, splitter_is_whitespace, text.split(splitter)
 
+@cache
 def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=True, _recursion_depth: int = 0) -> list[str]:
    """Split text into semantically meaningful chunks of a specified size as determined by the provided token counter.
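The `@cache` decorator applied to `chunk()` above is `functools.cache`. A self-contained sketch of the same pattern (the `slow_chunk` function below is hypothetical and only illustrates the caching behaviour, not semchunk's actual chunking logic):

```python
from functools import cache

@cache
def slow_chunk(text: str, chunk_size: int, token_counter: callable) -> tuple[str, ...]:
    # Pretend this is expensive. With @cache, a repeated call with the same
    # arguments skips the body and returns the stored result. All arguments
    # must be hashable; strings, ints and functions all are. token_counter
    # is unused here and only mirrors chunk()'s signature, to show that a
    # callable argument is a valid part of the cache key.
    words = text.split()
    return tuple(' '.join(words[i:i + chunk_size])
                 for i in range(0, len(words), chunk_size))

slow_chunk("a b c d e f", 2, len)  # Computed and cached.
slow_chunk("a b c d e f", 2, len)  # Cache hit: identical arguments.
```

One consequence worth noting: `functools.cache` stores and returns the same object on every hit, so a caller that mutated a cached list would also mutate the cache; that is why the sketch returns an immutable tuple, whereas the real `chunk()` returns a `list[str]`.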
