Commit 5491fdd

Memoized chunk().

1 parent: 70ee509

File tree

3 files changed: +8, -1 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions

@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+### Added
+- Memoized `chunk()`.
+
 ## [0.2.0] - 2023-11-07
 ### Added
 - Added the `memoize` argument to `chunk()`, which memoizes token counters by default to significantly improve performance.
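In practice, the memoized `chunk()` is called exactly as before. A minimal usage sketch, assuming the package's public `semchunk.chunk` entry point and a deliberately simple whitespace token counter (a stand-in for a real tokenizer such as a `tiktoken` encoder):

```python
import semchunk

# A stand-in token counter for illustration; any callable mapping a string
# to a token count works. Functions are hashable, so they remain valid
# arguments for the functools.cache now applied to chunk().
def token_counter(text: str) -> int:
    return len(text.split())

# memoize=True is the default: the token counter's results are cached, and,
# as of this commit, repeated calls to chunk() with identical arguments
# return the cached result directly.
chunks = semchunk.chunk("The quick brown fox jumps over the lazy dog.",
                        chunk_size=4, token_counter=token_counter)
```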

README.md

Lines changed: 2 additions & 0 deletions

@@ -63,6 +63,8 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
 1. All other characters.
 
+`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
+
 ## Benchmarks 📊
 On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 25.29 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 51.65 seconds to chunk the same texts into 512-token-long chunks — a difference of 77.35%.
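To make the sentence added above concrete, here is a minimal sketch of the memoization pattern involved, using `functools.cache` and a hypothetical `count_tokens` function (not semchunk's actual internals):

```python
from functools import cache

def count_tokens(text: str) -> int:
    # Stand-in for an expensive tokenizer call.
    return len(text.split())

# Wrapping the counter caches its results, so recounting the same string
# is a dictionary lookup rather than a recomputation.
memoized_count_tokens = cache(count_tokens)

memoized_count_tokens("some text")  # Computed and cached.
memoized_count_tokens("some text")  # Served from the cache.
```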

src/semchunk/semchunk.py

Lines changed: 2 additions & 1 deletion

@@ -11,7 +11,7 @@
     '/', '\\', '–', '&', '-', # Word joiners.
 )
 """A tuple of semantically meaningful non-whitespace splitters that may be used to chunk texts, ordered from most desirable to least desirable."""
-
+
 def _split_text(text: str) -> tuple[str, bool, list[str]]:
     """Split text using the most semantically meaningful splitter possible."""
 
@@ -45,6 +45,7 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
     # Return the splitter and the split text.
     return splitter, splitter_is_whitespace, text.split(splitter)
 
+@cache
 def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=True, _recursion_depth: int = 0) -> list[str]:
    """Split text into semantically meaningful chunks of a specified size as determined by the provided token counter.
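The `@cache` decorator applied to `chunk()` above is `functools.cache`. A self-contained sketch of the same pattern (the `slow_chunk` function below is hypothetical and only illustrates the caching behaviour, not semchunk's actual chunking logic):

```python
from functools import cache

@cache
def slow_chunk(text: str, chunk_size: int, token_counter: callable) -> tuple[str, ...]:
    # Pretend this is expensive. With @cache, a repeated call with the same
    # arguments skips the body and returns the stored result. All arguments
    # must be hashable; strings, ints and functions all are. token_counter
    # is unused here and only mirrors chunk()'s signature, to show that a
    # callable argument is a valid part of the cache key.
    words = text.split()
    return tuple(' '.join(words[i:i + chunk_size])
                 for i in range(0, len(words), chunk_size))

slow_chunk("a b c d e f", 2, len)  # Computed and cached.
slow_chunk("a b c d e f", 2, len)  # Cache hit: identical arguments.
```

One consequence worth noting: `functools.cache` stores and returns the same object on every hit, so a caller that mutated a cached list would also mutate the cache; that is why the sketch returns an immutable tuple, whereas the real `chunk()` returns a `list[str]`.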
