Commit da0b25a

Ensured that the memoize argument is passed back to chunk() in recursive calls.

1 parent 947bbad · commit da0b25a

File tree

5 files changed: +9 −4 lines changed


CHANGELOG.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.2] - 2024-02-05
+### Fixed
+- Ensured that the `memoize` argument is passed back to `chunk()` in recursive calls.
+
 ## [0.2.1] - 2023-11-09
 ### Added
 - Memoized `chunk()`.
@@ -32,6 +36,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[0.2.2]: https://github.com/umarbutler/semchunk/compare/v0.2.1...v0.2.2
 [0.2.1]: https://github.com/umarbutler/semchunk/compare/v0.2.0...v0.2.1
 [0.2.0]: https://github.com/umarbutler/semchunk/compare/v0.1.2...v0.2.0
 [0.1.2]: https://github.com/umarbutler/semchunk/compare/v0.1.1...v0.1.2
```

LICENCE

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-Copyright (c) 2023 Umar Butler
+Copyright (c) 2024 Umar Butler
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
```

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -66,7 +66,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 `semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
 
 ## Benchmarks 📊
-On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 25.29 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 51.65 seconds to chunk the same texts into 512-token-long chunks — a difference of 77.35%.
+On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 24.41 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks — a difference of 77.35%.
 
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
```

pyproject.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "0.2.1"
+version = "0.2.2"
 authors = [
     {name="Umar Butler", email="[email protected]"},
 ]
```

src/semchunk/semchunk.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -77,7 +77,7 @@ def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=Tru
 
         # If the split is over the chunk size, recursively chunk it.
         if token_counter(split) > chunk_size:
-            chunks.extend(chunk(split, chunk_size, token_counter=token_counter, _recursion_depth=_recursion_depth+1))
+            chunks.extend(chunk(split, chunk_size, token_counter=token_counter, memoize=memoize, _recursion_depth=_recursion_depth+1))
 
         # If the split is equal to or under the chunk size, merge it with all subsequent splits until the chunk size is reached.
         else:
```
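The one-line fix above matters because keyword arguments with defaults silently reset in recursive calls: before this commit, a caller passing `memoize=False` would get memoization anyway on every nested call, since the recursive invocation fell back to the default `memoize=True`. A minimal sketch of the pattern (hypothetical names such as `chunk_sketch` and `_count_tokens`, not semchunk's actual internals):

```python
import functools

def _count_tokens(text: str) -> int:
    # Stand-in token counter: whitespace word count.
    return len(text.split())

# Memoized variant of the counter.
_memoized_count = functools.lru_cache(maxsize=None)(_count_tokens)

def chunk_sketch(text: str, chunk_size: int, memoize: bool = True, _depth: int = 0) -> list[str]:
    """Recursively halve text until each piece fits within chunk_size tokens."""
    counter = _memoized_count if memoize else _count_tokens
    if counter(text) <= chunk_size:
        return [text]
    words = text.split()
    mid = len(words) // 2
    left, right = " ".join(words[:mid]), " ".join(words[mid:])
    # The class of bug this commit fixes: omitting memoize=memoize here
    # would reset it to the default (True) on every recursive call.
    return (chunk_sketch(left, chunk_size, memoize=memoize, _depth=_depth + 1)
            + chunk_sketch(right, chunk_size, memoize=memoize, _depth=_depth + 1))

print(chunk_sketch("one two three four five six seven eight", 2))
# → ['one two', 'three four', 'five six', 'seven eight']
```

Threading the flag through explicitly keeps the caller's choice in force at every recursion depth, which is exactly what the patched line in `semchunk.py` does with its real `chunk()` function.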

0 commit comments