Commit d4ed011 (parent c1f5cb9)

Started benchmarking [semantic-text-splitter](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison.
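One way such a parallel run can be set up is sketched below. This is a minimal, hypothetical harness, not the commit's actual code: `chunk_one` is an illustrative stand-in for a real chunker (such as semantic-text-splitter's), and a `multiprocessing.Pool` spreads the texts across worker processes so a multi-core run is timed on equal footing.

```python
import multiprocessing as mp
import time

def chunk_one(text: str, size: int = 4) -> list[str]:
    # Illustrative stand-in for a real chunker: pack whitespace-delimited
    # words into groups of at most `size` words each.
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def bench_parallel(texts: list[str]) -> float:
    # Chunk every text across a pool of worker processes and return the
    # wall-clock seconds taken.
    start = time.perf_counter()
    with mp.Pool() as pool:
        pool.map(chunk_one, texts)
    return time.perf_counter() - start

if __name__ == '__main__':
    sample_texts = ['lorem ipsum dolor sit amet ' * 200] * 8
    print(f'parallel chunking took {bench_parallel(sample_texts):.2f}s')
```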

File tree: 4 files changed, +9 −3 lines

CHANGELOG.md (5 additions, 0 deletions)

```diff
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.2.1] - 2024-12-17
+### Changed
+- Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](https://github.com/benbrandt) ([#17](https://github.com/umarbutler/semchunk/pull/12)).
+
 ## [2.2.0] - 2024-07-12
 ### Changed
 - Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](https://github.com/umarbutler/semchunk/pull/7).
@@ -79,6 +83,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[2.2.1]: https://github.com/umarbutler/semchunk/compare/v2.2.0...v2.2.1
 [2.2.0]: https://github.com/umarbutler/semchunk/compare/v2.1.0...v2.2.0
 [2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
 [2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
```

README.md (2 additions, 2 deletions)

```diff
@@ -3,7 +3,7 @@
 
 `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
 
-Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 90% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
 
 ## Installation 📦
 `semchunk` may be installed with `pip`:
@@ -114,7 +114,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. All other characters.
 
 ## Benchmarks 📊
-On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes `semchunk` 6.69 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 116.48 seconds to chunk the same texts into 512-token-long chunks — a difference of 94.26%.
+On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
 
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
```
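The updated percentage follows directly from the two timings. As a quick sanity check on the arithmetic (the numbers are the README's; the formula is the ordinary relative difference):

```python
semchunk_seconds = 2.87
sts_seconds = 25.03  # semantic-text-splitter with multiprocessing

# Relative difference: how much less time semchunk takes than the alternative.
difference_pct = (sts_seconds - semchunk_seconds) / sts_seconds * 100
print(f'{difference_pct:.2f}%')  # → 88.53%
```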

pyproject.toml (1 addition, 1 deletion)

```diff
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "2.2.0"
+version = "2.2.1"
 authors = [
     {name="Umar Butler", email="[email protected]"},
 ]
```

tests/bench.py (1 addition, 0 deletions)

```diff
@@ -49,5 +49,6 @@ def bench_sts(texts: list[str]) -> None:
 
 if __name__ == '__main__':
     nltk.download('gutenberg')
+
     for library, time_taken in bench().items():
         print(f'{library}: {time_taken:.2f}s')
```
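The `bench()` call in the loop above implies a harness shaped roughly like the following. This is a hypothetical reconstruction (the real script is linked from the README): each library's chunking run is timed with `time.perf_counter()` and the results are keyed by library name, with trivial stand-in workloads in place of the actual corpus chunking.

```python
import time
from typing import Callable

def bench() -> dict[str, float]:
    # Hypothetical mirror of bench.py's bench(): time each library's chunking
    # routine and return {library name: seconds taken}. The real script chunks
    # NLTK's Gutenberg Corpus with semchunk and semantic-text-splitter.
    workloads: dict[str, Callable[[], object]] = {
        'semchunk': lambda: ('lorem ipsum ' * 1000).split(),
        'semantic-text-splitter': lambda: ('lorem ipsum ' * 1000).split(),
    }
    results: dict[str, float] = {}
    for library, run in workloads.items():
        start = time.perf_counter()
        run()
        results[library] = time.perf_counter() - start
    return results

if __name__ == '__main__':
    for library, time_taken in bench().items():
        print(f'{library}: {time_taken:.2f}s')
```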
