Commit 75dde65

Added benchmarks.
1 parent 73b3ccc commit 75dde65

2 files changed

+35
-0
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ## [0.1.1] - 2023-11-07
 ### Added
 - Added new test samples.
+- Added benchmarks.
 
 ### Changed
 - Improved chunking performance.

tests/bench.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+import semchunk
+import semantic_text_splitter
+import test_semchunk
+import time
+
+chunk_size = 512
+semantic_text_splitter_chunker = semantic_text_splitter.TiktokenTextSplitter('gpt-4')
+
+def bench_semchunk(text: str) -> None:
+    semchunk.chunk(text, chunk_size=chunk_size, token_counter=test_semchunk._token_counter)
+
+def bench_semantic_text_splitter(text: str) -> None:
+    semantic_text_splitter_chunker.chunks(text, chunk_size)
+
+libraries = {
+    'semchunk': bench_semchunk,
+    #'semantic_text_splitter': bench_semantic_text_splitter,
+}
+
+def bench() -> dict[str, float]:
+    benchmarks = dict.fromkeys(libraries.keys(), 0)
+
+    for fileid in test_semchunk.gutenberg.fileids():
+        sample = test_semchunk.gutenberg.raw(fileid)
+        for library, function in libraries.items():
+            start = time.time()
+            function(sample)
+            benchmarks[library] += time.time() - start
+
+    return benchmarks
+
+if __name__ == '__main__':
+    for library, time_taken in bench().items():
+        print(f'{library}: {time_taken:.2f}s')
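
Note on the timing loop in `tests/bench.py`: the harness adds up wall-clock time per library across every sample. The sketch below isolates that pattern so it can be run without `semchunk`, `semantic_text_splitter`, or the Gutenberg corpus. It is not part of this commit. The two splitter functions and the sample strings are hypothetical stand-ins, and two details differ from the committed code as suggestions only: `time.perf_counter()` replaces `time.time()`, and the totals start at `0.0` so the values match the declared `dict[str, float]` return type.

```python
import time
from typing import Callable


def bench(libraries: dict[str, Callable[[str], object]], samples: list[str]) -> dict[str, float]:
    """Return the total wall-clock seconds each function spends over all samples."""
    # Start from 0.0 so every total is a float even before any timing is added.
    totals = dict.fromkeys(libraries, 0.0)

    for sample in samples:
        for name, function in libraries.items():
            # perf_counter() is a monotonic, high-resolution clock meant for measuring intervals.
            start = time.perf_counter()
            function(sample)
            totals[name] += time.perf_counter() - start

    return totals


# Hypothetical stand-ins for the real chunkers; they only exist to exercise the harness.
def split_on_whitespace(text: str) -> list[str]:
    return text.split()


def split_fixed_width(text: str, width: int = 512) -> list[str]:
    return [text[i:i + width] for i in range(0, len(text), width)]


# Synthetic samples standing in for the Gutenberg texts used by the real benchmark.
samples = ['lorem ipsum dolor sit amet ' * 1000, 'the quick brown fox ' * 2000]

results = bench({'whitespace': split_on_whitespace, 'fixed_width': split_fixed_width}, samples)

for name, seconds in results.items():
    print(f'{name}: {seconds:.4f}s')
```

Passing the functions and samples in as arguments, instead of reading module-level globals, keeps the loop testable on its own. The measured times vary by machine and run, so no expected output is shown.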
