`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) (see [How It Works 🔍](#how-it-works-🔍)) and over 60% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see [Benchmarks 📊](#benchmarks-📊)).
## Installation 📦
`semchunk` may be installed with `pip`:
```bash
pip install semchunk
```
## Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with `semchunk`:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.

>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demonstration purposes.
>>> token_counter = lambda text: len(tiktoken.encoding_for_model('gpt-4').encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.

>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```
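Any callable that maps a string to a token count can serve as the token counter. For instance, if you would rather not depend on `tiktoken`, a whitespace word counter offers a rough, dependency-free approximation (a sketch for illustration, not an exact substitute for model tokenization):

```python
# A rough, dependency-free token counter: approximates tokens as
# whitespace-delimited words rather than model tokens.
token_counter = lambda text: len(text.split())

token_counter('The quick brown fox jumps over the lazy dog.')  # 9
```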
To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
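The precedence above can be illustrated with a minimal sketch (not `semchunk`'s actual implementation): recursively split on the highest-precedence splitter present in the text, falling back to lower-precedence ones, until every piece fits within the chunk size. The `splitters` tuple here is truncated for brevity, and the merging of small splits back into full-size chunks is omitted:

```python
def sketch_chunk(text: str, chunk_size: int, token_counter) -> list[str]:
    """Recursively split `text` until every piece fits within `chunk_size` tokens."""
    if token_counter(text) <= chunk_size:
        return [text] if text else []
    # A simplified precedence list; the real list is longer and ends with
    # word joiners and, finally, all other characters.
    for splitter in ('\n\n', '\n', '. ', ' '):
        if splitter in text:
            chunks = []
            for part in text.split(splitter):
                chunks.extend(sketch_chunk(part, chunk_size, token_counter))
            return chunks
    # No splitter present: fall back to an arbitrary character boundary.
    mid = max(1, len(text) // 2)
    return sketch_chunk(text[:mid], chunk_size, token_counter) + \
           sketch_chunk(text[mid:], chunk_size, token_counter)

word_counter = lambda text: len(text.split())
sketch_chunk('The quick brown fox jumps over the lazy dog.', 2, word_counter)
# → ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
```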
## Benchmarks 📊
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 35.75 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 50.5 seconds to chunk the same texts into 512-token-long chunks — a difference of 67.65%.
The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](tests/bench.py).
## Licence 📄
This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENSE).