Commit 192c47f: "Improved documentation."
1 parent: 75dde65

File tree: 1 file changed (+12, −3 lines)


README.md

Lines changed: 12 additions & 3 deletions
@@ -3,6 +3,8 @@
 
 `semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
 
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) (see [How It Works 🔍](#how-it-works-🔍)) and over 60% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see [Benchmarks 📊](#benchmarks-📊)).
+
 ## Installation 📦
 `semchunk` may be installed with `pip`:
 ```bash
@@ -11,12 +13,14 @@ pip install semchunk
 
 ## Usage 👩‍💻
 The code snippet below demonstrates how text can be chunked with `semchunk`:
-
 ```python
 >>> import semchunk
+>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
 >>> text = 'The quick brown fox jumps over the lazy dog.'
->>> token_counter = lambda text: len(text.split()) # If using `tiktoken`, you may replace this with `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
->>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
+>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
+>>> encoder = tiktoken.encoding_for_model('gpt-4')
+>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
+>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
 ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
 ```
 
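For readers without `tiktoken` installed, a plain whitespace-based token counter (as in the line the diff above removes) works too. The following is a minimal, self-contained sketch of greedy whitespace chunking for illustration only; it is a hypothetical stand-in, not `semchunk`'s actual recursive semantic splitting algorithm, and `naive_chunk` is not part of the library's API:

```python
# Illustration only: greedy whitespace chunking, NOT semchunk's algorithm.
def naive_chunk(text, chunk_size, token_counter):
    """Greedily pack whitespace-delimited words into chunks of at most
    `chunk_size` tokens, as measured by `token_counter`."""
    chunks, current = [], []
    for word in text.split():
        # Start a new chunk when adding this word would exceed the limit.
        if current and token_counter(' '.join(current + [word])) > chunk_size:
            chunks.append(' '.join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(' '.join(current))
    return chunks

token_counter = lambda t: len(t.split())  # simple whitespace token counter
print(naive_chunk('The quick brown fox jumps over the lazy dog.', 2, token_counter))
# → ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

With a whitespace counter and `chunk_size=2`, this reproduces the same output shown in the README's example, though `semchunk` itself chooses split points far more carefully.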
@@ -56,5 +60,10 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
 1. All other characters.
 
+## Benchmarks 📊
+On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, `semchunk` takes 35.75 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) takes 1 minute and 50.5 seconds to chunk the same texts into 512-token-long chunks, making `semchunk` 67.65% faster.
+
+The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](tests/bench.py).
+
 ## Licence 📄
 This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENSE).
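The benchmark's headline figure can be checked directly from the two reported wall-clock times: a reduction from 110.5 seconds to 35.75 seconds is a 67.65% cut in runtime.

```python
# Verify the reported speedup from the benchmark timings above.
semchunk_s = 35.75            # semchunk: 35.75 seconds
sts_s = 1 * 60 + 50.5         # semantic-text-splitter: 1 min 50.5 s = 110.5 s

# Fractional time saved relative to semantic-text-splitter.
speedup = (sts_s - semchunk_s) / sts_s
print(f'{speedup:.2%}')       # → 67.65%
```

This matches the 67.65% quoted in the Benchmarks section (and comfortably supports the "over 60% faster" claim made in the introduction).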

0 commit comments
