
Commit 2d40c38

Updated documentation
1 parent 5573297 commit 2d40c38

File tree

1 file changed (+62 -32 lines)

README.md

Lines changed: 62 additions & 32 deletions
@@ -1,86 +1,107 @@
<div align='center'>

# semchunk🧩

<a href="https://pypi.org/project/semchunk/" alt="PyPI Version"><img src="https://img.shields.io/pypi/v/semchunk"></a> <a href="https://github.com/umarbutler/semchunk/actions/workflows/ci.yml" alt="Build Status"><img src="https://img.shields.io/github/actions/workflow/status/umarbutler/semchunk/ci.yml?branch=main"></a> <a href="https://app.codecov.io/gh/umarbutler/semchunk" alt="Code Coverage"><img src="https://img.shields.io/codecov/c/github/umarbutler/semchunk"></a> <a href="https://pypistats.org/packages/semchunk" alt="Downloads"><img src="https://img.shields.io/pypi/dm/semchunk"></a>

</div>

`semchunk` is a fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

It has built-in support for tokenizers from OpenAI's `tiktoken` and Hugging Face's `transformers` and `tokenizers` libraries, in addition to supporting custom tokenizers and token counters. It can also overlap chunks as well as return their offsets.

Powered by an efficient yet highly accurate chunking algorithm ([How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)), `semchunk` produces chunks that are more semantically meaningful than those of conventional token and recursive character chunkers like `langchain`'s [`RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/), while also being over 80% faster than its closest alternative, [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) ([Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).

## Installation 📦
`semchunk` can be installed with `pip`:
```bash
pip install semchunk
```

`semchunk` is also available on `conda-forge`:
```bash
conda install conda-forge::semchunk
# or
conda install -c conda-forge semchunk
```

In addition, [@dominictarro](https://github.com/dominictarro) maintains a Rust port of `semchunk` named [`semchunk-rs`](https://crates.io/crates/semchunk-rs).

## Usage 👩‍💻
The code snippet below demonstrates how to chunk text with `semchunk`:
```python
import semchunk
import tiktoken                         # `transformers` and `tiktoken` are not required.
from transformers import AutoTokenizer  # They're just here for demonstration purposes.

chunk_size = 4
text = 'The quick brown fox jumps over the lazy dog.'

# You can construct a chunker with `semchunk.chunkerify()` by passing the name of an OpenAI model,
# OpenAI `tiktoken` encoding or Hugging Face model, or a custom tokenizer that has an `encode()`
# method (like a `tiktoken`, `transformers` or `tokenizers` tokenizer) or a custom token counting
# function that takes a text and returns the number of tokens in it.
chunker = semchunk.chunkerify('umarbutler/emubert', chunk_size) or \
          semchunk.chunkerify('gpt-4', chunk_size) or \
          semchunk.chunkerify('cl100k_base', chunk_size) or \
          semchunk.chunkerify(AutoTokenizer.from_pretrained('umarbutler/emubert'), chunk_size) or \
          semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
          semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

# If you give the resulting chunker a single text, it'll return a list of chunks. If you give it a
# list of texts, it'll return a list of lists of chunks.
assert chunker(text) == ['The quick brown fox', 'jumps over the', 'lazy dog.']
assert chunker([text], progress = True) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# If you have a lot of texts and you want to speed things up, you can enable multiprocessing by
# setting `processes` to a number greater than 1.
assert chunker([text], processes = 2) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
# argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
chunks, offsets = chunker(text, offsets = True, overlap = 0.5)
```

### `chunkerify()`
```python
def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | \
                                tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int = None,
    max_token_chars: int = None,
    memoize: bool = True,
) -> Callable[[str | Sequence[str], bool, bool, bool, int | float | None], list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]]:
```

`chunkerify()` constructs a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.

`tokenizer_or_token_counter` is either: the name of a `tiktoken` or `transformers` tokenizer (with priority given to the former); a tokenizer that possesses an `encode` attribute (e.g., a `tiktoken`, `transformers` or `tokenizers` tokenizer); or a token counter that returns the number of tokens in an input.

`chunk_size` is the maximum number of tokens a chunk may contain. It defaults to `None`, in which case it will be set to the value of the tokenizer's `model_max_length` attribute (minus the number of tokens returned by tokenizing an empty string) if possible; otherwise, a `ValueError` will be raised.

`max_token_chars` is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None`, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.

`memoize` flags whether to memoize the token counter. It defaults to `True`.

This function returns a chunker that takes either a single text or a sequence of texts and returns, depending on whether one or multiple texts were provided, a list or a list of lists of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed. If the chunker's optional `offsets` argument is `True`, it also returns a list (or list of lists) of tuples of the form `(start, end)`, where `start` is the index of the first character of a chunk in a text and `end` is the index of the character succeeding the last character of the chunk, such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.

The resulting chunker can be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.

It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar.

As described above, the `offsets` argument, if set to `True`, will cause the chunker to return the start and end offsets of each chunk.

The chunker also accepts an `overlap` argument that specifies the proportion of the chunk size, or, if >= 1, the number of tokens, by which chunks should overlap. It defaults to `None`, in which case no overlapping occurs.
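
As a quick illustration of the offset invariant described above, here is a minimal sketch (using a toy whitespace token counter, purely for demonstration) of how offsets can be used to slice chunks back out of the original text:
```python
import semchunk

# A toy whitespace token counter; any tokenizer accepted by `chunkerify()` works the same way.
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size = 4)

text = 'The quick brown fox jumps over the lazy dog.'
chunks, offsets = chunker(text, offsets = True)

# Every `(start, end)` offset slices the original text back into its chunk.
for chunk, (start, end) in zip(chunks, offsets):
    assert text[start:end] == chunk
```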

### `chunk()`
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable,
    memoize: bool = True,
    offsets: bool = False,
    overlap: float | int | None = None,
) -> list[str]
```

@@ -94,14 +115,19 @@ def chunk(

`memoize` flags whether to memoize the token counter. It defaults to `True`.

`offsets` flags whether to return the start and end offsets of each chunk. It defaults to `False`.

`overlap` specifies the proportion of the chunk size, or, if >= 1, the number of tokens, by which chunks should overlap. It defaults to `None`, in which case no overlapping occurs.

This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, and, if `offsets` is `True`, a list of tuples of the form `(start, end)`, where `start` is the index of the first character of a chunk in the original text and `end` is the index of the character after the last character of the chunk, such that `chunks[i] == text[offsets[i][0]:offsets[i][1]]`.
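
For example, a minimal sketch of calling `chunk()` directly (again with a toy whitespace token counter used purely for demonstration):
```python
import semchunk

chunks = semchunk.chunk(
    'The quick brown fox jumps over the lazy dog.',
    chunk_size = 4,
    token_counter = lambda text: len(text.split()),  # A toy whitespace token counter.
)

# Requesting offsets as well returns a `(chunks, offsets)` tuple instead.
chunks, offsets = semchunk.chunk(
    'The quick brown fox jumps over the lazy dog.',
    chunk_size = 4,
    token_counter = lambda text: len(text.split()),
    offsets = True,
)
```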

## How It Works 🔍
`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
1. Splits text using the most semantically meaningful splitter possible;
1. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
1. Merges any chunks that are under the chunk size back together until the chunk size is reached;
1. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
1. Since version XXXX, excludes chunks consisting entirely of whitespace characters.

To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
@@ -113,6 +139,10 @@
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
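
Putting the two lists above together, the following is a highly simplified sketch of the algorithm. It is not `semchunk`'s actual implementation (which is heavily optimised and also reattaches non-whitespace splitters); it assumes just three whitespace splitters plus a character-level fallback:
```python
def simple_chunk(text: str, chunk_size: int, count_tokens) -> list[str]:
    """A toy illustration of recursive splitting and merging, not `semchunk` itself."""
    # 1. Split using the most semantically meaningful splitter available.
    for splitter in ('\n', '\t', ' '):
        if splitter in text:
            splits = text.split(splitter)
            break
    else:
        splitter, splits = '', list(text)  # Fall back to splitting into characters.

    # 2. Recursively split any piece that still exceeds the chunk size.
    pieces = []
    for split in splits:
        if count_tokens(split) > chunk_size:
            pieces.extend(simple_chunk(split, chunk_size, count_tokens))
        else:
            pieces.append(split)

    # 3. Merge adjacent pieces back together for as long as they fit within the chunk size.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current}{splitter}{piece}' if current else piece
        if count_tokens(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)

    # 4. Exclude chunks consisting entirely of whitespace.
    return [chunk for chunk in chunks if chunk.strip()]
```

With a whitespace token counter, `simple_chunk('The quick brown fox jumps over the lazy dog.', 4, lambda t: len(t.split()))` yields `['The quick brown fox', 'jumps over the lazy', 'dog.']`.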

If overlapping chunks have been requested, `semchunk` also:
1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
1. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
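
As a worked example of that arithmetic (an illustration of the formulas above, not library code), consider a chunk size of 8 with a relative overlap of 0.5:
```python
from math import floor

chunk_size = 8
overlap = 0.5                                               # A relative overlap.

overlap_tokens = floor(chunk_size * overlap)                # 4 tokens of overlap.
reduced = min(overlap_tokens, chunk_size - overlap_tokens)  # Reduced chunk size of 4.

merge = floor(chunk_size / reduced)                         # Merge every 2 reduced chunks...
jump = floor((chunk_size - overlap_tokens) / reduced)       # ...jumping ahead by 1 each time.
```

So the text is first chunked into 4-token chunks, and every two consecutive chunks are then merged into one 8-token chunk, advancing one reduced chunk at a time, which yields chunks that share 4 tokens with their neighbours.
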
## Benchmarks 📊
On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
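
A rough sketch of how such a benchmark could be reproduced (assuming `nltk` is installed and the Gutenberg Corpus has been downloaded; this is not the exact benchmarking script):
```python
import time

import semchunk
from nltk.corpus import gutenberg  # Requires `nltk.download('gutenberg')`.

chunker = semchunk.chunkerify('gpt-4', chunk_size = 512)
texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]

start = time.perf_counter()
chunker(texts)
print(f'Chunked {len(texts)} texts in {time.perf_counter() - start:.2f}s.')
```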
