
Commit ee802ed

Added support for multiprocessing.

1 parent 72b03b9 commit ee802ed

5 files changed (+48 -10 lines)

CHANGELOG.md

Lines changed: 7 additions & 0 deletions

@@ -1,6 +1,13 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.0.0] - 2024-06-19
+### Added
+- Added support for multiprocessing through the `processes` argument passable to chunkers constructed by `chunkerify()`.
+
+### Removed
+- No longer guaranteed that `semchunk` is pure Python.
+
 ## [1.0.1] - 2024-06-02
 ### Fixed
 - Documented the `progress` argument in the docstring for `chunkerify()` and its type hint in the README.

README.md

Lines changed: 10 additions & 2 deletions

@@ -36,6 +36,10 @@ chunker = semchunk.chunkerify('umarbutler/emubert', chunk_size) or \
 # chunks or a list of lists of chunks, respectively.
 assert chunker(text) == ['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']
 assert chunker([text], progress = True) == [['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']]
+
+# If you have a large number of texts to chunk and speed is a concern, you can also enable
+# multiprocessing by setting `processes` to a number greater than 1.
+assert chunker([text], processes = 2) == [['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']]
 ```
 
 ### Chunkerify
@@ -46,7 +50,7 @@ def chunkerify(
     chunk_size: int = None,
     max_token_chars: int = None,
     memoize: bool = True,
-) -> Callable[[str | Sequence[str], bool], list[str] | list[list[str]]]:
+) -> Callable[[str | Sequence[str], int, bool], list[str] | list[list[str]]]:
 ```
 
 `chunkerify()` constructs a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
@@ -59,7 +63,11 @@ def chunkerify(
 
 `memoize` flags whether to memoize the token counter. It defaults to `True`.
 
-This function returns a callable that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts. The callable can also be passed a `progress` argument which if set to `True` and multiple texts are passed, will display a progress bar.
+This function returns a callable that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+
+The resulting chunker function can also be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
+
+It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar.
 
 ### Chunk
 ```python
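
As an aside on the README additions above, here is a minimal, self-contained sketch of the new `processes` argument in use. The whitespace token counter and the sample texts are illustrative assumptions rather than part of this commit; any tokenizer or token counter accepted by `chunkerify()` behaves the same way.

```python
import semchunk

# A toy whitespace token counter (an assumption for illustration);
# any callable mapping a string to a token count is accepted.
def token_counter(text: str) -> int:
    return len(text.split())

# Construct a chunker that caps chunks at 4 tokens.
chunker = semchunk.chunkerify(token_counter, chunk_size = 4)

texts = ['The quick brown fox jumps over the lazy dog.'] * 1_000

if __name__ == '__main__':
    # The default (`processes = 1`) chunks serially in the main process ...
    serial = chunker(texts)

    # ... while `processes > 1` fans the texts out over worker processes,
    # with `progress = True` still rendering a progress bar.
    parallel = chunker(texts, processes = 4, progress = True)

    # Both paths run the same underlying chunking function, so the
    # outputs are identical; `processes` only changes where the work runs.
    assert serial == parallel
```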

pyproject.toml

Lines changed: 2 additions & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "1.0.1"
+version = "2.0.0"
 authors = [
     {name="Umar Butler", email="[email protected]"},
 ]
@@ -45,6 +45,7 @@ classifiers = [
 ]
 dependencies = [
     "tqdm",
+    "mpire",
 ]
 
 [project.urls]

src/semchunk/semchunk.py

Lines changed: 25 additions & 7 deletions

@@ -9,6 +9,8 @@
 from itertools import accumulate
 from contextlib import suppress
 
+import mpire
+
 from tqdm import tqdm
 
 if TYPE_CHECKING:
@@ -168,7 +170,11 @@ def chunkerify(
         memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
 
     Returns:
-        Callable[[str | Sequence[str], bool], list[str] | list[list[str]]]: A function that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts. The function can also be passed a `progress` argument which if set to `True` and multiple texts are passed, will display a progress bar."""
+        Callable[[str | Sequence[str], int, bool], list[str] | list[list[str]]]: A function that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+
+        The resulting chunker function can also be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
+
+        It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar."""
 
     # If the provided tokenizer is a string, try to load it with either `tiktoken` or `transformers` or raise an error if neither is available.
     if isinstance(tokenizer_or_token_counter, str):
@@ -251,24 +257,36 @@ def faster_token_counter(text: str) -> int:
     if memoize:
         token_counter = _memoized_token_counters.setdefault(token_counter, cache(token_counter))
 
+    # Construct a chunking function that passes the chunk size and token counter to `chunk()`.
+    def chunking_function(text: str) -> list[str]:
+        return chunk(text, chunk_size, token_counter, memoize = False)
+
     # Construct and return the chunker.
-    def chunker(text_or_texts: str | Sequence[str], progress: bool = False) -> list[str] | list[list[str]]:
+    def chunker(
+        text_or_texts: str | Sequence[str],
+        processes: int = 1,
+        progress: bool = False,
+    ) -> list[str] | list[list[str]]:
         """Split text or texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
 
         Args:
             text_or_texts (str | Sequence[str]): The text or texts to be chunked.
+            processes (int, optional): The number of processes to use when chunking multiple texts. Defaults to `1`, in which case chunking will occur in the main process.
             progress (bool, optional): Whether to display a progress bar when chunking multiple texts. Defaults to `False`.
 
         Returns:
             list[str] | list[list[str]]: If a single text has been provided, a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts."""
 
         if isinstance(text_or_texts, str):
-            return chunk(text_or_texts, chunk_size, token_counter, memoize = False)
+            return chunking_function(text_or_texts)
 
-        if progress:
-            return [chunk(text, chunk_size, token_counter, memoize = False) for text in tqdm(text_or_texts)]
+        if progress and processes == 1:
+            text_or_texts = tqdm(text_or_texts)
 
-        else:
-            return [chunk(text, chunk_size, token_counter, memoize = False) for text in text_or_texts]
+        if processes == 1:
+            return [chunking_function(text) for text in text_or_texts]
+
+        with mpire.WorkerPool(processes, use_dill = True) as pool:
+            return pool.map(chunking_function, text_or_texts, progress_bar = progress)
 
     return chunker
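
A note on the implementation above: `chunking_function` is a closure over `chunk_size` and `token_counter`, which the standard `pickle` module cannot serialize, and that is why the pool is constructed with `use_dill = True`. The following standalone sketch illustrates the same `mpire` pattern; the `make_worker` helper and sample data are illustrative assumptions, not part of this commit.

```python
import mpire

def make_worker(suffix: str):
    # A closure over `suffix`, analogous to `chunking_function` closing over
    # `chunk_size` and `token_counter` above. Plain `pickle` cannot serialize
    # closures, so the pool must be created with `use_dill = True`.
    def work(item: str) -> str:
        return item + suffix
    return work

if __name__ == '__main__':
    work = make_worker('!')

    # `use_dill = True` lets mpire ship the closure to worker processes;
    # `progress_bar = True` mirrors how the chunker forwards its `progress` flag.
    with mpire.WorkerPool(2, use_dill = True) as pool:
        results = pool.map(work, ['a', 'b', 'c'], progress_bar = True)

    assert results == ['a!', 'b!', 'c!']
```

Keeping `processes = 1` on a separate fast path also avoids pool start-up overhead for small workloads: the chunker only spins up a `WorkerPool` when more than one process is requested.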

tests/test_semchunk.py

Lines changed: 4 additions & 0 deletions

@@ -48,6 +48,10 @@ def tiktoken_token_counter(text: str) -> int:
     chunker = semchunk.chunkerify(tiktoken_token_counter, 4)
     assert chunker(['ThisIs\tATest.', 'ThisIs\tATest.']) == [['ThisIs', 'ATest.'], ['ThisIs', 'ATest.']]
 
+    # Test chunking multiple texts with multiple processes.
+    chunker = semchunk.chunkerify(tiktoken_token_counter, 4)
+    assert chunker(['ThisIs\tATest.', 'ThisIs\tATest.'], processes = 2) == [['ThisIs', 'ATest.'], ['ThisIs', 'ATest.']]
+
     # Test using a `transformers` tokenizer.
     chunker = semchunk.chunkerify(transformers_tokenizer)
     assert chunker('ThisIs\tATest.') == ['ThisIs\tATest.']
