CHANGELOG.md: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.2.0] - 2024-07-12
+### Changed
+- Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](https://github.com/umarbutler/semchunk/pull/7).
+
 ## [2.1.0] - 2024-06-20
 ### Fixed
 - Ceased memoizing `chunk()` (but not token counters) due to the fact that cached outputs of memoized functions are shallow rather than deep copies of original outputs, meaning that if one were to chunk a text and then chunk that same text again and then modify one of the chunks outputted by the first call, the chunks outputted by the second call would also be modified. This behaviour is not expected and therefore undesirable. The memoization of token counters is not impacted as they output immutable objects, namely, integers.
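The aliasing problem described in the 2.1.0 entry can be reproduced with a small sketch. It assumes `functools.cache`-style memoization (the changelog does not say which mechanism `semchunk` used), and `toy_chunk` and `toy_token_counter` are hypothetical stand-ins, not the library's functions.

```python
from functools import cache


@cache
def toy_chunk(text: str) -> list[str]:
    """Hypothetical stand-in for a memoized chunker: split on whitespace."""
    return text.split()


@cache
def toy_token_counter(text: str) -> int:
    """Hypothetical token counter: returns an immutable integer."""
    return len(text.split())


first = toy_chunk("one two three")
second = toy_chunk("one two three")  # served from the cache

# Both names refer to the same cached list, so mutating one mutates the other.
first[0] = "CHANGED"
print(first is second)
print(second[0])

# Integers are immutable, so caching the token counter cannot leak mutations.
print(toy_token_counter("one two three"))
```

Under this assumption the cache hands back the very same list object on the second call, which is why only functions returning immutable values (such as token counts) are safe to memoize without copying.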
README.md: 4 additions & 2 deletions
@@ -66,12 +66,14 @@ def chunkerify(
 
 `memoize` flags whether to memoize the token counter. It defaults to `True`.
 
-This function returns a callable that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+This function returns a chunker that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
 
-The resulting chunker function can also be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
+The resulting chunker can be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
 
 It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar.
 
+Technically, the chunker will be an instance of the `semchunk.Chunker` class to assist with type hinting, though this should have no impact on how it can be used.
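As a rough illustration of why returning a class instance helps with type hints, the sketch below has a factory return an object whose `__call__` carries its own signature and docstring. `ToyChunker`, `toy_chunkerify`, and `count_words` are hypothetical names, and the greedy word-packing logic is a deliberate simplification rather than `semchunk`'s algorithm.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence


class ToyChunker:
    """Hypothetical stand-in for a chunker class; not semchunk's implementation."""

    def __init__(self, chunk_size: int, token_counter: Callable[[str], int]) -> None:
        self.chunk_size = chunk_size
        self.token_counter = token_counter

    def _chunk(self, text: str) -> list[str]:
        # Greedily pack whitespace-separated words into chunks of at most
        # `chunk_size` tokens (a single oversized word is emitted as is).
        chunks: list[str] = []
        current: list[str] = []
        for word in text.split():
            candidate = " ".join(current + [word])
            if current and self.token_counter(candidate) > self.chunk_size:
                chunks.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        if current:
            chunks.append(" ".join(current))
        return chunks

    def __call__(self, text_or_texts: str | Sequence[str]) -> list[str] | list[list[str]]:
        """Chunk a single text into a list, or several texts into a list of lists."""
        if isinstance(text_or_texts, str):
            return self._chunk(text_or_texts)
        return [self._chunk(text) for text in text_or_texts]


def toy_chunkerify(token_counter: Callable[[str], int], chunk_size: int) -> ToyChunker:
    # The return annotation names a concrete class, so editors can show the
    # signature and docstring of `ToyChunker.__call__` for the returned object.
    return ToyChunker(chunk_size, token_counter)


def count_words(text: str) -> int:
    """Hypothetical token counter that treats each word as one token."""
    return len(text.split())


chunker = toy_chunkerify(count_words, chunk_size=2)
print(chunker("a b c d e"))
print(chunker(["a b c", "d"]))
```

Callers use the returned object exactly like a function, which matches the claim that the change should have no impact on usage.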
"""Split text or texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
+
+Args:
+text_or_texts (str | Sequence[str]): The text or texts to be chunked.
+
+Returns:
+list[str] | list[list[str]]: If a single text has been provided, a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+processes (int, optional): The number of processes to use when chunking multiple texts. Defaults to `1` in which case chunking will occur in the main process.
+progress (bool, optional): Whether to display a progress bar when chunking multiple texts. Defaults to `False`."""
): # NOTE The output of `chunkerify()` is not type hinted because it causes `vscode` to overwrite the signature and docstring of the outputted chunker with the type hint.
+)->Chunker:
 """Construct a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
 
 Args:
@@ -167,11 +207,13 @@ def chunkerify(
 memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
 
 Returns:
-Callable[[str | Sequence[str], bool, bool], list[str] | list[list[str]]]: A function that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+Callable[[str | Sequence[str], bool, bool], list[str] | list[list[str]]]: A chunker that takes either a single text or a sequence of texts and returns, if a single text has been provided, a list of chunks up to `chunk_size`-tokens-long with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
+
+The resulting chunker can be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
 
-The resulting chunker function can also be passed a `processes` argument that specifies the number of processes to be used when chunking multiple texts.
+It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar.
 
-It is also possible to pass a `progress` argument which, if set to `True` and multiple texts are passed, will display a progress bar."""
+Technically, the chunker will be an instance of the `semchunk.Chunker` class to assist with type hinting, though this should have no impact on how it can be used."""
 
 # If the provided tokenizer is a string, try to load it with either `tiktoken` or `transformers` or raise an error if neither is available.
"""Split text or texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
-
-Args:
-text_or_texts (str | Sequence[str]): The text or texts to be chunked.
-
-Returns:
-list[str] | list[list[str]]: If a single text has been provided, a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed, or, if multiple texts have been provided, a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.
-processes (int, optional): The number of processes to use when chunking multiple texts. Defaults to `1` in which case chunking will occur in the main process.
-progress (bool, optional): Whether to display a progress bar when chunking multiple texts. Defaults to `False`."""
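The dispatch that the docstring describes (a flat list for one text, a list of lists for several, and worker processes only when `processes` is greater than `1`) might look roughly like the sketch below. It uses the standard library's `multiprocessing.Pool` purely as an assumption, since the diff does not show which multiprocessing backend the library relies on, and `toy_chunk` and `chunk_texts` are hypothetical names.

```python
from __future__ import annotations

from collections.abc import Sequence
from multiprocessing import Pool


def toy_chunk(text: str) -> list[str]:
    """Hypothetical stand-in for the real chunking routine: split on whitespace."""
    return text.split()


def chunk_texts(
    text_or_texts: str | Sequence[str],
    processes: int = 1,
) -> list[str] | list[list[str]]:
    # A single text yields a flat list of chunks.
    if isinstance(text_or_texts, str):
        return toy_chunk(text_or_texts)

    # With one process, chunk every text in the main process.
    if processes == 1:
        return [toy_chunk(text) for text in text_or_texts]

    # Otherwise, distribute the texts across a pool of worker processes.
    # `toy_chunk` is defined at module level so that it can be pickled.
    with Pool(processes) as pool:
        return pool.map(toy_chunk, text_or_texts)
```

When run as a script with `processes` greater than `1`, the call should sit under an `if __name__ == "__main__":` guard so that platforms which spawn fresh interpreters do not re-execute it in each worker.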