Releases · isaacus-dev/semchunk · GitHub

13 Feb 23:02

umarbutler

v3.0.4

Fixed

Fixed a bug where attempting to chunk a text consisting only of whitespace characters would raise ValueError: not enough values to unpack (expected 2, got 0) (ScrapeGraphAI/Scrapegraph-ai#893).

13 Feb 05:54

umarbutler

v3.0.3

Fixed

Fixed isaacus/emubert mistakenly being set to isaacus-dev/emubert in the README and tests.

13 Feb 05:47

umarbutler

v3.0.2

This release was yanked due to a typo.

Fixed

Significantly sped up chunking very long texts with little to no variation in levels of whitespace used (fixing #8) and, in the process, also slightly improved overall performance.

Changed

Transferred semchunk to Isaacus.
Began formatting with Ruff.

10 Jan 02:01

umarbutler

v3.0.1

Fixed

Fixed a bug where attempting to chunk an empty text would raise a ValueError.

31 Dec 04:40

umarbutler

v3.0.0

Added

Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.
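
As a rough illustration of how the offsets and overlap arguments described above might be interpreted, the sketch below uses a hypothetical resolve_overlap() helper (not part of semchunk) to turn an overlap value into a token count, and treats offsets as (start, end) character positions that slice the original text. It assumes a fractional overlap is rounded down and that end offsets are exclusive; the library's actual rounding and capping behaviour may differ.

```python
import math
from typing import List, Optional, Tuple, Union


def resolve_overlap(chunk_size: int, overlap: Optional[Union[int, float]]) -> int:
    """Hypothetical helper: convert an `overlap` value into a number of tokens.

    Assumption: values below 1 are a proportion of `chunk_size` (rounded down),
    values of 1 or more are already a token count, and None means no overlap.
    """
    if overlap is None:
        return 0
    if overlap < 1:
        return math.floor(chunk_size * overlap)
    return int(overlap)


# A proportion, an absolute token count, and the default (no overlap).
half = resolve_overlap(512, 0.5)
fixed = resolve_overlap(512, 64)
none = resolve_overlap(512, None)

# Assumption: offsets are (start, end) character positions with an exclusive
# end, so slicing the original text with them recovers each chunk.
text = "The quick brown fox"
offsets: List[Tuple[int, int]] = [(0, 9), (10, 19)]
chunks = [text[start:end] for start, end in offsets]
```

Under these assumptions, an overlap of 0.5 with a chunk size of 512 resolves to 256 tokens, while an overlap of 64 is taken as 64 tokens.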

Changed

Began removing chunks comprised entirely of whitespace characters from the output of chunk().
Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.

Fixed

Fixed an error in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments was listed under the section for the method's returns.

Removed

Removed undocumented, private chunk() method from the Chunker class returned by chunkerify().
Removed undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.

19 Dec 05:00

umarbutler

v2.2.2

Fixed

Ensured hatch does not include irrelevant files in the distribution.

17 Dec 04:47

umarbutler

v2.2.1

Changed

Started benchmarking semantic-text-splitter in parallel to ensure a fair comparison, courtesy of @benbrandt (#17).

12 Jul 11:29

umarbutler

v2.2.0

Changed

Switched from having chunkerify() return a function to having it return an instance of the new Chunker class. This should not alter functionality in any way but allows type hints to be preserved, fixing #7.
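
The general pattern of returning a callable class instance rather than a bare function can be sketched as follows. ToyChunker and toy_chunkerify() are hypothetical stand-ins written for this illustration, not semchunk's actual implementation; the point is only that an annotated __call__() method gives type checkers and editors a stable signature to inspect.

```python
from typing import Callable, List


class ToyChunker:
    """Hypothetical callable class standing in for a chunker object."""

    def __init__(self, chunk_size: int, token_counter: Callable[[str], int]) -> None:
        self.chunk_size = chunk_size
        self.token_counter = token_counter

    def __call__(self, text: str) -> List[str]:
        """Greedily pack whitespace-separated words into chunks."""
        chunks: List[str] = []
        current: List[str] = []
        for word in text.split():
            candidate = current + [word]
            # Start a new chunk once adding the word would exceed the limit.
            if current and self.token_counter(" ".join(candidate)) > self.chunk_size:
                chunks.append(" ".join(current))
                current = [word]
            else:
                current = candidate
        if current:
            chunks.append(" ".join(current))
        return chunks


def toy_chunkerify(token_counter: Callable[[str], int], chunk_size: int) -> ToyChunker:
    """Hypothetical factory: returns an instance, which is called like a function."""
    return ToyChunker(chunk_size, token_counter)


# The instance is used exactly as a function would be.
chunker = toy_chunkerify(lambda t: len(t.split()), chunk_size=2)
result = chunker("one two three four five")
```

Calling code does not change, which is consistent with the note that functionality should be unaffected.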

20 Jun 02:50

umarbutler

v2.1.0

Fixed

Ceased memoizing chunk() (but not token counters) because the cached outputs of memoized functions are shallow rather than deep copies of the original outputs. If one were to chunk a text, chunk that same text again, and then modify one of the chunks returned by the first call, the chunks returned by the second call would also be modified. This behaviour is unexpected and therefore undesirable. The memoization of token counters is not affected as they return immutable objects, namely integers.
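
The aliasing problem can be reproduced with a minimal sketch. Here functools.lru_cache stands in for whatever memoization semchunk used, and toy_chunk() and toy_token_counter() are hypothetical functions written for this illustration only.

```python
from functools import lru_cache
from typing import List


@lru_cache(maxsize=None)
def toy_chunk(text: str) -> List[str]:
    """Hypothetical memoized chunker that returns a mutable list."""
    return text.split()


@lru_cache(maxsize=None)
def toy_token_counter(text: str) -> int:
    """Hypothetical memoized token counter that returns an immutable int."""
    return len(text.split())


first = toy_chunk("alpha beta gamma")
second = toy_chunk("alpha beta gamma")  # served from the cache

# Both names refer to the same cached list object.
same_object = first is second

# Mutating the result of the first call silently changes the second.
first[0] = "MUTATED"
leaked = second[0]

# Integers cannot be mutated in place, so caching them is safe.
count = toy_token_counter("alpha beta gamma")
```

After the mutation, second[0] is also "MUTATED" because the cache hands back the same list rather than a fresh copy.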

19 Jun 06:08

umarbutler

v2.0.0

Added

Added support for multiprocessing through the processes argument passable to chunkers constructed by chunkerify().

Removed

Removed the guarantee that semchunk is pure Python.
