Releases: isaacus-dev/semchunk
v3.0.4
Fixed
- Fixed a bug where attempting to chunk only whitespace characters would raise ValueError: not enough values to unpack (expected 2, got 0) (ScrapeGraphAI/Scrapegraph-ai#893).
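An error message of this shape typically arises when an empty sequence is unpacked into two names, for example when a list of (chunk, offset) pairs is unzipped after every pair has been filtered out. The sketch below reproduces that failure mode and shows one way to guard against it. It is not semchunk's code: unzip_pairs, safe_unzip_pairs and drop_whitespace_chunks are hypothetical helpers, and the idea that the bug involved unzipping an empty list is an inference from the error message alone.

```python
def drop_whitespace_chunks(pairs):
    """Remove pairs whose chunk is empty or consists solely of whitespace."""
    return [(chunk, offset) for chunk, offset in pairs if chunk.strip()]


def unzip_pairs(pairs):
    """Split [(chunk, offset), ...] into two tuples. Fails on an empty list."""
    # With an empty list this becomes zip(), which yields nothing to unpack.
    chunks, offsets = zip(*pairs)
    return chunks, offsets


def safe_unzip_pairs(pairs):
    """Same as unzip_pairs, but returns two empty tuples for empty input."""
    if not pairs:
        return (), ()
    chunks, offsets = zip(*pairs)
    return chunks, offsets
```

Under these assumptions, a whitespace-only input leaves no pairs after filtering, so the unguarded version raises the ValueError while the guarded version returns two empty tuples.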
v3.0.3
Fixed
- Fixed isaacus/emubert mistakenly being set to isaacus-dev/emubert in the README and tests.
v3.0.2
v3.0.1
Fixed
- Fixed a bug where attempting to chunk an empty text would raise a ValueError.
v3.0.0
Added
- Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
- Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
- Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
- Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.
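To make the overlap rule concrete, the sketch below converts an overlap value into a token count: values below 1 are read as a proportion of the chunk size, and values of 1 or more as a number of tokens. resolve_overlap is a hypothetical helper, not part of semchunk's API, and both the rounding down and the cap at chunk_size - 1 are assumptions made here so that each chunk still advances by at least one token.

```python
import math
from typing import Optional, Union


def resolve_overlap(chunk_size: int, overlap: Optional[Union[int, float]]) -> int:
    """Return the number of tokens by which consecutive chunks overlap."""
    if not overlap:  # None or 0 means no overlapping
        return 0
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    if overlap < 1:
        # Proportion of the chunk size, rounded down (assumption).
        return math.floor(chunk_size * overlap)
    # Absolute token count, capped below the chunk size (assumption).
    return min(int(overlap), chunk_size - 1)
```

With a chunk size of 512, an overlap of 0.5 resolves to 256 tokens under this sketch, while an overlap of 64 stays at 64 tokens.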
Changed
- Began removing chunks comprised entirely of whitespace characters from the output of chunk().
- Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.
Fixed
- Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments was listed under the section for the method's returns.
Removed
- Removed the undocumented, private chunk() method from the Chunker class returned by chunkerify().
- Removed the undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.
v2.2.2
Fixed
- Ensured hatch does not include irrelevant files in the distribution.
v2.2.1
Changed
- Started benchmarking semantic-text-splitter in parallel to ensure a fair comparison, courtesy of @benbrandt (#17).
v2.2.0
v2.1.0
Fixed
- Ceased memoizing chunk() (but not token counters) because cached outputs of memoized functions are shallow rather than deep copies of the original outputs. If one were to chunk a text, chunk that same text again and then modify one of the chunks outputted by the first call, the chunks outputted by the second call would also be modified. This behaviour is unexpected and therefore undesirable. The memoization of token counters is not impacted as they output immutable objects, namely, integers.
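The aliasing problem described above can be reproduced with the standard library alone. The sketch below uses functools.lru_cache as a stand-in for whatever memoization semchunk used. cached_split and cached_token_count are hypothetical functions written for illustration, not semchunk's own.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_split(text: str) -> list:
    """Hypothetical memoized chunker that returns a mutable list."""
    return text.split()


@lru_cache(maxsize=None)
def cached_token_count(text: str) -> int:
    """Hypothetical memoized token counter that returns an immutable int."""
    return len(text.split())


first = cached_split("one two three")
second = cached_split("one two three")  # cache hit: the very same list object

# Mutating the result of the first call also changes the second call's result.
first[0] = "CHANGED"
shared = first is second
leaked = second[0]

# Integers are immutable, so caching the token counter carries no such risk.
count = cached_token_count("one two three")
```

Here the cache hands back the same list on every hit, so an edit made through one reference is visible through all of them. Returning a copy from the memoized function would avoid this at the cost of extra work per call.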
v2.0.0
Added
- Added support for multiprocessing through the processes argument passable to chunkers constructed by chunkerify().
Removed
- No longer guaranteed that semchunk is pure Python.