CHANGELOG.md
5 additions & 0 deletions
@@ -1,6 +1,10 @@
## Changelog 🔄
All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [3.2.0] - 2025-03-20
+### Changed
+- Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimal varying levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at the occurrence of single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general ([#17](https://github.com/isaacus-dev/semchunk/issues/17)).
+
## [3.1.3] - 2025-03-10
### Changed
- Added mention of Isaacus to the README.
@@ -141,6 +145,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
### Added
- Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
README.md
1 addition & 1 deletion
@@ -141,7 +141,7 @@ This function returns a list of chunks up to `chunk_size`-tokens-long, with any
To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
1. The largest sequence of tabs;
-1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
+1. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, since version 3.2.0, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
# Try splitting at, in order of most desirable to least desirable:
# - The largest sequence of newlines and/or carriage returns;
# - The largest sequence of tabs;
-# - The largest sequence of whitespace characters; and
+# - The largest sequence of whitespace characters or, if the largest such sequence is only a single character and there exists a whitespace character preceded by a semantically meaningful non-whitespace splitter, then that whitespace character;
# - A semantically meaningful non-whitespace splitter.
# If the splitter is only a single character, see if we can target whitespace characters that are preceded by semantically meaningful non-whitespace splitters to avoid splitting in the middle of sentences.
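
The rule described in the comment above can be sketched in isolation. This is a minimal illustration only, not `semchunk`'s actual implementation: the function name `split_at_whitespace` and the short `NON_WHITESPACE_SPLITTERS` tuple are hypothetical stand-ins, since the real list of semantically meaningful characters lives in the README and is not shown in this diff.

```python
import re

# Hypothetical, abbreviated precedence list (an assumption for illustration);
# the library's real list is longer and is defined elsewhere.
NON_WHITESPACE_SPLITTERS = (".", "?", "!", ";", ",", ":")


def split_at_whitespace(text: str) -> list[str]:
    """Split `text` at whitespace, preferring whitespace that follows a
    semantically meaningful character when every whitespace run is one
    character long."""
    runs = re.findall(r"\s+", text)
    if not runs:
        # No whitespace at all; nothing to split on at this level.
        return [text]

    longest = max(runs, key=len)
    if len(longest) > 1:
        # A multi-character run exists, so split at it as before.
        return text.split(longest)

    # Every whitespace run is a single character. Look for whitespace
    # preceded by a meaningful character, in order of precedence.
    for mark in NON_WHITESPACE_SPLITTERS:
        escaped = re.escape(mark)
        if re.search(escaped + r"\s", text):
            # The lookbehind keeps the mark attached to the preceding piece
            # and consumes only the whitespace character.
            return re.split(rf"(?<={escaped})\s", text)

    # No such whitespace exists; fall back to the single-character splitter.
    return text.split(longest)
```

For example, under these assumptions `split_at_whitespace("Hello there. How are you? Fine.")` returns `['Hello there.', 'How are you? Fine.']` rather than one piece per word, because `.` takes precedence over `?` in the hypothetical list and only the whitespace following a `.` is used as a split point.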