Commit e24fbd4

feat(algorithm): improved quality of chunks, particularly with low chunk sizes or few newlines (#17)
1 parent 09a29ea commit e24fbd4

4 files changed, with 18 additions and 3 deletions

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,10 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.2.0] - 2025-03-20
+### Changed
+- Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimal varying levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at the occurrence of single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general ([#17](https://github.com/isaacus-dev/semchunk/issues/17)).
+
 ## [3.1.3] - 2025-03-10
 ### Changed
 - Added mention of Isaacus to the README.
@@ -141,6 +145,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[3.2.0]: https://github.com/isaacus-dev/semchunk/compare/v3.1.3...v3.2.0
 [3.1.3]: https://github.com/isaacus-dev/semchunk/compare/v3.1.2...v3.1.3
 [3.1.2]: https://github.com/isaacus-dev/semchunk/compare/v3.1.1...v3.1.2
 [3.1.1]: https://github.com/isaacus-dev/semchunk/compare/v3.1.0...v3.1.1

README.md

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ This function returns a list of chunks up to `chunk_size`-tokens-long, with any
 To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
 1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
 1. The largest sequence of tabs;
-1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
+1. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, since version 3.2.0, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
 1. Sentence terminators (`.`, `?`, `!` and `*`);
 1. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
 1. Sentence interrupters (`:`, `—` and `…`);
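
Reviewer note: the effect of the reworded whitespace rule can be seen in a minimal sketch. This is not the library's implementation. `split_at_single_whitespace` and `PRECEDERS` are hypothetical names, `PRECEDERS` is only an assumed subset of the splitters listed in the README, and the sketch assumes the text's longest whitespace run is a single character.

```python
import re

# Assumed subset of the README's non-whitespace splitters, in precedence order
# (sentence terminators, then two clause separators, then one interrupter).
PRECEDERS = (".", "?", "!", "*", ";", ",", ":")


def split_at_single_whitespace(text: str) -> list[str]:
    """Hypothetical helper: prefer whitespace that follows a meaningful character."""
    for preceder in PRECEDERS:
        escaped = re.escape(preceder)
        match = re.search(rf"{escaped}(\s)", text)
        if match:
            whitespace = re.escape(match.group(1))
            # The lookbehind keeps the preceder attached to the left-hand piece.
            return re.split(rf"(?<={escaped}){whitespace}", text)

    # No preceder is followed by whitespace: fall back to every whitespace character.
    return re.split(r"\s", text)


text = "It rained. We stayed in, mostly. Nobody minded."

every_space = text.split(" ")                  # splitting at all single spaces
preferred = split_at_single_whitespace(text)   # splitting only after "."
```

Under these assumptions, `every_space` holds eight one-word pieces while `preferred` holds the three sentences, which is the difference the changelog entry describes for texts with few newlines.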

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "3.1.3"
+version = "3.2.0"
 authors = [
     {name="Isaacus", email="[email protected]"},
     {name="Umar Butler", email="[email protected]"},

src/semchunk/semchunk.py

Lines changed: 11 additions & 1 deletion
@@ -55,6 +55,7 @@
 )
 """A tuple of semantically meaningful non-whitespace splitters that may be used to chunk texts, ordered from most desirable to least desirable."""
 
+_REGEX_ESCAPED_NON_WHITESPACE_SEMANTIC_SPLITTERS = tuple(re.escape(splitter) for splitter in _NON_WHITESPACE_SEMANTIC_SPLITTERS)
 
 def _split_text(text: str) -> tuple[str, bool, list[str]]:
     """Split text using the most semantically meaningful splitter possible."""
@@ -64,7 +65,7 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
     # Try splitting at, in order of most desirable to least desirable:
     # - The largest sequence of newlines and/or carriage returns;
     # - The largest sequence of tabs;
-    # - The largest sequence of whitespace characters; and
+    # - The largest sequence of whitespace characters or, if the largest such sequence is only a single character and there exists a whitespace character preceded by a semantically meaningful non-whitespace splitter, then that whitespace character;
     # - A semantically meaningful non-whitespace splitter.
     if "\n" in text or "\r" in text:
         splitter = max(re.findall(r"[\r\n]+", text))
@@ -74,6 +75,15 @@ def _split_text(text: str) -> tuple[str, bool, list[str]]:
 
     elif re.search(r"\s", text):
         splitter = max(re.findall(r"\s+", text))
+
+        # If the splitter is only a single character, see if we can target whitespace characters that are preceded by semantically meaningful non-whitespace splitters to avoid splitting in the middle of sentences.
+        if len(splitter) == 1:
+            for escaped_preceder in _REGEX_ESCAPED_NON_WHITESPACE_SEMANTIC_SPLITTERS:
+                if (whitespace_preceded_by_preceder := re.search(rf'{escaped_preceder}(\s)', text)):
+                    splitter = whitespace_preceded_by_preceder.group(1)
+                    escaped_splitter = re.escape(splitter)
+
+                    return splitter, splitter_is_whitespace, re.split(rf'(?<={escaped_preceder}){escaped_splitter}', text)
 
     else:
         # Identify the most desirable semantically meaningful non-whitespace splitter present in the text.
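
Reviewer note: two details of the added branch are easy to miss, namely why the preceders are passed through `re.escape` before being interpolated and what the `(?<=...)` lookbehind contributes. The following standalone sketch demonstrates both with the standard `re` module only; the sample text and variable names are illustrative assumptions, not taken from the commit.

```python
import re

text = "Stop? Go. Wait"

# 1. Unescaped, "?" is a quantifier with nothing to repeat, so the pattern is invalid.
try:
    re.search(r"?(\s)", text)
    unescaped_error = None
except re.error as error:
    unescaped_error = error

# 2. Unescaped, "." matches any character, so every space would qualify.
loose = re.split(r"(?<=.) ", text)

# 3. Escaped, "." matches only a literal full stop.
escaped_preceder = re.escape(".")
strict = re.split(rf"(?<={escaped_preceder}) ", text)

# 4. The lookbehind consumes only the space, so the full stop stays in the
#    left-hand piece and re-joining with the splitter restores the text.
rejoined = " ".join(strict)
```

Here `loose` splits after both `?` and `.`, whereas `strict` splits only after the full stop, and `rejoined` equals the original text. That round trip is presumably why the function can keep returning the bare whitespace character as `splitter`.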
