fix: preserve whitespace during split in _split_text_fallback by n3ural · Pull Request #431 · chonkie-inc/chonkie

n3ural · 2026-01-05T00:56:19Z

Summary

Fix index misalignment bug in _split_text_fallback that caused concatenated words in chunked output when using whitespace splitting mode.

Problem

When RecursiveLevel(whitespace=True) is used, the Python fallback implementation loses space characters during splitting, causing:

- Word dropping: the first word is dropped with include_delim="prev", and the last word is dropped with include_delim="next"
- Index drift: sum(len(s) for s in splits) != len(original_text), so chunk extraction uses the wrong indices
- Concatenated output: final chunks contain text like "Thequickbrownfox" instead of "The quick brown fox"
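
To make the index-drift symptom concrete, here is a minimal, self-contained sketch. It does not use chonkie; `offsets_from_splits` is a hypothetical helper that assumes, as the chunk extraction step presumably does, that the split pieces tile the original text exactly.

```python
def offsets_from_splits(splits):
    """Return (start, end) offsets, assuming the pieces tile the text exactly."""
    offsets = []
    cursor = 0
    for piece in splits:
        offsets.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return offsets


text = "The quick brown fox"

# Lossy split: str.split(" ") discards every space.
lossy = text.split(" ")
# Lossless split: every character of `text` lands in exactly one piece.
lossless = ["The", " quick", " brown", " fox"]

for name, splits in (("lossy", lossy), ("lossless", lossless)):
    total = sum(len(s) for s in splits)
    recovered = [text[start:end] for start, end in offsets_from_splits(splits)]
    print(f"{name}: total={total}, len(text)={len(text)}, slices={recovered}")
```

With the lossy split, the summed length falls short of `len(text)` by one character per dropped space, so every slice after the first starts too early and the tail of the text is never reached.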

Reproduction Example

from chonkie.chunker import SlumberChunker
from chonkie.types import RecursiveLevel

class MockGenie:
    def generate(self, prompt): return "mock"
    def generate_json(self, prompt, schema): return {"split_index": 1}

chunker = SlumberChunker(genie=MockGenie(), min_characters_per_chunk=1)
text = "The quick brown fox"

# Bug 1: include_delim="prev" drops first word
level = RecursiveLevel(whitespace=True, include_delim="prev")
splits = chunker._split_text(text, level)
print(splits)           # BUGGY: [' quick', ' brown', ' fox']  (first word dropped!)
print("".join(splits))  # BUGGY: ' quick brown fox'
# FIXED: ['The', ' quick', ' brown', ' fox']

# Bug 2: include_delim=None loses all spaces
level = RecursiveLevel(whitespace=True, include_delim=None)
splits = chunker._split_text(text, level)
print(splits)           # BUGGY: ['The', 'quick', 'brown', 'fox']
print("".join(splits))  # BUGGY: 'Thequickbrownfox'
# FIXED: ['The', ' ', 'quick', ' ', 'brown', ' ', 'fox']

Design Decision: Whitespace Handling Path

Decision: Whitespace mode uses Python fallback exclusively.

Context: The Cython regex splitting doesn't support the include_delim parameter correctly for whitespace patterns. The Python fallback properly handles all three modes:

Mode	Output for "hello world"
`"prev"`	`["hello", " world"]`
`"next"`	`["hello ", "world"]`
`None`	`["hello", " ", "world"]`

Consequences: If Cython whitespace support is needed in the future, the include_delim modes must be implemented before removing the bypass.
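
For readers who want to see the three modes side by side, the following is a rough sketch of a lossless space splitter. It is not the code from this PR: `split_on_spaces` is a hypothetical standalone function, it only treats the single space character as a delimiter, and its handling of consecutive spaces is an assumption.

```python
from typing import List, Optional


def split_on_spaces(text: str, include_delim: Optional[str] = "prev") -> List[str]:
    """Split on spaces without dropping any character (illustrative only)."""
    if include_delim not in ("prev", "next", None):
        raise ValueError(f"unsupported include_delim: {include_delim!r}")

    pieces: List[str] = []
    current = ""
    for ch in text:
        if ch != " ":
            current += ch
        elif include_delim == "prev":
            # The space opens the following piece, e.g. " world".
            if current:
                pieces.append(current)
            current = " "
        elif include_delim == "next":
            # The space closes the current piece, e.g. "hello ".
            pieces.append(current + " ")
            current = ""
        else:
            # None: the space becomes its own piece.
            if current:
                pieces.append(current)
            pieces.append(" ")
            current = ""
    if current:
        pieces.append(current)
    return pieces


for mode in ("prev", "next", None):
    print(repr(mode), split_on_spaces("hello world", mode))
```

In all three modes the pieces concatenate back to the input, which is the property the Cython path would also need before the bypass could be removed.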

Changes

src/chonkie/chunker/slumber.py:
- Route whitespace mode through Python fallback (lines 107-110)
- Fix _split_text_fallback to preserve spaces and include all words (lines 145-154)
tests/chunkers/test_slumber_chunker.py:
- Added 10 whitespace-specific tests

Testing

All 39 tests pass (29 existing + 10 new whitespace tests).

New tests added:

- test_split_text_whitespace_reconstruction - Verifies "".join(splits) == original_text for all modes
- test_split_text_whitespace_index_alignment - Verifies sum(len(s) for s in splits) == len(text)
- test_split_text_whitespace_multiple_spaces - Handles consecutive spaces
- test_split_text_whitespace_leading_trailing - Handles leading/trailing spaces
- test_split_text_whitespace_empty_string - Edge case: empty input
- test_split_text_whitespace_only_spaces - Edge case: only spaces
- test_split_text_whitespace_single_word - Edge case: no spaces
- test_whitespace_regression_no_concatenation - Regression test for concatenation bug
- test_whitespace_mode_specific_structure - Verifies exact output per mode
- test_whitespace_with_merging_enabled - Tests with default min_characters_per_chunk=24
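
The reconstruction and index-alignment checks listed above can be expressed as one reusable invariant. The sketch below is not taken from tests/chunkers/test_slumber_chunker.py: `check_lossless` and `stand_in_split` are hypothetical names, and the regex splitter is only a stand-in for the "prev" mode so that the example runs without chonkie installed.

```python
import re
from typing import Callable, List

# Inputs mirroring the edge cases named in the test list.
EDGE_CASES = ["The quick brown fox", "a  b", " lead", "trail ", "", "   ", "word"]


def check_lossless(split_fn: Callable[[str], List[str]], text: str) -> None:
    """Assert that the pieces returned by split_fn tile `text` exactly."""
    splits = split_fn(text)
    assert "".join(splits) == text, f"reconstruction failed for {text!r}"
    assert sum(len(s) for s in splits) == len(text), f"index drift for {text!r}"
    assert all(splits), f"empty piece produced for {text!r}"


def stand_in_split(text: str) -> List[str]:
    """'prev'-style stand-in: each word keeps the space before it."""
    return re.findall(r" ?[^ ]+| ", text)


for case in EDGE_CASES:
    check_lossless(stand_in_split, case)
print("all edge cases reconstruct exactly")
```

Passing a lossy splitter such as `lambda t: t.split(" ")` to `check_lossless` fails on the first assertion for any input containing a space, which is the behavior the regression test is meant to catch.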

gemini-code-assist · 2026-01-05T00:56:33Z

Summary of Changes

Hello @n3ural, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the text chunking mechanism where whitespace-based splitting led to incorrect output, including concatenated words and index mismatches. The core change involves routing all whitespace splitting operations through a refined Python fallback implementation, which now correctly preserves and handles spaces according to the specified include_delim mode. This ensures that the original text can always be accurately reconstructed from its generated chunks, significantly improving the reliability of the chunking process.

Highlights

Bug Fix: Whitespace Splitting: Resolved an index misalignment bug in _split_text_fallback that previously caused concatenated words, word dropping, and index drift when using whitespace splitting mode.
Whitespace Preservation: The fix ensures that whitespace characters are correctly preserved during text splitting, allowing for accurate reconstruction of the original text from its chunks.
Design Decision: Python Fallback: Whitespace splitting now exclusively uses the Python fallback implementation because the Cython extension does not correctly support the include_delim parameter for whitespace patterns.
Expanded Test Coverage: Added 10 new whitespace-specific tests to cover various scenarios, including reconstruction, index alignment, multiple spaces, leading/trailing spaces, empty strings, and regression for the concatenation bug.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the SlumberChunker to ensure correct handling of whitespace splitting, particularly with the include_delim parameter. The _split_text method now explicitly uses a Python fallback (_split_text_fallback) for all whitespace splitting operations, bypassing the Cython implementation which did not fully support include_delim modes. The _split_text_fallback method has been updated to accurately manage whitespace inclusion for 'prev', 'next', and None include_delim settings, with the None mode now using the re module to preserve spaces as distinct elements. A comprehensive set of new unit tests has been added to test_slumber_chunker.py to validate the robustness of whitespace splitting, covering reconstruction, index alignment, various space configurations (multiple, leading/trailing), edge cases (empty string, single word), and a regression test for concatenation issues.
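
As a rough illustration of how the re module can keep spaces as distinct elements in the None mode, the sketch below uses `re.split` with a capturing group. Whether the PR uses this exact pattern is an assumption, and `split_keep_spaces` is a hypothetical name.

```python
import re
from typing import List


def split_keep_spaces(text: str) -> List[str]:
    """Split on single spaces, keeping each space as its own element."""
    # A capturing group makes re.split return the delimiters too.
    # Leading, trailing, or adjacent spaces yield empty strings, so drop them.
    return [piece for piece in re.split(r"( )", text) if piece]


print(split_keep_spaces("The quick brown fox"))
print(split_keep_spaces("a  b"))
```

Filtering out the empty strings does not affect reconstruction, since they contribute no characters to the joined result.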

n3ural added 2 commits January 5, 2026 01:33

fix whitespace preservation during split in _split_text_fallback

5ac1f34

add whitespace preservation tests

acaadfe

gemini-code-assist bot reviewed Jan 5, 2026

n3ural added 2 commits January 8, 2026 10:38

Merge branch 'main' into fix/whitespace-split-index-alignment

478452b

Merge branch 'main' into fix/whitespace-split-index-alignment

9819cf7

n3ural wants to merge 4 commits into chonkie-inc:main from n3ural:fix/whitespace-split-index-alignment