fix: preserve whitespace during split in _split_text_fallback#431

Open
n3ural wants to merge 4 commits into chonkie-inc:main from n3ural:fix/whitespace-split-index-alignment

@n3ural commented Jan 5, 2026

Summary

Fix index misalignment bug in _split_text_fallback that caused concatenated words in chunked output when using whitespace splitting mode.

Problem

When RecursiveLevel(whitespace=True) is used, the Python fallback implementation loses space characters during splitting, causing:

  1. Word dropping: First word dropped with include_delim="prev", last word dropped with include_delim="next"
  2. Index drift: sum(len(s) for s in splits) != len(original_text), so chunk extraction uses the wrong indices
  3. Concatenated output: Final chunks contain text like "Thequickbrownfox" instead of "The quick brown fox"
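
The index drift in item 2 can be illustrated without chonkie at all. The following is a minimal sketch that uses plain str.split as a stand-in for the lossy behavior (an assumption; it is not the library's code) and shows how offsets built from cumulative split lengths slide off the original text:

```python
text = "The quick brown fox"

# Stand-in for the lossy split: str.split(" ") discards every space.
lossy = text.split(" ")

# The split lengths no longer add up to the length of the original text.
total = sum(len(s) for s in lossy)
missing = len(text) - total  # one character lost per space

# Build (start, end) offsets from cumulative lengths, as a chunker might.
offsets = []
pos = 0
for s in lossy:
    offsets.append((pos, pos + len(s)))
    pos += len(s)

# Slicing the original text with those offsets returns misaligned pieces.
recovered = [text[a:b] for a, b in offsets]
print(total, len(text), missing)
print(recovered)
```

Each dropped space shifts every later offset by one character, which is why the drift grows toward the end of the text.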

Reproduction Example

from chonkie.chunker import SlumberChunker
from chonkie.types import RecursiveLevel

class MockGenie:
    def generate(self, prompt): return "mock"
    def generate_json(self, prompt, schema): return {"split_index": 1}

chunker = SlumberChunker(genie=MockGenie(), min_characters_per_chunk=1)
text = "The quick brown fox"

# Bug 1: include_delim="prev" drops first word
level = RecursiveLevel(whitespace=True, include_delim="prev")
splits = chunker._split_text(text, level)
print(splits)           # BUGGY: [' quick', ' brown', ' fox']  (first word dropped!)
print("".join(splits))  # BUGGY: ' quick brown fox'
# FIXED: ['The', ' quick', ' brown', ' fox']

# Bug 2: include_delim=None loses all spaces
level = RecursiveLevel(whitespace=True, include_delim=None)
splits = chunker._split_text(text, level)
print(splits)           # BUGGY: ['The', 'quick', 'brown', 'fox']
print("".join(splits))  # BUGGY: 'Thequickbrownfox'
# FIXED: ['The', ' ', 'quick', ' ', 'brown', ' ', 'fox']

Design Decision: Whitespace Handling Path

Decision: Whitespace mode uses Python fallback exclusively.

Context: The Cython regex splitting doesn't support the include_delim parameter correctly for whitespace patterns. The Python fallback properly handles all three modes:

Mode      Output for "hello world"
"prev"    ["hello", " world"]
"next"    ["hello ", "world"]
None      ["hello", " ", "world"]

Consequences: If Cython whitespace support is needed in the future, the include_delim modes must be implemented before removing the bypass.
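
For reference, the three modes in the table can be sketched with the standard re module. This is a hypothetical helper (split_on_spaces is not a chonkie function, and the PR's actual fallback may be written differently); how runs of consecutive spaces are grouped is an assumption made here, chosen so that joining the pieces always reproduces the input:

```python
import re
from typing import List, Optional


def split_on_spaces(text: str, include_delim: Optional[str] = None) -> List[str]:
    """Hypothetical lossless space splitter following the table's three modes."""
    if include_delim == "prev":
        # Spaces are attached to the front of the word that follows them;
        # trailing spaces with no following word become their own piece.
        pattern = r" *[^ ]+| +"
    elif include_delim == "next":
        # Spaces are attached to the end of the word that precedes them;
        # leading spaces with no preceding word become their own piece.
        pattern = r"[^ ]+ *| +"
    elif include_delim is None:
        # Every word and every single space is a separate piece.
        pattern = r"[^ ]+| "
    else:
        raise ValueError(f"unsupported include_delim: {include_delim!r}")
    return re.findall(pattern, text)


for mode in ("prev", "next", None):
    parts = split_on_spaces("hello world", mode)
    # Every character of the input is matched by exactly one piece.
    assert "".join(parts) == "hello world"
    print(mode, parts)
```

Because each pattern matches every character exactly once, the reconstruction and length invariants hold in all three modes, including for empty input and input made only of spaces.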

Changes

  • src/chonkie/chunker/slumber.py:

    • Route whitespace mode through Python fallback (lines 107-110)
    • Fix _split_text_fallback to preserve spaces and include all words (lines 145-154)
  • tests/chunkers/test_slumber_chunker.py:

    • Added 10 whitespace-specific tests
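
The routing change can be pictured as a small dispatch step. The sketch below is illustrative only: choose_splitter, fake_fast and fake_fallback are hypothetical names, and the real logic in slumber.py may be structured differently.

```python
from typing import Callable, List, Optional

Splitter = Callable[[str], List[str]]


def choose_splitter(whitespace: bool, fast: Optional[Splitter], fallback: Splitter) -> Splitter:
    """Pick the splitting path; whitespace mode always bypasses the fast path."""
    if whitespace or fast is None:
        return fallback
    return fast


def fake_fast(text: str) -> List[str]:
    # Placeholder for a compiled (for example Cython) splitter.
    return text.split(".")


def fake_fallback(text: str) -> List[str]:
    # Placeholder for the pure-Python fallback.
    return [text]


# Whitespace mode uses the fallback even when a fast path is available.
assert choose_splitter(True, fake_fast, fake_fallback) is fake_fallback
# Other modes keep using the fast path when it exists.
assert choose_splitter(False, fake_fast, fake_fallback) is fake_fast
# Without a fast path, everything goes through the fallback.
assert choose_splitter(False, None, fake_fallback) is fake_fallback
```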

Testing

All 39 tests pass (29 existing + 10 new whitespace tests).

New tests added:

  • test_split_text_whitespace_reconstruction - Verifies "".join(splits) == original_text for all modes
  • test_split_text_whitespace_index_alignment - Verifies sum(len(s) for s in splits) == len(text)
  • test_split_text_whitespace_multiple_spaces - Handles consecutive spaces
  • test_split_text_whitespace_leading_trailing - Handles leading/trailing spaces
  • test_split_text_whitespace_empty_string - Edge case: empty input
  • test_split_text_whitespace_only_spaces - Edge case: only spaces
  • test_split_text_whitespace_single_word - Edge case: no spaces
  • test_whitespace_regression_no_concatenation - Regression test for concatenation bug
  • test_whitespace_mode_specific_structure - Verifies exact output per mode
  • test_whitespace_with_merging_enabled - Tests with default min_characters_per_chunk=24
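
The reconstruction and index-alignment checks boil down to two invariants that any lossless splitter should satisfy. A rough sketch of that idea, using stand-in splitters rather than SlumberChunker itself (check_invariants, lossless and lossy are hypothetical names, and the inputs only approximate the edge cases listed above):

```python
import re
from typing import Callable, List


def check_invariants(split_fn: Callable[[str], List[str]], text: str) -> bool:
    """Return True when the splits rebuild the text and their lengths add up."""
    splits = split_fn(text)
    return "".join(splits) == text and sum(len(s) for s in splits) == len(text)


def lossless(text: str) -> List[str]:
    # Stand-in splitter that keeps every space as its own piece.
    return re.findall(r"[^ ]+| ", text)


def lossy(text: str) -> List[str]:
    # Stand-in for the buggy behavior: spaces are discarded.
    return text.split(" ")


# Inputs modeled on the edge cases named in the test list.
CASES = ["The quick brown fox", "a  b", " lead", "trail ", "", "   ", "word"]

results_lossless = [check_invariants(lossless, c) for c in CASES]
results_lossy = [check_invariants(lossy, c) for c in CASES]
print(results_lossless)
print(results_lossy)
```

The lossy stand-in only passes on inputs that contain no spaces, which is the behavior a regression test for the concatenation bug is meant to catch.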

@gemini-code-assist

Summary of Changes

Hello @n3ural, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the text chunking mechanism where whitespace-based splitting led to incorrect output, including concatenated words and index mismatches. The core change involves routing all whitespace splitting operations through a refined Python fallback implementation, which now correctly preserves and handles spaces according to the specified include_delim mode. This ensures that the original text can always be accurately reconstructed from its generated chunks, significantly improving the reliability of the chunking process.

Highlights

  • Bug Fix: Whitespace Splitting: Resolved an index misalignment bug in _split_text_fallback that previously caused concatenated words, word dropping, and index drift when using whitespace splitting mode.
  • Whitespace Preservation: The fix ensures that whitespace characters are correctly preserved during text splitting, allowing for accurate reconstruction of the original text from its chunks.
  • Design Decision: Python Fallback: Whitespace splitting now exclusively uses the Python fallback implementation because the Cython extension does not correctly support the include_delim parameter for whitespace patterns.
  • Expanded Test Coverage: Added 10 new whitespace-specific tests to cover various scenarios, including reconstruction, index alignment, multiple spaces, leading/trailing spaces, empty strings, and regression for the concatenation bug.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the SlumberChunker to ensure correct handling of whitespace splitting, particularly with the include_delim parameter. The _split_text method now explicitly uses a Python fallback (_split_text_fallback) for all whitespace splitting operations, bypassing the Cython implementation which did not fully support include_delim modes. The _split_text_fallback method has been updated to accurately manage whitespace inclusion for 'prev', 'next', and None include_delim settings, with the None mode now using the re module to preserve spaces as distinct elements. A comprehensive set of new unit tests has been added to test_slumber_chunker.py to validate the robustness of whitespace splitting, covering reconstruction, index alignment, various space configurations (multiple, leading/trailing), edge cases (empty string, single word), and a regression test for concatenation issues.
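
The review mentions that the None mode relies on the re module to keep spaces as separate elements. One way this is commonly done, shown here as a sketch rather than the PR's exact code, is re.split with a capturing group, which returns the delimiters alongside the pieces; split_keep_spaces is a hypothetical name:

```python
import re
from typing import List


def split_keep_spaces(text: str) -> List[str]:
    """Split on single spaces but keep each space as its own element."""
    # The capturing group makes re.split return the matched spaces too.
    # Empty strings appear between adjacent spaces and at the edges,
    # so they are filtered out.
    return [piece for piece in re.split(r"( )", text) if piece]


plain = "a  b".split(" ")        # spaces are discarded
raw = re.split(r"( )", "a  b")   # spaces kept, plus an empty string between them
kept = split_keep_spaces("a  b") # spaces kept, empty strings removed
print(plain, raw, kept)
```

Filtering out the empty strings does not affect reconstruction, since they contribute no characters to the joined result.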
