fix: preserve whitespace during split in _split_text_fallback#431
fix: preserve whitespace during split in _split_text_fallback#431n3ural wants to merge 4 commits intochonkie-inc:mainfrom
Conversation
Summary of ChangesHello @n3ural, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical bug in the text chunking mechanism where whitespace-based splitting led to incorrect output, including concatenated words and index mismatches. The core change involves routing all whitespace splitting operations through a refined Python fallback implementation, which now correctly preserves and handles spaces according to the specified Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the SlumberChunker to ensure correct handling of whitespace splitting, particularly with the include_delim parameter. The _split_text method now explicitly uses a Python fallback (_split_text_fallback) for all whitespace splitting operations, bypassing the Cython implementation which did not fully support include_delim modes. The _split_text_fallback method has been updated to accurately manage whitespace inclusion for 'prev', 'next', and None include_delim settings, with the None mode now using the re module to preserve spaces as distinct elements. A comprehensive set of new unit tests has been added to test_slumber_chunker.py to validate the robustness of whitespace splitting, covering reconstruction, index alignment, various space configurations (multiple, leading/trailing), edge cases (empty string, single word), and a regression test for concatenation issues.
Summary
Fix index misalignment bug in
_split_text_fallbackthat caused concatenated words in chunked output when using whitespace splitting mode.Problem
When
RecursiveLevel(whitespace=True)is used, the Python fallback implementation loses space characters during splitting, causing:include_delim="prev", last word dropped withinclude_delim="next"sum(len(splits)) != len(original_text), causing chunk extraction to use wrong indices"Thequickbrownfox"instead of"The quick brown fox"Reproduction Example
Design Decision: Whitespace Handling Path
Decision: Whitespace mode uses Python fallback exclusively.
Context: The Cython regex splitting doesn't support the
include_delimparameter correctly for whitespace patterns. The Python fallback properly handles all three modes:"prev"["hello", " world"]"next"["hello ", "world"]None["hello", " ", "world"]Consequences: If Cython whitespace support is needed in the future, the
include_delimmodes must be implemented before removing the bypass.Changes
src/chonkie/chunker/slumber.py:_split_text_fallbackto preserve spaces and include all words (lines 145-154)tests/chunkers/test_slumber_chunker.py:Testing
All 39 tests pass (29 existing + 10 new whitespace tests).
New tests added:
test_split_text_whitespace_reconstruction- Verifies"".join(splits) == original_textfor all modestest_split_text_whitespace_index_alignment- Verifiessum(len(s)) == len(text)test_split_text_whitespace_multiple_spaces- Handles consecutive spacestest_split_text_whitespace_leading_trailing- Handles leading/trailing spacestest_split_text_whitespace_empty_string- Edge case: empty inputtest_split_text_whitespace_only_spaces- Edge case: only spacestest_split_text_whitespace_single_word- Edge case: no spacestest_whitespace_regression_no_concatenation- Regression test for concatenation bugtest_whitespace_mode_specific_structure- Verifies exact output per modetest_whitespace_with_merging_enabled- Tests with defaultmin_characters_per_chunk=24