Fix boundary_symbol detection when followed by whitespace #40
Merged
santhoshtr merged 1 commit into master on Jan 19, 2026
Fixes an issue where sentence boundary symbols were not detected when followed by trailing whitespace.

The problem occurred because the code was looking at the last character of the sentence slice, which could be a space rather than the terminating punctuation. The fix trims trailing whitespace before checking for the boundary symbol, so punctuation marks like '.', '!', '?' are correctly identified regardless of whether they are followed by spaces or tabs.

Changes:
- Modified src/languages/language.rs to trim trailing whitespace before detecting boundary symbols
- Added 5 new test cases to verify the fix:
  - test_boundary_symbol_detection_with_trailing_space (main bug reproduction)
  - test_boundary_symbol_with_multiple_trailing_spaces
  - test_boundary_symbol_with_mixed_terminators
  - test_boundary_symbol_with_cjk_terminator
  - test_boundary_symbol_with_tabs_and_spaces

All existing tests pass with no regressions.
PR prepared using opencode+claude-sonnet-4.5
Summary
Fixes issue #35 where sentence boundary symbols are not detected when followed by trailing whitespace.
Problem
When a sentence ended with a boundary symbol (e.g., '.') followed by whitespace, the boundary_symbol field in the returned SentenceBoundary was incorrectly set to None. This affected the WASM bindings used in browsers, where users could not accurately determine which character ended each sentence.
Example of the Bug
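The original example is not preserved in this text. As a rough illustration only, the sketch below reproduces the symptom on plain string slices. The helpers last_char_naive and is_terminator are hypothetical stand-ins for the pre-fix logic, not the code in src/languages/language.rs, and the set of terminators is an assumption.

```rust
// Hypothetical stand-in for the pre-fix logic: take the last character of
// the sentence slice without trimming anything.
fn last_char_naive(paragraph: &str, end: usize) -> Option<char> {
    paragraph[..end].chars().last()
}

// Assumed, simplified terminator set for illustration.
fn is_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?' | '。')
}

fn main() {
    let paragraph = "Hello world. Next sentence.";
    // Assume the first sentence slice includes the space after the period.
    let end = "Hello world. ".len();
    let last = last_char_naive(paragraph, end);
    // The last character is the space, so no terminator is reported.
    println!("{:?}", last.filter(|c| is_terminator(*c))); // prints "None"
}
```

Because the slice ends in a space, the lookup never reaches the period, which matches the None described above.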
Root Cause
The boundary symbol detection code was examining the last character of the sentence slice paragraph[..end]. When the slice included trailing whitespace (the space after the period), the last character was a space, not the punctuation mark.
Solution
The fix trims trailing whitespace before checking for the boundary symbol, ensuring that punctuation marks ('.', '!', '?', '。', etc.) are correctly identified regardless of trailing whitespace.
Changes
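The trimming approach described in the Solution can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual diff: the function boundary_symbol and its hard-coded terminator set are hypothetical, and the real code in src/languages/language.rs may be structured differently.

```rust
/// Hypothetical helper: returns the terminating punctuation of the sentence
/// slice `paragraph[..end]`, ignoring any trailing whitespace.
fn boundary_symbol(paragraph: &str, end: usize) -> Option<char> {
    paragraph[..end]
        .trim_end() // drops trailing spaces, tabs, and other Unicode whitespace
        .chars()
        .last()
        // Assumed terminator set, for illustration only.
        .filter(|c| matches!(*c, '.' | '!' | '?' | '。'))
}

fn main() {
    let paragraph = "Hello world. \t Next sentence!";
    let end = "Hello world. \t ".len();
    assert_eq!(boundary_symbol(paragraph, end), Some('.'));
    assert_eq!(boundary_symbol(paragraph, paragraph.len()), Some('!'));
    // CJK terminator followed by a space.
    assert_eq!(boundary_symbol("你好。 ", "你好。 ".len()), Some('。'));
    // No terminator at all still yields None.
    assert_eq!(boundary_symbol("no terminator ", 14), None);
}
```

str::trim_end is used here because it strips tabs as well as spaces, which covers the mixed-whitespace cases named in the test list.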
Testing
Impact
This is a bug fix with no breaking changes. Users of the library will now correctly receive boundary symbol information for all sentences, improving the accuracy of the metadata returned by get_sentence_boundaries().