Fix boundary_symbol detection when followed by whitespace#40

Merged
santhoshtr merged 1 commit into master from
fix/boundary-symbol-detection
Jan 19, 2026

@santhoshtr

Summary

Fixes issue #35 where sentence boundary symbols are not detected when followed by trailing whitespace.

Problem

When a sentence ended with a boundary symbol (e.g., .) followed by whitespace, the boundary_symbol field in the returned SentenceBoundary was incorrectly set to None. This affected the WASM bindings used in browsers, where users could not reliably determine which character ended each sentence.

Example of the Bug

Text: "Hello world. This is a test.Another test. And another test."
Expected: All 4 sentences should have boundary_symbol = "."
Actual: Only 2 sentences had the symbol, the other 2 were None

Root Cause

The boundary symbol detection code was examining the last character of the sentence slice paragraph[..end]. When the slice included trailing whitespace (the space after the period), the last character was a space, not the punctuation mark.
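The root cause can be reproduced with the standard library alone. The sketch below is not the code from src/languages/language.rs; `naive_boundary_symbol` and `is_terminator` are hypothetical names, and the terminator set is an assumption made for illustration. It only shows why inspecting the last character of an untrimmed slice yields None.

```rust
/// Hypothetical terminator set, for illustration only.
/// The library's real set of boundary symbols is language dependent.
fn is_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?')
}

/// Mirrors the buggy behaviour described above: look at the very last
/// character of the sentence slice without trimming it first.
fn naive_boundary_symbol(sentence: &str) -> Option<char> {
    sentence
        .chars()
        .next_back() // last character of the slice, possibly a space
        .filter(|c| is_terminator(*c))
}
```

With this version, `naive_boundary_symbol("Hello world. ")` returns `None` because the last character is the space, while `naive_boundary_symbol("Hello world.")` returns `Some('.')`.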

Solution

The fix trims trailing whitespace before checking for the boundary symbol, ensuring that punctuation marks (., !, ?, etc.) are correctly identified regardless of trailing whitespace.
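A minimal sketch of the trimming approach, assuming a free function rather than the library's actual internals: `boundary_symbol` and `is_sentence_terminator` are hypothetical names, and the terminator set (including the CJK full stop) is an assumption. `str::trim_end` removes all trailing Unicode whitespace, so spaces and tabs are both covered.

```rust
/// Hypothetical terminator set, for illustration only.
fn is_sentence_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?' | '。')
}

/// Trim trailing whitespace first, then inspect the last remaining character.
fn boundary_symbol(sentence: &str) -> Option<char> {
    sentence
        .trim_end() // drops trailing spaces, tabs, newlines
        .chars()
        .next_back() // last non-whitespace character, if any
        .filter(|c| is_sentence_terminator(*c))
}
```

Under these assumptions, `boundary_symbol("Hello world. ")` and `boundary_symbol("Wait!\t  ")` return `Some('.')` and `Some('!')`, while a slice that ends without a terminator still returns `None`.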

Changes

  • src/languages/language.rs: Modified the boundary symbol detection logic to trim trailing whitespace before extracting the terminator character
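  • src/lib.rs: Added 5 test cases covering a single trailing space (the main bug reproduction), multiple trailing spaces, mixed terminators, a CJK terminator, and tabs mixed with spaces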

Testing

  • ✅ All 5 new tests pass
  • ✅ All 52+ existing tests pass with no regressions
  • ✅ Multi-byte character handling verified (CJK, emoji)
  • ✅ Code quality checks pass (clippy, fmt)
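Multi-byte handling matters here because a Rust string cannot be sliced in the middle of a code point without panicking. The sketch below is an illustration under assumptions, not the project's code: `trailing_symbol_slice` is a hypothetical helper that returns the last non-whitespace character as a string slice, using `char_indices` so the slice always starts on a character boundary. It does not check whether that character is a terminator.

```rust
/// Returns the last non-whitespace character of `sentence` as a `&str`,
/// or `None` if the slice is empty or whitespace only.
fn trailing_symbol_slice(sentence: &str) -> Option<&str> {
    let trimmed = sentence.trim_end();
    // char_indices yields byte offsets that are valid character boundaries,
    // so slicing from `idx` is safe for CJK text and emoji.
    let (idx, _) = trimmed.char_indices().next_back()?;
    Some(&trimmed[idx..])
}
```

Subtracting one byte from the trimmed length would work for ASCII terminators but would panic on a three-byte character such as 。, which is why the offset comes from `char_indices` instead.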

Impact

This is a bug fix with no breaking changes. Users of the library will now correctly receive boundary symbol information for all sentences, improving the accuracy of the metadata returned by get_sentence_boundaries().

Fixes an issue where sentence boundary symbols were not detected when
followed by trailing whitespace. The problem occurred because the code
was looking at the last character of the sentence slice, which could be
a space rather than the terminating punctuation.

The fix trims trailing whitespace before checking for the boundary symbol,
ensuring that punctuation marks like '.', '!', '?' are correctly identified
regardless of whether they're followed by spaces or tabs.

Changes:
- Modified src/languages/language.rs to trim trailing whitespace before
  detecting boundary symbols
- Added 5 new comprehensive test cases to verify the fix:
  - test_boundary_symbol_detection_with_trailing_space (main bug reproduction)
  - test_boundary_symbol_with_multiple_trailing_spaces
  - test_boundary_symbol_with_mixed_terminators
  - test_boundary_symbol_with_cjk_terminator
  - test_boundary_symbol_with_tabs_and_spaces

All existing tests pass with no regressions.
@santhoshtr

PR Prepared using opencode+claude-sonnet-4.5
Acknowledging the fix proposal by @fosple. Thanks.

@santhoshtr santhoshtr closed this Jan 19, 2026
@santhoshtr santhoshtr reopened this Jan 19, 2026
@santhoshtr santhoshtr merged commit 74b7839 into master Jan 19, 2026
39 checks passed
@santhoshtr santhoshtr deleted the fix/boundary-symbol-detection branch January 19, 2026 06:47