Fix(sentencizer): Correct sentence segmentation for quoted text (issue #13883) #13897
+39
−5
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR resolves issue #13883, which caused incorrect sentence segmentation for text containing quoted dialogue, particularly with guillemets (
« »).The Bug
The
Sentencizerdid not track quote nesting depth. As a result, it would prematurely split sentences whenever it encountered sentence-ending punctuation (!,?,.) inside a quote, fragmenting dialogue into multiple incorrect sentences.How to Reproduce the Bug
The following code demonstrates the incorrect behavior on an unpatched version of spaCy:
Incorrect Output (Before Fix):
The Fix
The fix makes the
Sentencizercontext-aware of quotes by updating the prediction logic inspacy/pipeline/sentencizer.pyx:quote_depthcounter now tracks when the tokenizer is inside a quoted section, handling various quote types (symmetric and asymmetric) and nested quotes.pending_split_after_quoteflag defers sentence splits when punctuation is found inside a quote, applying the split only after the quote is closed.With this fix, the reproduction code now produces the Correct Output:
Testing and Validation
The modified component compiles successfully and has been validated against a comprehensive test suite (139+ test cases) simulating the sentencizer's logic. Testing confirmed that the fix resolves the original bug and introduces no regressions. The validation covered:
Types of change
A bug fix.
Checklist