If you already have working logic that includes rules, instead of using the Sentencizer you can create a small custom component that assigns is_sent_start to all tokens in a Doc. The sentencizer is only for very simple punctuation-based sentence segmentation.
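A minimal sketch of such a component, assuming spaCy v3: the component name `custom_sentencizer` and the uppercase-after-punctuation rule are illustrative placeholders for your own logic, not a recommended rule set.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("custom_sentencizer")
def custom_sentencizer(doc: Doc) -> Doc:
    # Assign is_sent_start explicitly for every token, so no boundary
    # is left undecided for downstream components.
    for i, token in enumerate(doc):
        if i == 0:
            doc[i].is_sent_start = True
        elif doc[i - 1].text in (".", "!", "?") and not token.text[0].islower():
            # Hypothetical rule: start a new sentence after sentence-final
            # punctuation when the next token is not lowercase.
            doc[i].is_sent_start = True
        else:
            doc[i].is_sent_start = False
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("custom_sentencizer")

doc = nlp("This is one. Here is another.")
print([sent.text for sent in doc.sents])
```

Because the component sets a value for every token, `doc.sents` is fully determined by your rules; if it runs before a parser, the parser will respect the boundaries it sets.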

We don't want to split on abbreviations (e.g. Govt., Inc., etc.) or bullet points (a., b)., etc.).

Note that if you're concerned about cases like these, you usually want a statistical model to handle ambiguous examples such as "He works for Stuff Inc. I don't.", where an abbreviation is also the end of a sentence.
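To see the ambiguity concretely, here is the built-in rule-based `sentencizer` run on that exact sentence. The English tokenizer typically keeps "Inc." as a single token, so the punctuation rule has no boundary to fire on there; a trained pipeline (e.g. the parser or `senter` in a package like `en_core_web_sm`, which is not loaded in this sketch) resolves such cases statistically.

```python
import spacy

# Rule-based sentencizer on the ambiguous example from above.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("He works for Stuff Inc. I don't.")
sents = [sent.text for sent in doc.sents]
print(sents)

# No punctuation rule can know that "Inc." here also ends a sentence;
# that decision needs context, which is what a statistical component
# learns from annotated data.
```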

Answer selected by adrianeboyd
Labels: feat / sentencizer (Feature: Sentencizer, the rule-based sentence segmenter)