If you already have working logic that includes rules, instead of using the Sentencizer you can create a small custom component that assigns is_sent_start to all tokens in a Doc. The sentencizer is only for very simple punctuation-based sentence segmentation.
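A minimal sketch of such a component, assuming spaCy v3: the component name `custom_sentencizer` and the uppercase-after-punctuation rule are illustrative placeholders for your own logic, not a recommended rule set.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("custom_sentencizer")
def custom_sentencizer(doc: Doc) -> Doc:
    # Assign is_sent_start explicitly for every token, so no boundary
    # is left undecided for downstream components.
    for i, token in enumerate(doc):
        if i == 0:
            doc[i].is_sent_start = True
        elif doc[i - 1].text in (".", "!", "?") and not token.text[0].islower():
            # Hypothetical rule: start a new sentence after sentence-final
            # punctuation when the next token is not lowercase.
            doc[i].is_sent_start = True
        else:
            doc[i].is_sent_start = False
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("custom_sentencizer")

doc = nlp("This is one. Here is another.")
print([sent.text for sent in doc.sents])
```

Because the component sets a value for every token, `doc.sents` is fully determined by your rules; if it runs before a parser, the parser will respect the boundaries it sets.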

We don't want to split on abbreviations (e.g. Govt., Inc., etc.) or bullet points (a., b)., etc.).

Note that if you're concerned about cases like these, you usually want a statistical model to handle ambiguous examples such as "He works for Stuff Inc. I don't.", where an abbreviation is also the end of a sentence.
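To see the ambiguity concretely, here is the built-in rule-based `sentencizer` run on that exact sentence. The English tokenizer typically keeps "Inc." as a single token, so the punctuation rule has no boundary to fire on there; a trained pipeline (e.g. the parser or `senter` in a package like `en_core_web_sm`, which is not loaded in this sketch) resolves such cases statistically.

```python
import spacy

# Rule-based sentencizer on the ambiguous example from above.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("He works for Stuff Inc. I don't.")
sents = [sent.text for sent in doc.sents]
print(sents)

# No punctuation rule can know that "Inc." here also ends a sentence;
# that decision needs context, which is what a statistical component
# learns from annotated data.
```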

Answer selected by adrianeboyd
Labels: feat / sentencizer (Feature: Sentencizer, the rule-based sentence segmenter)