Tokenizer exceptions for Sentencizer #7218
Related: #4168
In your example from #7214, you could just add `...` to the sentencizer's punctuation characters:

```python
nlp.get_pipe("sentencizer").punct_chars.add("...")
```

This isn't 100% equivalent because the sentencizer looks for following punctuation like quotes before setting a boundary, but it's very similar.

As mentioned in issue #4168, instead of trying to make a more flexible rule-based sentence segmentation component, we wrote a very small statistical sentence segmenter component instead, called `senter`. You could try out the pretrained multi-language one with `xx_sent_ud_sm`. There is also a disabled `senter` component in the trained English pipelines:

```python
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"])
nlp.enable_pipe("senter")
```

It should be easy to train or update a `senter` component.
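As a rough usage sketch of the snippet above (the example text is made up; this assumes `en_core_web_sm` is installed):

```python
import spacy

# Load the trained English pipeline with only the statistical senter enabled,
# then iterate over the predicted sentence boundaries.
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"],
)
nlp.enable_pipe("senter")

doc = nlp("Dr. Smith arrived at 9 a.m. on Monday. He left an hour later.")
print([sent.text for sent in doc.sents])
# The statistical senter should avoid breaking after abbreviations like
# "Dr." or "a.m.", unlike a purely period-based rule.
```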
To clarify about adding exceptions: adding new exceptions to the default tokenizer settings in the core library is only discouraged for some languages given the training corpora used in the pretrained pipelines. The Polish training corpus is pretty unusual in splitting periods from abbreviations. It's impossible to have one default tokenizer configuration that's perfect for every task, so we've decided to have the library defaults follow the guidelines for the corpora used in the pretrained pipelines. For languages without pretrained pipelines, we're a bit more dependent on user contributions since we don't have particular corpus guidelines to follow, so we're generally happy to accept contributions that look general-purpose (e.g., common abbreviations vs. biomedical jargon).

Since we know that the library tokenizer defaults are not going to be perfect for every task, we've made it very easy to modify the tokenizer settings. To add new exceptions:

```python
nlp = spacy.blank("pl")
rules = dict(nlp.tokenizer.rules)
rules.update({"ds.": [{"ORTH": "ds."}]})
nlp.tokenizer.rules = rules
nlp.to_disk("/path/to/model")
reloaded_nlp = spacy.load("/path/to/model")
```

You can also update the rules in place with `tokenizer.add_special_case`. The tokenizer exceptions are saved as part of the tokenizer settings in the model directory, so the reloaded model has the saved settings, same as for the sentencizer.
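As a quick, illustrative check of the effect (the Polish example sentence and the expected behaviour are my own assumptions, not from the reply above):

```python
import spacy

# Add the "ds." exception, then run the rule-based sentencizer: because
# "ds." stays a single token, it is no longer treated as a sentence end.
nlp = spacy.blank("pl")
rules = dict(nlp.tokenizer.rules or {})
rules.update({"ds.": [{"ORTH": "ds."}]})
nlp.tokenizer.rules = rules
nlp.add_pipe("sentencizer")

doc = nlp("Dyrektor ds. sprzedaży przyszedł. Potem wyszedł.")
print([t.text for t in doc])              # "ds." should stay one token
print([sent.text for sent in doc.sents])  # no sentence break after "ds."
```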
We just want to split paragraphs into sentences.
We need to do this at scale and efficiently, for 100+ languages, with high throughput and low memory usage, and the Sentencizer is built for exactly this.
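For reference, a minimal Sentencizer setup looks roughly like this (an illustrative sketch, not the exact snippet from the original post; the Polish example text is made up):

```python
import spacy

# A blank pipeline with only the rule-based Sentencizer: fast and
# language-agnostic, but it splits on any sentence-final punctuation token.
nlp = spacy.blank("pl")
nlp.add_pipe("sentencizer")

doc = nlp("Spotkanie odbyło się wczoraj. Dyrektor ds. sprzedaży nie przyszedł.")
print([sent.text for sent in doc.sents])
# If the tokenizer splits the period off "ds.", the sentencizer will also
# (wrongly) start a new sentence there, which is the problem described below.
```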
The challenge is that each language has abbreviations that end with `.`, so sentences end up split in the middle, which is catastrophic for any downstream processing. spaCy's solution is to use a list of base exceptions as well as additional lists for English and the top dozen or so languages.
Unfortunately, for all the other languages, adding such a list is apparently discouraged.
Needless to say, a language like Polish has its own abbreviations, so the Sentencizer does not work properly for most of the languages it aims to support.
So for sentence segmentation to work normally for all languages, like it does in English and Spanish, we need a place to put these lists so they can be loaded by the Sentencizer, and/or a rule that any two lowercase alphabetic characters followed by a full stop are probably an abbreviation (a rough sketch of such a rule follows below).
My instinct says that this should be the default for Sentencizer, because almost nobody who uses Sentencizer would expect or prefer the current behaviour over this, and it would be a pity if many clients had to create and maintain their own lists for this.
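For what it's worth, here is a rough sketch of that heuristic as a custom spaCy component. The component name, the 1-3 character threshold, and the example text are all assumptions for illustration, not anything spaCy ships:

```python
import spacy
from spacy.language import Language

@Language.component("abbrev_aware_sentencizer")
def abbrev_aware_sentencizer(doc):
    # Treat a period as a sentence end unless the token before it looks
    # like a short lowercase abbreviation (1-3 lowercase characters).
    for i, token in enumerate(doc):
        if i == 0:
            token.is_sent_start = True
            continue
        prev = doc[i - 1]
        before_prev = doc[i - 2] if i >= 2 else None
        looks_like_abbrev = (
            before_prev is not None
            and before_prev.is_lower
            and len(before_prev.text) <= 3
        )
        token.is_sent_start = prev.text == "." and not looks_like_abbrev
    return doc

nlp = spacy.blank("pl")
nlp.add_pipe("abbrev_aware_sentencizer")
doc = nlp("Dyrektor ds. sprzedaży przyszedł. Potem wyszedł.")
print([sent.text for sent in doc.sents])
# Expected (if the tokenizer splits "ds." into "ds" + "."): no break after
# "ds.", one break before "Potem".
```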