Tokenizer exceptions for Sentencizer #7218
Related: #4168
In your example from #7214, you could just add `...` to the sentencizer's punctuation characters:

```python
nlp.get_pipe("sentencizer").punct_chars.add("...")
```

This isn't 100% equivalent because the sentencizer looks for following punctuation like quotes before setting a boundary, but it's very similar.

As mentioned in issue #4168, instead of trying to make a more flexible rule-based sentence segmentation component, we wrote a very small statistical sentence segmenter component instead, called `senter`. You could try out the pretrained multi-language one with `xx_sent_ud_sm`. There is also a disabled `senter` component in the trained English pipelines:

```python
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"])
nlp.enable_pipe("senter")
```

It should be easy to train or update a `senter` component.
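As a rough usage sketch of the snippet above (the example text is made up; this assumes `en_core_web_sm` is installed):

```python
import spacy

# Load the trained English pipeline with only the statistical senter enabled,
# then iterate over the predicted sentence boundaries.
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"],
)
nlp.enable_pipe("senter")

doc = nlp("Dr. Smith arrived at 9 a.m. on Monday. He left an hour later.")
print([sent.text for sent in doc.sents])
# The statistical senter should avoid breaking after abbreviations like
# "Dr." or "a.m.", unlike a purely period-based rule.
```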
To clarify about adding exceptions: adding new exceptions to the default tokenizer settings in the core library is only discouraged for some languages given the training corpora used in the pretrained pipelines. The Polish training corpus is pretty unusual in splitting periods from abbreviations. It's impossible to have one default tokenizer configuration that's perfect for every task, so we've decided to have the library defaults follow the guidelines for the corpora used in the pretrained pipelines. For languages without pretrained pipelines, we're a bit more dependent on user contributions since we don't have particular corpus guidelines to follow, so we're generally happy to accept contributions that look general-purpose (e.g., common abbreviations vs. biomedical jargon).

Since we know that the library tokenizer defaults are not going to be perfect for every task, we've made it very easy to modify the tokenizer settings. To add new exceptions:

```python
nlp = spacy.blank("pl")
rules = dict(nlp.tokenizer.rules)
rules.update({"ds.": [{"ORTH": "ds."}]})
nlp.tokenizer.rules = rules
nlp.to_disk("/path/to/model")
reloaded_nlp = spacy.load("/path/to/model")
```

You can also update the rules in place with `tokenizer.add_special_case`. The tokenizer exceptions are saved as part of the tokenizer settings in the model directory, so the reloaded model has the saved settings, same as for the sentencizer.
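As a quick, illustrative check of the effect (the Polish example sentence and the expected behaviour are my own assumptions, not from the reply above):

```python
import spacy

# Add the "ds." exception, then run the rule-based sentencizer: because
# "ds." stays a single token, it is no longer treated as a sentence end.
nlp = spacy.blank("pl")
rules = dict(nlp.tokenizer.rules or {})
rules.update({"ds.": [{"ORTH": "ds."}]})
nlp.tokenizer.rules = rules
nlp.add_pipe("sentencizer")

doc = nlp("Dyrektor ds. sprzedaży przyszedł. Potem wyszedł.")
print([t.text for t in doc])              # "ds." should stay one token
print([sent.text for sent in doc.sents])  # no sentence break after "ds."
```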
We just want to split paragraphs into sentences.
We need to do this at scale and efficiently, for 100+ languages, with high throughput and low memory usage, and the Sentencizer is built for exactly this.
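For reference, a minimal Sentencizer setup looks roughly like this (an illustrative sketch, not the exact snippet from the original post; the Polish example text is made up):

```python
import spacy

# A blank pipeline with only the rule-based Sentencizer: fast and
# language-agnostic, but it splits on any sentence-final punctuation token.
nlp = spacy.blank("pl")
nlp.add_pipe("sentencizer")

doc = nlp("Spotkanie odbyło się wczoraj. Dyrektor ds. sprzedaży nie przyszedł.")
print([sent.text for sent in doc.sents])
# If the tokenizer splits the period off "ds.", the sentencizer will also
# (wrongly) start a new sentence there, which is the problem described below.
```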
The challenge is that each language has abbreviations that end with `.`, so sentences end up split in the middle, which is catastrophic for any downstream processing. spaCy's solution is to use a list of base exceptions as well as additional lists for English and the top dozen or so languages.
Unfortunately, for all the other languages, adding such a list is apparently discouraged.
Needless to say, a language like Polish has its own abbreviations, so the Sentencizer does not work properly for most of the languages it aims to support.
So for sentence segmentation to work normally for all languages, like it does in English and Spanish, we need a place to put these lists so they can be loaded by the Sentencizer, and/or a rule that any two lowercase alphabetic characters followed by a full stop are probably an abbreviation (a rough sketch of such a rule follows below).
My instinct says that this should be the default for Sentencizer, because almost nobody who uses Sentencizer would expect or prefer the current behaviour over this, and it would be a pity if many clients had to create and maintain their own lists for this.
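For what it's worth, here is a rough sketch of that heuristic as a custom spaCy component. The component name, the 1-3 character threshold, and the example text are all assumptions for illustration, not anything spaCy ships:

```python
import spacy
from spacy.language import Language

@Language.component("abbrev_aware_sentencizer")
def abbrev_aware_sentencizer(doc):
    # Treat a period as a sentence end unless the token before it looks
    # like a short lowercase abbreviation (1-3 lowercase characters).
    for i, token in enumerate(doc):
        if i == 0:
            token.is_sent_start = True
            continue
        prev = doc[i - 1]
        before_prev = doc[i - 2] if i >= 2 else None
        looks_like_abbrev = (
            before_prev is not None
            and before_prev.is_lower
            and len(before_prev.text) <= 3
        )
        token.is_sent_start = prev.text == "." and not looks_like_abbrev
    return doc

nlp = spacy.blank("pl")
nlp.add_pipe("abbrev_aware_sentencizer")
doc = nlp("Dyrektor ds. sprzedaży przyszedł. Potem wyszedł.")
print([sent.text for sent in doc.sents])
# Expected (if the tokenizer splits "ds." into "ds" + "."): no break after
# "ds.", one break before "Potem".
```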