Default sentence boundary detection during training in en_core_web_sm #8693
-
If I am correct, v3 came with some new ways to do sentence segmentation. Now there are a number of ways: the rule-based sentencizer, the statistical senter, and the dependency parser.
My question is about the training config of en_core_web_sm. So the senter is included in the pipeline but disabled by default: is it trained together with the other components, and do the different components interfere with each other's sentence boundaries?
Thanks
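For concreteness, here is how I understand the options so far (a minimal sketch based on the docs; the exclude / enable_pipe usage for the senter is my assumption, and it assumes en_core_web_sm is installed):

```python
import spacy
from spacy.lang.en import English

text = "This is one sentence. Here is another one."

# 1) Rule-based sentencizer: no statistical model, just punctuation rules.
nlp_rules = English()
nlp_rules.add_pipe("sentencizer")
print([s.text for s in nlp_rules(text).sents])

# 2) Statistical senter: shipped with en_core_web_sm but disabled by default.
nlp_senter = spacy.load("en_core_web_sm", exclude=["parser"])
nlp_senter.enable_pipe("senter")
print([s.text for s in nlp_senter(text).sents])

# 3) Dependency parser: the default source of sentence boundaries in en_core_web_sm.
nlp_parser = spacy.load("en_core_web_sm")
print([s.text for s in nlp_parser(text).sents])
```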
-
Disabled components aren't trained if they're disabled in the config used for training. Training it separately and then disabling it is part of the internal collate script (and part of why we're not using `spacy assemble` directly, which would do 90% of what we need). We want to ship `senter` with the pipeline but leave the default as the parser because the quality is higher.
I had a feeling I'd answered some of this before: #7624 (comment)
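For reference, you can see how this is shipped directly from Python (a quick sketch, assuming a standard v3 install of en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Active components: the parser is here and provides sentence boundaries by default.
print(nlp.pipe_names)

# Components that are shipped but switched off; 'senter' should show up here.
print(nlp.disabled)

# All components in the pipeline, whether enabled or not.
print(nlp.component_names)
```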
None of them clobber the annotations from previous components and the parser respects existing sentence boundaries.
The parser is by far the slowest, but here it probably makes sense to run evaluations with your own pipelines / data. The senter can be even faster (at slightly reduced accuracy) if you further reduce the parameters in the model.
A related post: #7218 (comment)
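If you want to compare speed on your own data, a rough end-to-end benchmark along these lines is usually enough (a sketch; the sample texts are placeholders, and the loaded pipelines run all their remaining components, which is normally what matters in practice):

```python
import time

import spacy
from spacy.lang.en import English

# Replace with a sample of your own documents.
texts = ["This is a sentence. Here is another, slightly longer one."] * 2000

def build_sentencizer():
    # Rule-based only, no statistical model.
    nlp = English()
    nlp.add_pipe("sentencizer")
    return nlp

def build_senter():
    # Drop the parser and switch on the shipped senter.
    nlp = spacy.load("en_core_web_sm", exclude=["parser"])
    nlp.enable_pipe("senter")
    return nlp

def build_parser():
    # Stock pipeline: the parser sets sentence boundaries.
    return spacy.load("en_core_web_sm")

for name, build in [("sentencizer", build_sentencizer),
                    ("senter", build_senter),
                    ("parser", build_parser)]:
    nlp = build()
    start = time.perf_counter()
    n_sents = sum(len(list(doc.sents)) for doc in nlp.pipe(texts))
    elapsed = time.perf_counter() - start
    print(f"{name}: {n_sents} sentences in {elapsed:.2f}s")
```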