Sentence boundary detection in v3 #7624

BramVanroy · 2021-03-31T14:23:32Z

IIRC, v2 derives sentence boundaries from the dependency parse; alternatively, you could add a rule-based sentencizer. Looking at the built-in pipeline components of v3, it seems that things have changed: a trainable senter is now available in addition to the sentencizer. However, it is not clear which component now takes care of sentence segmentation, and whether the dependency parser still plays a role in it. (The base class of the new SentenceRecognizer is Tagger, the POS tagger, not the dependency parser, which makes things more confusing.)

Ultimately, my goal is to reliably control whether sentence segmentation is enabled or disabled. What we used to do in v2 was the following (adapted to v3 by registering the component):

import spacy
from spacy import Language
from spacy.tokens import Doc


@Language.component("prevent_sbd")
def spacy_prevent_sbd(doc: Doc):
    """Disables spaCy's sentence boundary detection."""
    for token in doc:
        token.is_sent_start = False
    return doc


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("prevent_sbd", name="prevent-sbd", before="parser")
print(nlp.pipe_names)

doc = nlp("I like cookies. Do you like them?")
for sent in doc.sents:
    print(sent)

This still works for en_core_web_sm. The printed pipe names also do not include senter or sentencizer. So what I think is happening is that it depends on which components were trained and included in the model: because no senter was trained and no rule-based sentencizer was included, segmentation falls back to the dependency parse. Is that correct?

In addition, I would assume that spacy_prevent_sbd only works in case of the dependency-parse approach but not if a senter component is available. So to be safe, I'd probably want to disable any senter and sentencizer to ensure that no sentence segmentation is done by those components if they are present in the model.
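A minimal sketch of that "to be safe" step, assuming the components carry their default names senter and sentencizer. The helper name disable_sentence_components is made up for illustration, and a blank English pipeline stands in for a pretrained one so the snippet runs without downloading a model. It does not touch the parser, which would still need something like prevent_sbd in front of it.

```python
import spacy


def disable_sentence_components(nlp, names=("senter", "sentencizer")):
    """Disable the named components if they are active; return the names disabled.

    Hypothetical helper: the default names are an assumption and will not match
    pipelines where these components were added under custom names.
    """
    active = [name for name in names if name in nlp.pipe_names]
    for name in active:
        nlp.disable_pipe(name)
    return active


# Stand-in pipeline; with a pretrained pipeline you would pass the loaded nlp.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

disabled = disable_sentence_components(nlp)
print(disabled)        # names that were switched off
print(nlp.pipe_names)  # components that will still run
```

Disabled components stay listed in nlp.disabled, so they can be switched back on later with nlp.enable_pipe.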

EDIT: I was doing some tests, and it seems that including tok2vec in the pipeline is very important for sentence segmentation. In the following example, when tok2vec is excluded, the sentence is split into "I like cookies" and "!", which is strange, even for a rule-based system. An explanation for this is welcome too.

import spacy
# in/exclude tok2vec to see the difference
nlp = spacy.load("en_core_web_sm", exclude=["tok2vec"])
doc = nlp("I like cookies!")
sents = list(doc.sents)
print(len(sents))
# 1 with the tok2vec component, 2 without

Answered by adrianeboyd

adrianeboyd · 2021-03-31T18:20:22Z

We're still working on some additional pretty diagrams, but here's a new section in the docs about the structure of the pretrained pipelines:

https://spacy.io/models#design

Excluding tok2vec breaks the parser, which is the only thing doing sentence segmentation by default in en_core_web_sm, and when it gets random input it often predicts a lot of roots. The senter model is included but is disabled by default and does not depend on tok2vec.

sentencizer, senter, and parser should only modify previously unset sentence boundaries (where token.is_sent_start is None).
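A small sketch of that behaviour, using only the rule-based sentencizer on a blank English pipeline so that no trained model is needed. The component name preset_no_boundaries is hypothetical and plays the same role as prevent_sbd in the question.

```python
import spacy
from spacy.language import Language


@Language.component("preset_no_boundaries")
def preset_no_boundaries(doc):
    """Mark every token as not starting a sentence (hypothetical demo component)."""
    for token in doc:
        token.is_sent_start = False
    return doc


text = "I like cookies. Do you like them?"

# Sentencizer alone: boundaries are unset, so it is free to set them.
plain = spacy.blank("en")
plain.add_pipe("sentencizer")
n_plain = len(list(plain(text).sents))

# Same sentencizer, but running after a component that already set every boundary.
guarded = spacy.blank("en")
guarded.add_pipe("preset_no_boundaries")
guarded.add_pipe("sentencizer")
n_guarded = len(list(guarded(text).sents))

print(n_plain, n_guarded)
```

With the boundaries preset, the sentencizer leaves them alone and the whole text comes back as a single sentence.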

What the underlying model is for senter doesn't really matter as long as the annotation it's setting in the end is Token.is_sent_start. It could have been a different kind of model, but in this case it's a teeny tagger with two hard-coded tags for sent-start and not-sent-start and that's all it predicts and sets.
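If the goal is to let the senter do the segmentation instead of the parser, one possible approach is to disable one and enable the other. The sketch below assumes the default component names parser and senter; prefer_senter is a made-up helper, and the blank pipeline with untrained components is only a stand-in that mimics the layout described above (parser active, senter present but disabled).

```python
import spacy


def prefer_senter(nlp):
    """Disable the parser and enable the senter, when each is present.

    Hypothetical helper; assumes the default names "parser" and "senter".
    """
    if "parser" in nlp.pipe_names:
        nlp.disable_pipe("parser")
    if "senter" in nlp.disabled:
        nlp.enable_pipe("senter")
    return nlp.pipe_names


# Stand-in layout. These components are untrained, so this pipeline is only
# inspected here and never run on text.
nlp = spacy.blank("en")
nlp.add_pipe("parser")
nlp.add_pipe("senter")
nlp.disable_pipe("senter")

before = list(nlp.pipe_names)
after = prefer_senter(nlp)
print(before, after)
```

With a pretrained pipeline you would pass the loaded nlp object instead. Keep in mind that disabling the parser also removes the dependency annotation that other code may rely on.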

2 replies

BramVanroy Mar 31, 2021
Author

Okay, that was indeed not clear to me, and it is very important to know. For instance, the tok2vec page gives no information about senter or sentence boundaries, but as illustrated, tok2vec does play a crucial role in that process. I suggested this before in a different issue, but perhaps the individual component pages could give more information about their downstream impact and what happens when they are excluded.

I imagine that it is very hard for you to document everything clearly, especially now with even more customizability in v3. Some of these things can lead to unexpected results, so documentation is incredibly important. So I am very glad that we have this discussion board now, and quick answers from you!

adrianeboyd Apr 1, 2021

Very flexible/configurable things are indeed hard to document.

The problem is that tok2vec itself is just a component that saves vectors in doc.tensor and it's the whole pipeline configuration that determines what happens with this information. The parser could listen to a tok2vec component or it could have its own internal tok2vec model that's not a separate component. There's nothing about tok2vec itself that's related to parsing or sentence boundaries. You have to look at the config to understand how a pipeline is set up.
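As a rough illustration of "looking at the config", the sketch below walks a config dictionary and reports which components have a listener as their tok2vec layer. The helper name and the example config are hypothetical (the excerpt is not copied from any real pipeline), and the helper only checks the common model.tok2vec location, so components that nest their tok2vec elsewhere would be missed.

```python
def find_tok2vec_listeners(config):
    """Return names of components whose model.tok2vec is a Tok2VecListener."""
    listeners = []
    for name, component in config.get("components", {}).items():
        tok2vec = component.get("model", {}).get("tok2vec", {})
        architecture = tok2vec.get("@architectures", "")
        if architecture.startswith("spacy.Tok2VecListener"):
            listeners.append(name)
    return listeners


# Hypothetical config excerpt, for illustration only.
example_config = {
    "components": {
        "tok2vec": {"factory": "tok2vec", "model": {"@architectures": "spacy.Tok2Vec.v2"}},
        "parser": {
            "factory": "parser",
            "model": {
                "@architectures": "spacy.TransitionBasedParser.v2",
                "tok2vec": {"@architectures": "spacy.Tok2VecListener.v1", "upstream": "*"},
            },
        },
        "senter": {
            "factory": "senter",
            "model": {
                "@architectures": "spacy.Tagger.v1",
                "tok2vec": {"@architectures": "spacy.HashEmbedCNN.v1"},
            },
        },
        "sentencizer": {"factory": "sentencizer"},
    }
}

listeners = find_tok2vec_listeners(example_config)
print(listeners)
```

On a loaded pipeline the same helper could be called as find_tok2vec_listeners(nlp.config); any component it reports would be affected by excluding the shared tok2vec.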

There's some static analysis available from nlp.analyze_pipes(), but even this is hard for a lot of components like attribute ruler and lemmatizer. The attribute ruler can potentially rely on or set any token attribute and the lemmatizer sets lemmas, but the annotation that it requires varies a lot.
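A short sketch of what that static analysis looks like, again on a blank English pipeline with only a sentencizer so that it runs without a trained model. It reads the attributes the component declares that it assigns.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes returns a dict; "summary" maps component names to their declared metadata.
analysis = nlp.analyze_pipes()
assigns = analysis["summary"]["sentencizer"]["assigns"]
print(assigns)
```

Passing pretty=True prints the same information as a table. The result only reflects what each component declares, which is why components like the attribute ruler and lemmatizer are hard to capture this way.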

Also note that the config you get from spacy init config is not the same as what's in the pretrained pipelines, so the link above is really just relevant for the provided pretrained pipelines, not v3 pipelines in general.

When we were thinking about how to design the pretty diagrams, we were thinking that it would be nice to generate them, at least partially, from the config files. It could be kind of like this, but for configs: https://github.com/pmbaumgartner/spacy-project-viz . The dependencies in the configs are not as explicit as in the project files, though. You can have upstream = * for the listener and you have to inspect the attribute ruler patterns to know what it depends on and sets, etc. The lemmatizer is still kind of impossible.
