Sentence boundary detection in v3 #7624
-
IIRC, v2 builds sentences based on the dependency structure. Alternatively you could specify a rule-based […]

Ultimately, my goal is to correctly control whether sentence segmentation is enabled or disabled. What we used to do in v2 was the following (adapted to v3 by registering the component):

    import spacy
    from spacy.language import Language
    from spacy.tokens import Doc


    @Language.component("prevent_sbd")
    def spacy_prevent_sbd(doc: Doc) -> Doc:
        """Disables spaCy's sentence boundary detection."""
        # Mark every token as "not a sentence start" before the parser runs.
        for token in doc:
            token.is_sent_start = False
        return doc


    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("prevent_sbd", name="prevent-sbd", before="parser")
    print(nlp.pipe_names)
    doc = nlp("I like cookies. Do you like them?")
    for sent in doc.sents:
        print(sent)

and this still works for […]

In addition, I would assume that […]

EDIT: I was doing some tests, and it seems that including […]

    import spacy

    # include or exclude tok2vec to see the difference
    nlp = spacy.load("en_core_web_sm", exclude=["tok2vec"])
    doc = nlp("I like cookies!")
    sents = list(doc.sents)
    print(len(sents))
    # 1 with the tok2vec component, 2 without
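To try the same idea without a trained model, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`) with the rule-based `sentencizer` added after the blocking component. The component name `prevent_sbd_demo` and the helper `count_sentences` are made up for this illustration and are not part of spaCy.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("prevent_sbd_demo")  # hypothetical name, used only in this sketch
def prevent_sbd_demo(doc: Doc) -> Doc:
    """Mark every token as not starting a sentence."""
    for token in doc:
        token.is_sent_start = False
    return doc


def count_sentences(text: str, prevent: bool) -> int:
    """Count sentences with or without the blocking component (illustrative helper)."""
    # Assumption: a blank pipeline (tokenizer only), so no trained model is needed.
    nlp = spacy.blank("en")
    if prevent:
        # Runs first, so boundaries are already set when the sentencizer sees the doc.
        nlp.add_pipe("prevent_sbd_demo")
    nlp.add_pipe("sentencizer")
    return len(list(nlp(text).sents))


text = "I like cookies. Do you like them?"
print(count_sentences(text, prevent=False))
print(count_sentences(text, prevent=True))
```

With the default `sentencizer` settings I would expect this to print 2 and then 1, since the sentencizer should leave boundaries that were already set untouched.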
Replies: 1 comment 2 replies
-
We're still working on some additional pretty diagrams, but here's a new section in the docs about the structure of the pretrained pipelines: https://spacy.io/models#design

Excluding `tok2vec` breaks the parser, which is the only thing doing sentence segmentation by default in `en_core_web_sm`, and when it gets random input it often predicts a lot of roots. The `senter` model is included but is disabled by default and does not depend on `tok2vec`.

`sentencizer`, `senter`, and `parser` should only modify previously unset sentence boundaries (where `token.is_sent_start == None`).

What the underlying model is for `senter` doesn't really matter as long as the annotation it's setting in the end is `Token.is_sent…`