Setting only the 0 index in a sentencizer #12446
On the off chance you aren't aware of this: I am using a custom sentencizer that itself runs multiple spaCy sentencizers at a time, to see if I can improve sentence splitting for technical notations, bibliographies, etc. by only breaking when they both agree.

Setting only the 0 index by itself does not work. When the document was a single sentence, my splitter only added a single flag, at index 0. You can easily get around this by first setting `doc[-1].is_sent_start = False`. The code may make it clearer what I am doing:

    import spacy
    from spacy.language import Language
    from spacy.tokens import Doc

    # SENTENCE is the factory name; it is defined elsewhere in my code.
    @Language.factory(SENTENCE)
    class Sentence:
        def __init__(self, nlp: Language, name: str, base_model: str = "en_core_web_sm"):
            self.nlp = nlp
            self.name = name
            # Pipeline whose sentence boundaries come from the dependency parser.
            self.nlp_d = spacy.load(base_model, exclude=["ner"])
            # Pipeline whose sentence boundaries come from the "senter" component.
            self.nlp_s = spacy.load(base_model, exclude=["parser", "ner"])
            self.nlp_s.enable_pipe("senter")

        def __call__(self, doc: Doc) -> Doc:
            # Workaround needed for spacy because setting 0 by itself does not work
            doc[-1].is_sent_start = False
            doc_d = self.nlp_d(doc.text)
            doc_s = self.nlp_s(doc.text)
            starts_d = {s.start for s in doc_d.sents}
            starts_s = {s.start for s in doc_s.sents}
            # Keep only the sentence starts that both pipelines agree on.
            agree = starts_d & starts_s
            for i in agree:
                doc[i].is_sent_start = True
            return doc
Replies: 2 comments
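The `is_sent_start` values are ternary with `True`/`False`/`None`, where `None` means that you don't know whether this is a sentence boundary yet.

There is an exception for the first token in a doc, which should always return `True` because `False`/`None` don't make sense as possible values.

A 1-token doc is considered to always have sentence boundaries. For a doc with 2+ tokens, at least one token other than the first token needs to have `is_sent_start = True/False` for `doc.sents` not to raise this error.

You can gradually add sentence boundaries with multiple pipeline components, but for `doc.sents` to recognize that you've finished setting all the sentence boundaries, you usually want the final sentence-set…

    for i in range(len(doc)):
        if i in agree:
            doc[i].is_sent_start = True
        elif doc[i].is_sent_start is None:
            doc[i].is_sent_start = False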
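
The rule described in this reply can be modeled in plain Python, which shows why setting only index 0 is not enough and why the `doc[-1].is_sent_start = False` workaround helps. This is a minimal sketch, not spaCy's implementation: `has_sentence_boundaries` is a hypothetical helper, and the lists stand in for the per-token `is_sent_start` values.

```python
from typing import List, Optional

# True / False / None (unknown), mirroring the ternary is_sent_start values.
Flag = Optional[bool]


def has_sentence_boundaries(flags: List[Flag]) -> bool:
    """Hypothetical model of the rule as stated in the reply above."""
    # A 1-token doc is considered to always have sentence boundaries.
    if len(flags) == 1:
        return True
    # Otherwise at least one token other than the first must be True or False.
    return any(flag is not None for flag in flags[1:])


# Only index 0 set: every other token is still unknown.
only_first = [True, None, None, None]
# Same doc after the workaround of marking the last token as False.
with_workaround = [True, None, None, False]

print(has_sentence_boundaries(only_first))       # False
print(has_sentence_boundaries(with_workaround))  # True
print(has_sentence_boundaries([True]))           # True
```

Under this model, a single explicit `True` or `False` past the first token is enough to count as having boundaries, which matches the behavior reported in the original post.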
Got it. Thank you. Using:

    for i in range(len(doc)):
        doc[i].is_sent_start = i in agree
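
The one-line loop and the loop from the reply give the same result when no earlier component has set any flags, but they differ when some flags are already `True` or `False`. The sketch below compares them on plain lists. `fill_unknown` and `overwrite_all` are hypothetical names used only for illustration, and the sample flags are made up.

```python
from typing import List, Optional, Set

# True / False / None (unknown), mirroring the ternary is_sent_start values.
Flag = Optional[bool]


def fill_unknown(flags: List[Flag], agree: Set[int]) -> List[Flag]:
    """Loop from the reply: mark agreed starts, turn remaining None into False."""
    out = list(flags)
    for i in range(len(out)):
        if i in agree:
            out[i] = True
        elif out[i] is None:
            out[i] = False
    return out


def overwrite_all(flags: List[Flag], agree: Set[int]) -> List[Flag]:
    """Loop from the follow-up: every flag becomes `i in agree`."""
    return [i in agree for i in range(len(flags))]


# Assumed input: an earlier component already marked token 2 as a sentence start.
earlier = [True, None, True, None, None]
agree = {0, 3}

print(fill_unknown(earlier, agree))   # [True, False, True, True, False]
print(overwrite_all(earlier, agree))  # [True, False, False, True, False]
```

If this component is meant to be the only source of sentence boundaries, overwriting everything is the simpler choice. If it should build on boundaries set earlier in the pipeline, the version that only fills in `None` keeps them.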