Setting only the 0 index in a sentencizer #12446

rafelafrance · 2023-03-19T19:17:42Z

On the off chance you aren't aware of this.

I am using a custom sentencizer that itself runs multiple spaCy sentencizers at a time, to see if I can improve sentence splitting for technical notations, bibliographies, etc. by only breaking where they both agree.

Setting only doc[0].is_sent_start = True, without setting is_sent_start to True or False on any token in doc[1:], results in an error when you use the sentencizer.

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline...

When the document is a single sentence, my splitter only sets a single flag at doc[0], and I get the above error.
It works fine if any other token is set to True.

You can easily get around this by first setting doc[-1].is_sent_start = False before setting anything else. Surprisingly, this works on a 1-token document too. (See the first line of code in __call__.)

The code may make it clearer what I am doing.

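import spacy
from spacy.language import Language
from spacy.tokens import Doc

# SENTENCE is assumed to be a string constant holding the component name, defined elsewhere.
@Language.factory(SENTENCE)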
class Sentence:
    def __init__(self, nlp: Language, name: str, base_model: str = "en_core_web_sm"):
        self.nlp = nlp
        self.name = name

        self.nlp_d = spacy.load(base_model, exclude=["ner"])

        self.nlp_s = spacy.load(base_model, exclude=["parser", "ner"])
        self.nlp_s.enable_pipe("senter")

    def __call__(self, doc: Doc) -> Doc:
        # Workaround: setting only doc[0].is_sent_start leaves the boundaries
        # "unset", so explicitly mark the last token first
        doc[-1].is_sent_start = False

        doc_d = self.nlp_d(doc.text)
        doc_s = self.nlp_s(doc.text)

        starts_d = {s.start for s in doc_d.sents}
        starts_s = {s.start for s in doc_s.sents}

        agree = starts_d & starts_s

        for i in agree:
            doc[i].is_sent_start = True

        return doc

Answered by adrianeboyd

adrianeboyd · 2023-03-20T08:02:53Z

The is_sent_start values are ternary with True/False/None, where None means that you don't know whether this is a sentence boundary yet.

There are exceptions for the first token in a doc, which should always return True because False/None don't make sense as possible values.

A 1-token doc is considered to always have sentence boundaries. For a doc with 2+ tokens, at least one token other than the first token needs to have is_sent_start = True/False for doc.sents not to raise this error.
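
The rule above can be modelled without spaCy. The sketch below is a simplified stand-in for the check behind doc.sents, not spaCy's actual implementation. The function name has_sentence_boundaries and the plain list of flags are assumptions made for illustration only.

```python
from typing import List, Optional


def has_sentence_boundaries(flags: List[Optional[bool]]) -> bool:
    """Model of the rule described above (hypothetical helper, not a spaCy API).

    `flags` holds one ternary is_sent_start value per token:
    True, False, or None (unknown).
    """
    # A 1-token doc is always treated as having sentence boundaries.
    if len(flags) == 1:
        return True
    # For 2+ tokens the first token is ignored: at least one later
    # token must be explicitly True or False.
    return any(flag is not None for flag in flags[1:])


# Only token 0 set: boundaries still count as unset.
only_first = has_sentence_boundaries([True, None, None])
# Marking the last token False is enough to count as set.
last_false = has_sentence_boundaries([True, None, False])
# A single-token doc always passes, whatever its flag.
single = has_sentence_boundaries([None])
print(only_first, last_false, single)  # False True True
```

This matches what was reported in the question: setting doc[-1].is_sent_start = False first is enough to avoid the error, including on a 1-token doc.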

You can gradually add sentence boundaries with multiple pipeline components, but for doc.sents to recognize that you've finished setting all the sentence boundaries, you usually want the final sentence-setting component to set either True or False for every token, so something like this:

        for i in range(len(doc)):
            if i in agree:
                doc[i].is_sent_start = True
            elif doc[i].is_sent_start is None:
                doc[i].is_sent_start = False

rafelafrance · 2023-03-20T15:51:43Z

Got it. Thank you.

Using:

        for i in range(len(doc)):
            doc[i].is_sent_start = i in agree
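
The two loops in this thread differ in one respect that may matter when several components set boundaries in turn: the longer one only fills in tokens that are still None, while the shorter one overwrites every token. A minimal sketch on plain Python lists follows. The helper names fill_unset and overwrite_all are hypothetical, and the list of ternary flags stands in for doc[i].is_sent_start.

```python
from typing import List, Optional, Set


def fill_unset(flags: List[Optional[bool]], agree: Set[int]) -> List[bool]:
    """Mirror of the longer loop: keep values set by earlier components."""
    result = []
    for i, flag in enumerate(flags):
        if i in agree:
            result.append(True)   # agreed sentence start
        elif flag is None:
            result.append(False)  # unknown becomes "not a start"
        else:
            result.append(flag)   # earlier True/False is preserved
    return result


def overwrite_all(flags: List[Optional[bool]], agree: Set[int]) -> List[bool]:
    """Mirror of the shorter loop: every token is reset from `agree`."""
    return [i in agree for i in range(len(flags))]


# Token 2 was already flagged as a start by an earlier (hypothetical) component.
earlier = [True, None, True, None]
agree = {0}

print(fill_unset(earlier, agree))     # [True, False, True, False]
print(overwrite_all(earlier, agree))  # [True, False, False, False]
```

If this component is the only one setting boundaries, both versions give the same result.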
