Custom sentence function not working? #9974

djmechanic · 2022-01-03T20:33:07Z

[Edited to add info regarding tokenizer special case]

On spaCy 3.1.2 I've written a custom sentence function based on the example given at https://spacy.io/usage/linguistic-features#sbd-custom. I want each sentence to start on one of the special case tokens YOLO. or YODO., so the next token after the period (which I took to be at i + 2) should obviously NOT start a new sentence. I therefore set its is_sent_start to False, as described in the docs.

# Special case tokens include the period
nlp.tokenizer.add_special_case("YOLO.", [{spacy.attrs.ORTH: "YOLO."}])
nlp.tokenizer.add_special_case("YODO.", [{spacy.attrs.ORTH: "YODO."}])

Next, my custom sentence function, which should start a sentence on each special case token:

for token in doc[:-1]:
    # If this token is a new sentence, next token is not
    if token.text in ["YOLO.", "YODO."]:
        doc[token.i + 1].is_sent_start = True
        doc[token.i + 2].is_sent_start = False
        continue

However, when I run my pipeline, that is exactly what does not happen. The YOLO./YODO. tokens are picked up correctly, but the token after each one still starts a new sentence. See the output below (not my real data, of course):

"sents": [
	"YOLO.",
	"BLA BLA",  # <-- WRONG! Naughty pipeline!
	"YODO.",
	"HUBBA BUBBA"  # <-- WRONG! Naughty pipeline!
],

Any idea why the pipeline is not respecting the False I set on the next token?

Answered by djmechanic

Jan 4, 2022

Solved: the problem was indexing.

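In the example given in the docs, the token following the ellipsis should start the new sentence, so the docs set is_sent_start on token.i + 1. In my case the matched token itself (YOLO. or YODO.) should start the sentence and the token right after it should not, so both indices shift down by one: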

    doc[token.i].is_sent_start = True
    doc[token.i + 1].is_sent_start = False

Now it works.
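For anyone landing here later, the following is a minimal end-to-end sketch of the corrected function wrapped as a pipeline component. It is an illustration under assumptions, not my real pipeline: the component name "starter_boundaries" and the set SENTENCE_STARTERS are made up for this example, and spacy.blank("en") is used only so that no trained model is needed. In a pipeline that has a parser, the docs example adds the custom component with before="parser".

```python
import spacy
from spacy.attrs import ORTH
from spacy.language import Language

# Hypothetical name for the tokens that should open a sentence.
SENTENCE_STARTERS = {"YOLO.", "YODO."}


@Language.component("starter_boundaries")  # hypothetical component name
def starter_boundaries(doc):
    # Skip the last token so that token.i + 1 is always a valid index.
    for token in doc[:-1]:
        if token.text in SENTENCE_STARTERS:
            # The matched token itself starts the sentence ...
            doc[token.i].is_sent_start = True
            # ... and the token right after it must not.
            doc[token.i + 1].is_sent_start = False
    return doc


# Assumption: a blank English pipeline, so there is no parser to compete with.
nlp = spacy.blank("en")

# Special case tokens include the period, as in the question.
nlp.tokenizer.add_special_case("YOLO.", [{ORTH: "YOLO."}])
nlp.tokenizer.add_special_case("YODO.", [{ORTH: "YODO."}])
nlp.add_pipe("starter_boundaries")

doc = nlp("YOLO. BLA BLA YODO. HUBBA BUBBA")
sents = [sent.text for sent in doc.sents]
print(sents)
```

With the corrected indices, each special case token should stay in the same sentence as the words that follow it instead of being split off on its own.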
