The is_sent_start values are ternary (True/False/None), where None means that it is not yet known whether the token starts a sentence.

The first token in a doc is an exception: it should always return True, because False and None don't make sense as possible values there.

A 1-token doc is always considered to have sentence boundaries. For a doc with 2+ tokens, at least one token other than the first needs to have is_sent_start set to True or False for doc.sents not to raise this error.
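
To make the rule above concrete, here is a minimal sketch in plain Python. It is only a toy model of the check as described in this answer, not spaCy's actual implementation; the function name has_sentence_boundaries and the list-of-values representation are made up for illustration.

```python
from typing import List, Optional


def has_sentence_boundaries(sent_starts: List[Optional[bool]]) -> bool:
    """Toy model of the rule described above (hypothetical helper, not spaCy code).

    sent_starts holds one is_sent_start value per token: True, False, or None.
    """
    # A 1-token doc is always considered to have sentence boundaries.
    if len(sent_starts) == 1:
        return True
    # The first token always reports True, so it carries no information.
    # At least one later token must be explicitly True or False.
    # (Empty input is not covered by the description; this sketch returns False.)
    return any(value is not None for value in sent_starts[1:])


print(has_sentence_boundaries([True]))               # True
print(has_sentence_boundaries([True, None, None]))   # False
print(has_sentence_boundaries([True, None, False]))  # True
```

In this model, the second case is the one where doc.sents would raise: nothing after the first token has been marked either way, so the boundaries still count as unset.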

You can gradually add sentence boundaries with multiple pipeline components, but for doc.sents to recognize that you've finished setting all the sentence boundaries, you usually want the final sentence-set…

Answer selected by rafelafrance
Labels
feat / sentencizer Feature: Sentencizer (rule-based sentence segmenter)