We're still working on some additional pretty diagrams, but here's a new section in the docs about the structure of the pretrained pipelines:

https://spacy.io/models#design

Excluding tok2vec breaks the parser, which is the only component doing sentence segmentation by default in en_core_web_sm. Without tok2vec the parser effectively gets random input, and it then often predicts a lot of roots, which shows up as a lot of spurious sentence boundaries. The senter model is included in the pipeline but disabled by default, and it does not depend on tok2vec.
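
As a rough sketch of how you could switch sentence segmentation over to senter instead of the parser: the helper name load_with_senter is made up for illustration, and this assumes spaCy v3 with en_core_web_sm installed. It only excludes the parser and leaves tok2vec in place for the other components.

```python
def load_with_senter(model_name: str = "en_core_web_sm"):
    """Load a pretrained pipeline without the parser and enable senter.

    Hypothetical helper for illustration. spaCy is imported lazily so the
    function can be defined even where spaCy is not installed.
    """
    import spacy

    # Drop the parser entirely, so it no longer sets sentence boundaries.
    nlp = spacy.load(model_name, exclude=["parser"])
    # senter ships with the pipeline but is disabled by default.
    nlp.enable_pipe("senter")
    return nlp


# Usage (requires spaCy and the model to be installed):
# nlp = load_with_senter()
# doc = nlp("This is a sentence. This is another one.")
# print([sent.text for sent in doc.sents])
```

With this setup the sentence boundaries come from senter alone, so doc.sents no longer depends on the parser's output.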

The sentencizer, senter, and parser components should only modify previously unset sentence boundaries (tokens where token.is_sent_start is None).
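
To make the "only fill in unset boundaries" rule concrete, here is a small pure-Python model of it. This is not spaCy's actual implementation; the function name fill_unset_boundaries and the list-based representation are assumptions for illustration, with None standing for an unset is_sent_start value.

```python
from typing import List, Optional


def fill_unset_boundaries(
    existing: List[Optional[bool]], predicted: List[bool]
) -> List[Optional[bool]]:
    """Apply predictions only where the existing value is unset (None).

    Values already set to True or False by an earlier component are kept.
    """
    if len(existing) != len(predicted):
        raise ValueError("existing and predicted must have the same length")
    return [pred if cur is None else cur for cur, pred in zip(existing, predicted)]


# An earlier component has already set tokens 0 and 3; the rest are unset.
existing = [True, None, None, False, None]
# A later component predicts a boundary value for every token.
predicted = [True, False, True, True, False]

merged = fill_unset_boundaries(existing, predicted)
print(merged)  # [True, False, True, False, False]
```

Note that token 3 stays False even though the later component predicted True, because its boundary was already set.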

What the underlying model is for senter doesn't really matter as long as the annotation it's setting in the end is Token.is_sent…

@BramVanroy
@adrianeboyd

Answer selected by svlandeg
Labels
feat / sentencizer Feature: Sentencizer (rule-based sentence segmenter)