By default, the transformer component uses overlapping strided spans (see: https://spacy.io/api/transformer#span_getters), so you can train and predict on longer texts without running into the fixed maximum length of the underlying transformer model.
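
To make the idea concrete, here is a minimal sketch of how overlapping strided windows can cover a long token sequence. It is plain Python and not the spacy-transformers implementation; the function name `strided_windows` and the window/stride values are made up for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def strided_windows(tokens: Sequence[T], window: int, stride: int) -> List[Sequence[T]]:
    """Cut `tokens` into windows of at most `window` items, starting every `stride` items.

    Hypothetical helper for illustration only. With stride < window,
    neighbouring windows overlap, so every token is seen with some context.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans: List[Sequence[T]] = []
    start = 0
    while start < len(tokens):
        spans.append(tokens[start : start + window])
        # Stop once the current window reaches the end of the text.
        if start + window >= len(tokens):
            break
        start += stride
    return spans


# Ten tokens, windows of 4 that start every 3 tokens.
spans = strided_windows(list(range(10)), window=4, stride=3)
for span in spans:
    print(span)
```

Each window is short enough for the model on its own, and because the stride is smaller than the window, adjacent windows share tokens. As long as the stride does not exceed the window, no token is left out, however long the text is.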

Splitting long documents into sentences only happens during training, in the corpus reader, and the sentence boundaries come from the underlying corpus annotation, not from a component in the pipeline. Those settings have no effect on what happens when you run en_core_web_trf on a new text.
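
As a rough illustration of that training-time behaviour, the sketch below splits an over-long document along its annotated sentences. It is a simplified stand-in and not spaCy's corpus reader: `split_for_training` is a hypothetical name, documents are plain lists of token strings, and dropping sentences that are still too long is an assumption of this sketch.

```python
from typing import Iterator, List


def split_for_training(sentences: List[List[str]], max_length: int) -> Iterator[List[str]]:
    """Yield training texts from one document given as gold-annotated sentences.

    Hypothetical helper for illustration only. `max_length == 0` means
    "no limit" in this sketch.
    """
    doc = [tok for sent in sentences for tok in sent]
    if not doc:
        return
    # Short enough (or no limit): keep the document whole.
    if max_length == 0 or len(doc) < max_length:
        yield doc
        return
    # Too long: fall back to the sentences from the annotation.
    for sent in sentences:
        # Assumption of this sketch: over-long sentences are skipped.
        if sent and len(sent) < max_length:
            yield sent


gold_sentences = [["This", "is", "short", "."], ["So", "is", "this", "one", "."]]
for text in split_for_training(gold_sentences, max_length=6):
    print(text)
```

Nothing like this runs at prediction time. A new text passed to the pipeline goes to the transformer component as a whole and is handled by the span getter.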

Answer selected by svlandeg
Labels
feat / transformer Feature: Transformer
2 participants
Converted from issue

This discussion was converted from issue #9273 on September 23, 2021 13:03.