Does en_core_web_trf truncate documents to 512? #9280
-
The documentation does not seem to specify whether en_core_web_trf truncates documents to 512. Is such a truncation being performed?
Replies: 4 comments
-
Or is it fine as long as each sentence is shorter than 512?
-
Although it says here that spaCy first splits documents longer than 512 into sentences before feeding them in, my experiment does not seem to have a problem.
-
By default the transformer component uses overlapping strided spans (see: https://spacy.io/api/transformer#span_getters), so you can train and predict on longer texts without issues on transformer models that have a fixed max length.

Splitting long documents into sentences is something that happens just during training with the corpus reader, and the sentences come from the underlying corpus annotation, not from a component in the pipeline. Those settings aren't related to what happens when you run en_core_web_trf on a new text.
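
To make the idea of overlapping strided spans concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: strided_windows is a hypothetical helper, and the window and stride values (128 and 96) are chosen only for illustration. It shows how a long token sequence can be covered by fixed-size windows that overlap, so no single window exceeds a fixed length and nothing is cut off.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def strided_windows(tokens: Sequence[T], window: int, stride: int) -> List[Sequence[T]]:
    """Split tokens into overlapping windows.

    Hypothetical helper for illustration only; not taken from spaCy.
    Consecutive windows overlap by (window - stride) tokens when stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < len(tokens):
        spans.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return spans


# A stand-in for a 1000-token document; the values are just token positions.
tokens = list(range(1000))
spans = strided_windows(tokens, window=128, stride=96)

print("number of windows:", len(spans))
print("longest window:", max(len(s) for s in spans))
print("all tokens covered:", {t for s in spans for t in s} == set(tokens))
```

Because every token falls into at least one window, the whole document is processed even though each individual window stays within the fixed size.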