The short answer is that you can ignore this warning; you don't need to truncate or split your docs yourself. By default, the transformer component in spaCy processes longer texts internally as overlapping strided spans (see the settings under [components.transformer.model.get_spans] in your config).
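To make the idea of overlapping strided spans concrete, here is a minimal pure-Python sketch. It is not the spacy-transformers implementation; the function name `strided_spans` is hypothetical, and the values `window=128` and `stride=96` are assumptions chosen for illustration, so check the `get_spans` block of your own config for the values you actually use.

```python
def strided_spans(tokens, window=128, stride=96):
    """Split a token list into overlapping windows (illustrative sketch only).

    `window` is the maximum number of tokens per span and `stride` is how far
    the start of each span advances. Both values are assumptions here.
    """
    spans = []
    start = 0
    while start < len(tokens):
        spans.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the text.
        if start + window >= len(tokens):
            break
        start += stride
    return spans


# A stand-in for a 300-token doc; real input would be spaCy tokens.
tokens = [f"tok{i}" for i in range(300)]
spans = strided_spans(tokens)
print([len(s) for s in spans])  # [128, 128, 108]
```

With these values, consecutive spans share 32 tokens (window minus stride), so every token is seen with some context on both sides, and no single span is longer than 128 word-level tokens regardless of how long the doc is.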

The spaCy tokenization (word tokens, counted with len(doc)) is not the same as the internal transformer tokenization (BPE, WordPiece, etc.). The transformer tokenization usually produces more tokens, but not necessarily. If 128 spaCy tokens correspond to more than the transformer max_length tokens (which is unusual but can happen with things like long URLs), then spacy-transformers

Answer selected by svlandeg
Labels
feat / transformer Feature: Transformer faq Frequently asked questions and solutions.
3 participants