Short answer: to do what you want, you should add a hook that returns zero vectors for OOV terms. For transformer pipelines, there may not be a reasonable way to do this.
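
As a rough sketch of what such a hook could look like (assuming spaCy v3; the component name and the zero-width fallback are illustrative, not an official recipe):

```python
import numpy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("zero_oov_vectors")
def zero_oov_vectors(doc: Doc) -> Doc:
    def token_vector(token):
        vectors = token.vocab.vectors
        if token.orth in vectors:
            # the token has a real entry in the static vectors table
            return vectors[token.orth]
        # OOV: return zeros instead of any tok2vec fallback, using the
        # width of the vectors table (or of doc.tensor if there is none)
        width = token.vocab.vectors_length or token.doc.tensor.shape[-1]
        return numpy.zeros((width,), dtype="float32")

    # user hooks are registered per doc, so set the override in a component
    doc.user_token_hooks["vector"] = token_vector
    return doc
```

With `nlp.add_pipe("zero_oov_vectors")` added at the end of the pipeline, `Token.vector` then returns zeros for anything that isn't in the vectors table.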

In more detail... Aside from the recent floret releases, spaCy pipelines don't use subwords for word vectors. However, if word vectors are not present, the tok2vec contextual embedding can be returned instead, which is what happens in the small and transformer pipelines.
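
For example, in a pipeline without static vectors you can see this fallback directly (a small sketch, assuming `en_core_web_sm` is installed):

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_sm")  # pipeline with no static word vectors
doc = nlp("This pipeline has no vectors table")

# with an empty vectors table, Token.vector falls back to the tok2vec
# output, i.e. the corresponding row of doc.tensor (expected: True)
print(numpy.array_equal(doc[0].vector, doc.tensor[0]))
```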

For transformer pipelines, token vectors are calculated by aligning the subwords generated by the underlying HuggingFace tokenizer to the spaCy tokens and then combining the embeddings of the aligned subwords (see the docs). The Transformers tokenizers are designed to use subwords in a way tha…
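
As a toy illustration of that alignment-and-pooling step (plain numpy, not the actual spacy-transformers internals; the data and the choice of mean pooling here are hypothetical):

```python
import numpy

# hypothetical data: 4 subword embeddings of width 8, where spaCy
# token 0 aligns to wordpiece row [0] and token 1 to rows [1, 2, 3]
subword_embeddings = numpy.random.rand(4, 8).astype("float32")
alignment = [[0], [1, 2, 3]]

# pool the aligned wordpiece rows into one vector per spaCy token
token_vectors = numpy.vstack(
    [subword_embeddings[rows].mean(axis=0) for rows in alignment]
)
print(token_vectors.shape)  # (2, 8)
```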

Answer selected by adrianeboyd
This discussion was converted from issue #10565 on March 31, 2022 10:29.