This is a confusing legacy backoff behavior: when the pipeline has no static word vectors, doc.vector (and token.vector) back off to doc.tensor.

What you're seeing here aren't static word vectors like those from word2vec or GloVe, but the context-sensitive tensors from the tok2vec component. The tok2vec component can generate a vector for any token, but these vectors aren't really useful for anything other than the downstream pipeline components (tagger, parser). They're not particularly good for word similarity. See the first yellow warning box here: https://spacy.io/usage/linguistic-features#vectors-similarity

If you download en_core_web_md or en_core_web_lg, you'll see static word vectors under token.vector with OOV (all zero) vectors for unknown words like Punja…
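
As a rough sketch of how you might check which case you're in: the helper name `has_static_vectors` below is made up for illustration, and it assumes that an empty `nlp.vocab.vectors` table means token.vector can only come from the tensor backoff. A blank pipeline is used so the snippet runs without downloading a model; the commented lines assume en_core_web_md is installed.

```python
import spacy


def has_static_vectors(nlp):
    """Return True if the pipeline's vocab carries a static word-vector table.

    Hypothetical helper: it only inspects the size of nlp.vocab.vectors.
    """
    return nlp.vocab.vectors.size > 0


# A blank pipeline needs no model download and ships no static vectors.
nlp = spacy.blank("en")
print(has_static_vectors(nlp))

# With a downloaded package (assumed to be installed), the same check applies,
# and token.is_oov / token.vector_norm show which words lack a static vector:
# nlp = spacy.load("en_core_web_md")
# for token in nlp("some text to inspect"):
#     print(token.text, token.is_oov, token.vector_norm)
```

If the check comes back false, anything you read from token.vector is the tok2vec output (or empty), so similarity scores computed from it shouldn't be read as word-vector similarity.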

Answer selected by svlandeg
Labels
feat / vectors Feature: Word vectors and similarity
3 participants