Vectors corresponding to other language #10013

sonynavdeep81 · 2022-01-10T06:04:48Z


Hi, I have a very basic question. I am creating a blank spaCy model for the 'en' language. Because it is a blank model, it does not contain token vectors. I then add the tok2vec component from the en_core_web_sm pipeline using add_pipe, so the model now produces token vectors for English text. What surprises me is this: all of these vectors should correspond to English, yet if I pass Punjabi text instead of English text (Punjabi is a language that spaCy does not yet support), it still returns vectors, as can be seen in the code and corresponding output below.

Since we are using the English language, how can the model contain vectors for Punjabi? Please explain.


adrianeboyd · 2022-01-10T11:35:55Z


This is a confusing legacy backoff behavior from doc.tensor to doc.vector.

What you're seeing here aren't static word vectors like those from word2vec or GloVe, but the context-sensitive tensors from the tok2vec component. The tok2vec component is able to generate a vector for any token, but these vectors aren't really useful for anything other than the pipeline components that follow it (tagger, parser). They're not particularly good for word similarity. See the first yellow warning box here: https://spacy.io/usage/linguistic-features#vectors-similarity

If you download en_core_web_md or en_core_web_lg, you'll see static word vectors under token.vector with OOV (all zero) vectors for unknown words like Punjabi tokens.
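The backoff described above can be reproduced without downloading a trained pipeline. The following is a minimal sketch, assuming spaCy v3 and NumPy are installed: instead of sourcing tok2vec from en_core_web_sm, it fills doc.tensor by hand with random values (the width of 4 and the random data are illustrative assumptions, not what a real tok2vec produces). Because the blank pipeline has no static vectors table, token.vector falls back to the matching row of doc.tensor, which is why even Punjabi tokens appear to "have" a vector.

```python
import numpy
import spacy

# Blank English pipeline: tokenizer only, no static word vectors.
nlp = spacy.blank("en")
doc = nlp("ਮੈਂ ਪੰਜਾਬੀ ਬੋਲਦਾ ਹਾਂ")

# Stand-in for the tok2vec output (assumption: random values, width 4).
# In the original setup, the sourced tok2vec component sets doc.tensor.
rng = numpy.random.default_rng(0)
doc.tensor = rng.random((len(doc), 4), dtype=numpy.float32)

# The static vectors table is empty, so token.vector backs off to doc.tensor.
print("static vectors table size:", nlp.vocab.vectors.size)
print("token.vector:", doc[0].vector)
print("same as doc.tensor row:", numpy.array_equal(doc[0].vector, doc.tensor[0]))
```

With a pipeline that ships static vectors, such as en_core_web_md or en_core_web_lg, the vectors table is not empty, so token.vector is looked up in that table instead and the fallback is not used.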

2 replies

sonynavdeep81 Jan 10, 2022
Author

Thanks for the nice explanation. Can you please explain:

What are context-sensitive tensors, and how are they generated automatically?
What purpose do the context-sensitive tensors serve (why are they needed)?
How are they useful for the tagger and parser?
If we load static vectors, will tok2vec still generate vectors? If yes, what for?

Actually I am a beginner, so I am sorry if I am asking really basic questions. Thanks.

polm Jan 17, 2022

Please read the docs on pipelines and embeddings, which should answer your questions.
