What happens to the similarity for words with no vectors? #10483

e412 · 2022-03-13T09:28:12Z

Hello everyone :-)

I am trying to calculate the similarity between two text columns in a dataframe that contain very domain-specific German. I am using the spaCy "de_core_news_lg" model. Only about 60% of my tokens have a vector in the model. I am calculating the similarity with spaCy's .similarity method.

I can't find anything in the documentation about what happens to the words with no vector. How are they included in the similarity score?
Example: for row x, my columns "pzb stören luftverlust" and "pzb stören" score a similarity of 0.9999999582115684, even though "pzb" of course does not have a word vector.

It seems to me that those unmatched words are included in the score somehow?

Any information highly appreciated.

Thanks in advance

Eva

adrianeboyd · 2022-03-15T08:02:43Z

Thanks for bringing this up! It turns out that the docs were a bit out-of-date here and we'll get it updated soon in #10486.

If the token doesn't have a vector, then you currently get an all-0 vector instead, so in these examples it's averaging all-0 vectors for the OOV tokens with the vector for "stören" to get the doc vector, and they end up pretty similar.
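
As a rough illustration of why the scores end up so close to 1.0, here is a minimal sketch in plain Python. It is not spaCy's actual implementation: the 3-dimensional vector for "stören" is made up, and it assumes that "luftverlust" is OOV just like "pzb". Averaging in all-0 vectors only rescales the one real vector, and cosine similarity ignores scale.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up toy vector standing in for the real "stören" vector (assumption).
stoeren = [0.2, -0.5, 0.7]
# OOV tokens contribute an all-0 vector to the average.
oov = [0.0, 0.0, 0.0]

# "pzb stören luftverlust": assumes both "pzb" and "luftverlust" are OOV.
doc_a = mean_vector([oov, stoeren, oov])
# "pzb stören"
doc_b = mean_vector([oov, stoeren])

similarity = cosine(doc_a, doc_b)
print(similarity)  # very close to 1.0, up to floating-point rounding
```

Both doc vectors point in the same direction as the "stören" vector and differ only in length, so the OOV tokens have no effect on the score. To see which tokens are affected in your own data, you can check `token.has_vector` for each token in a doc.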

For similarity between short texts, we'd recommend using a library like sentence-transformers instead, which works better than averaging the static token vectors across a doc. There's also a third-party spaCy add-on for this that you could try out (I haven't tested it myself recently, though): https://spacy.io/universe/project/spacy-sentence-bert

And if only about 60% of your tokens have vectors, then it might make sense to train your own vectors. If your texts have a lot of OOV compounds, then I'd recommend using floret vectors instead, but we don't provide a pipeline with them for German yet. If you're interested in training vectors yourself (you'd need a machine with a lot of hard drive space and a good CPU), I can point you to some demos and sample projects.
