What happens to the similarity for words with no vectors? #10483
Replies: 1 comment
-
Thanks for bringing this up! It turns out that the docs were a bit out-of-date here, and we'll get them updated soon in #10486.
If a token doesn't have a vector, you currently get an all-0 vector instead. So in these examples, the doc vector is the average of the all-0 vectors for the OOV tokens and the vector for "stören". Both doc vectors therefore point in the same direction as the "stören" vector, and the two texts end up with a similarity very close to 1.
For text similarity for short texts, we can recommend using a library like […]. And if 60% of your tokens don't have vectors, then it might make sense to train your own vectors. If it's a case where your texts have a lot of OOV compounds, then I could recommend using floret vectors instead, but we don't have a provided pipeline with them for German yet. If you're interested in training vectors yourself (you'd need a machine with a lot of hard drive space and a good CPU), I can point you to some demos and sample projects.
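To make the effect concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not spaCy's actual code: it assumes the doc vector is a plain average of the token vectors and that similarity is the cosine between doc vectors. The 3-dimensional vector for "stören" is made up for illustration, and "pzb" and "luftverlust" are both assumed to be OOV.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vector for "stören"; real model vectors are much longer.
STOEREN = [0.2, -0.5, 0.7]
# Assumption from the answer above: OOV tokens get an all-0 vector.
ZERO = [0.0, 0.0, 0.0]

# "pzb stören luftverlust": two assumed-OOV tokens plus "stören".
doc_a = mean_vector([ZERO, STOEREN, ZERO])
# "pzb stören": one assumed-OOV token plus "stören".
doc_b = mean_vector([ZERO, STOEREN])

similarity = cosine(doc_a, doc_b)
print(similarity)
```

Averaging in the zero vectors only shrinks the doc vector (to one third and one half of the "stören" vector here) without changing its direction. Cosine similarity ignores length, so the score comes out at 1 up to floating-point rounding, which matches the 0.9999999582115684 reported in the question.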
-
Hello everyone :-)
I am trying to calculate the correlation of two text columns in a dataframe containing very domain-specific German. I am using the spaCy "de_core_news_lg" model. Only about 60% of my tokens have a vector in the model. I am calculating the similarity with spaCy's .similarity method.
I can't find anything in the documentation about what happens to the words with no vector. How are they included in the similarity score?
Example: for row x, my columns "pzb stören luftverlust" and "pzb stören" score a similarity of 0.9999999582115684, even though "pzb" of course does not have a word vector.
It seems to me that the words without a vector are somehow included in the score?
Any information highly appreciated.
Thanks in advance
Eva