most_similar() : strange results #12720

DataAndMaths · 2023-06-12T23:47:37Z

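With the following code:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

your_word = "country"

# most_similar() expects a 2D batch of query vectors and returns (keys, best_rows, scores)
ms = nlp.vocab.vectors.most_similar( np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
print(words)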

I get strange results:

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']

With other words:

"dog" : ['dogsbody', 'wolfdogs', 'Baeg', 'duppy', 'pet(s', 'postcanine', 'Kebira', 'uppies', 'Toropets', 'moggie']
"king" : ['kingi', 'musulmanes', 'princedoms', 'Akkarin', "d'annuler", 'Olabode', 'mucronate', 'Roulers', 'Silverthrone', 'Jovinus'].

Answered by kadarakos

kadarakos · 2023-06-13T09:54:43Z

Hey DataAndMaths,

For the en_core_web_md model, we prune the vector tables to save memory. For this use case, it might be worth trying the pretrained en_core_web_lg model instead. Here is the similarity list it returns for "country":

['country', 'country-', 'country\x92s', 'country`s', 'country"s', 'countryâ€', 'countrys', 'country—0,467', 'country--', 'countr', 'countryâ\x80\x99s', 'lowcountry', 'Upcountry', 'upcountry', 'countrywomen', 'countrywide', 'Lowcountry', 'thecountry', 'intercountry', 'countrywoman', 'countries-', 'nation', 'Westcountry', 'countrymen', 'countryman', 'countries', 'continent', 'countrysides', 'Kountry', 'countrified', 'nationâ\x80\x99s', 'countryCredit', 'nations', 'backcountry', 'nationalities', 'countryside', 'nationhood', 'cities', 'stateside', 'nationals', 'region', 'continents', 'states-', 'nationalising', 'nationally', 'world', 'homelands', 'governmentality', 'countout', 'region-']
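To make the effect of pruning a bit more concrete, here is a toy sketch in plain Python. It is not spaCy's implementation, and the table, the vectors, and the helper names (`cosine`, `most_similar_keys`) are all made up for illustration. The assumption it illustrates is that in a pruned table several keys share one stored vector, so a nearest-neighbour search over the stored rows can only hand back one representative key per row, and that key need not be the "cleanest" spelling.

```python
import math

# Hypothetical pruned table: 6 keys share only 3 stored vectors.
rows = [
    [1.0, 0.0],  # row 0
    [0.9, 0.4],  # row 1
    [0.0, 1.0],  # row 2
]
key2row = {
    "country": 0, "country-": 0, "countrys": 0,
    "nation": 1, "nations": 1,
    "dog": 2,
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Inverting the mapping keeps a single key per row: later keys overwrite earlier ones.
row2key = {row: key for key, row in key2row.items()}


def most_similar_keys(word, n=2):
    """Rank the stored rows against the word's vector, then report one key per row."""
    query = rows[key2row[word]]
    ranked = sorted(range(len(rows)), key=lambda r: cosine(query, rows[r]), reverse=True)
    return [row2key[r] for r in ranked[:n]]


print(most_similar_keys("country"))
```

In this toy table the best match for "country" is its own row, but the key reported for that row is "countrys", simply because it was the last key mapped to it. With a full, unpruned table every key has its own row, so the neighbours look more like what you would expect.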

2 replies

DataAndMaths Jun 19, 2023
Author

Thanks kadarakos,

I had actually tried what you suggest as well, and the results are indeed more consistent.
But the output was so weird with "en_core_web_md" (and so different from the tutorial) that I was wondering whether there was a problem somewhere or whether this is perfectly normal.

kadarakos Jun 23, 2023

Oh, a subtle detail is that for each version of the en_core_web_md model, we might prune the vector tables differently so you can end up with a different vocabulary between different versions!
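If you want to check which version you have and how heavily its table is pruned, something along these lines might help. It is only a sketch: `summarize_vectors` is a hypothetical helper name, and it assumes a loaded pipeline exposing `nlp.meta` and `nlp.vocab.vectors` with the `shape` and `n_keys` attributes.

```python
def summarize_vectors(nlp):
    """Return the pipeline version and the size of its vector table.

    `nlp` is expected to be a loaded spaCy pipeline; only `nlp.meta`
    and `nlp.vocab.vectors` are read.
    """
    vectors = nlp.vocab.vectors
    n_rows, width = vectors.shape
    return {
        "name": nlp.meta["name"],
        "version": nlp.meta["version"],
        "n_keys": vectors.n_keys,  # keys that have some vector assigned
        "n_rows": n_rows,          # vectors actually stored in the table
        "width": width,            # dimensionality of each vector
    }


# Usage (needs spaCy and the model installed):
# import spacy
# print(summarize_vectors(spacy.load("en_core_web_md")))
```

When `n_keys` is much larger than `n_rows`, many keys share a vector. Comparing the summary across two installed versions of the same model is a quick way to see whether the tables differ.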
