Different (and incorrect) similarity scores across different spaCy versions #10906

zkytony · 2022-06-02T23:57:00Z

How to reproduce the behaviour

Run the following code with spaCy 2.2.4 and with spaCy 3.3.0:

import spacy

spacy_model = spacy.load("en_core_web_md")
above = spacy_model("above")
front = spacy_model("front")
print(above.similarity(front))

The two versions give different numbers:

with 2.2.4: 0.4393605638230988
with 3.3.0: 1.000000073434088

Note that "above" and "front" are different words, yet with 3.3.0 they score as more similar than "front" vs. "front" (in terms of the absolute value of the similarity score).

Your Environment

Operating System: Ubuntu 20.04
Python Version Used: 3.8.10
spaCy Version Used: 2.2.4 or 3.3.0
Environment Information: Ubuntu 20.04 desktop

Answered by adrianeboyd

adrianeboyd · 2022-06-03T07:15:05Z

If you use the en_core_web_lg vectors I think you should get the same (or extremely similar) results for spacy v2.2-v3.3.

In the en_core_web_md models, the vectors are pruned so that multiple words get clustered together with the same vector. The pruning step isn't deterministic, so each version of en_core_web_md may have slightly different clusters and vectors.

From your results, it looks like "above" and "front" ended up in the same cluster in the v3.3.0 model but not in v2.2.5. These are model versions, which you can see with pip freeze or spacy validate, not spaCy versions. There were also some minor changes related to vector deduplication in v3.3.0 that affected the English vectors in particular (#10551).
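
To make the clustering effect concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the vectors are made up, and `prune`, `key2row`, and `similarity` are hypothetical helpers written for this illustration. It only shows how remapping a dropped word to its nearest kept vector yields a similarity of about 1.0 between two different words.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prune(vectors, keep):
    """Keep rows only for the words in `keep`; point every other word
    at the most similar kept row (hypothetical helper, toy logic)."""
    rows = [vectors[word] for word in keep]
    key2row = {word: i for i, word in enumerate(keep)}
    for word, vec in vectors.items():
        if word not in key2row:
            key2row[word] = max(
                range(len(rows)), key=lambda i: cosine(vec, rows[i])
            )
    return rows, key2row


def similarity(rows, key2row, a, b):
    """Similarity of two words after pruning, via their assigned rows."""
    return cosine(rows[key2row[a]], rows[key2row[b]])


# Made-up 3-dimensional vectors, for illustration only.
full = {
    "front": [0.9, 0.1, 0.3],
    "above": [0.8, 0.3, 0.2],
    "banana": [0.0, 1.0, 0.1],
    "apple": [0.1, 0.9, 0.2],
}

# Pruning A drops "above", so it is remapped onto the row for "front".
rows_a, key2row_a = prune(full, keep=["front", "banana"])
sim_a = similarity(rows_a, key2row_a, "above", "front")

# Pruning B keeps "above", so it retains its own vector.
rows_b, key2row_b = prune(full, keep=["front", "above", "banana"])
sim_b = similarity(rows_b, key2row_b, "above", "front")

print(f"pruning A: {sim_a:.4f}")
print(f"pruning B: {sim_b:.4f}")
```

Under pruning A the two words share one row, so the score is 1.0 up to floating-point rounding; under pruning B they keep distinct vectors and the score drops below 1.0. Which words get merged depends entirely on which rows survive pruning, which is why two model versions can disagree.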

In general, there's no guarantee from our side that different versions of an English pipeline will have the exact same vectors (or the exact same results for other components like the tagger), so if you need the exact results from one particular version, be sure to specify the exact model version in your project requirements along with the exact spacy version.
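
Besides pinning both packages in your requirements, a project could also verify the pins at startup. The following is a minimal sketch using only the standard library. The function name `find_mismatches` and the `EXPECTED` mapping are assumptions made for this example, and the version numbers are simply the ones mentioned in this thread. The distribution name under which the model is installed may differ in your environment (check `pip freeze`).

```python
from importlib import metadata

# Assumed pins, taken from the versions discussed in this thread.
# Replace them with the pair you actually validated your results against.
EXPECTED = {"spacy": "3.3.0", "en_core_web_md": "3.3.0"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package whose
    installed version differs from the pinned one. `installed` is None
    when the package is not installed at all."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {installed}")
```

The `get_version` parameter exists only so the lookup can be swapped out in tests; in normal use the default, `importlib.metadata.version`, is enough.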
