Similarity scores quite a bit different between en_core_web_lg-3.1.0 and en_core_web_lg-3.4.1. Why? #12266

bb150000 · 2023-02-09T17:30:42Z


TLDR: We are running tens of millions of entity.similarity checks between entities and search terms. The person who designed the scoring algorithm used en_core_web_lg-3.1.0, whereas I (the person tasked with loading the results into a database) am using en_core_web_lg-3.4.1. The scores from the entity.similarity checks are wildly different between the two versions. Am I doing something wrong?

The organization that I work for has many (1000+) official documents that govern policy, standards, etc. We believe that there is a lot of duplication/redundancy in these documents and are trying to merge similar documents or remove redundant ones in a quest for efficiency.
To look for similar documents, we decided to convert them all to text, read them into spaCy (NLP(doc_txt)), discard the entities generated by the NLP call, and replace them with the noun chunks identified within the document. We then use the similarity function to compare these noun chunks to a common set of search terms that are also read into spaCy (NLP). Noun chunk/search term pairs with a high enough similarity score are considered a match. The matched search terms are then used to calculate a cosine similarity between all combinations of documents. Combinations that score high enough become candidates for merging/elimination.
A colleague developed the spaCy scoring algorithm between the noun chunks turned entities and the search terms. I am responsible for running the algorithm against all of the documents, storing the results in a database as I go. She developed the algorithm using the en_core_web_lg-3.1.0 model. When I started adapting the code for my purposes, I installed the most recent version at the time, en_core_web_lg-3.4.1. Our similarity scores are wildly different. If I load en_core_web_lg-3.1.0, I get the same results that she did. 3.4.1 seems to score things quite a bit higher, so we end up with many more “matches” using the newer version. Any idea why the scores are so different? Which one should I trust as being the most accurate? Thanks in advance!
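
For readers following along, the last step of the pipeline described above (cosine similarity between documents, based on their matched search terms) might look roughly like the sketch below. It is not the original code: it assumes each document is represented by counts of its matched search terms, and the file names, terms, counts, and the 0.9 cutoff are all made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no matched terms is similar to nothing
    return dot / (norm_a * norm_b)


def candidate_pairs(docs: dict, threshold: float) -> list:
    """Return (doc_a, doc_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for name_a, name_b in combinations(sorted(docs), 2):
        score = cosine(docs[name_a], docs[name_b])
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Hypothetical input: search terms matched per document, with match counts.
matched_terms = {
    "policy_a.txt": Counter({"access control": 3, "data retention": 1}),
    "policy_b.txt": Counter({"access control": 2, "data retention": 1}),
    "standard_c.txt": Counter({"travel expenses": 4}),
}

# 0.9 is an arbitrary cutoff chosen for this example only.
candidates = candidate_pairs(matched_terms, threshold=0.9)
print(candidates)
```

Because the set of matched terms per document depends on the noun chunk/search term threshold, a model that scores pairs higher changes the input to this step, not just the first step.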

Answered by adrianeboyd


adrianeboyd · 2023-02-09T20:52:22Z


The English word vectors were updated between the v3.3.x and v3.4.x models, and the new ones are completely different, unrelated vectors. The new vectors are slightly better for use in the trained pipelines (or we wouldn't have updated them!), but you'd just have to check for your task. The old vectors weren't case-sensitive, if that matters for your task.
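
One way to "check for your task" is to hand-label a small sample of noun chunk/search term pairs and see how a given match threshold behaves on the scores from each model. The sketch below is only an illustration of that idea: the helper name is made up, and the scores and labels are placeholder values, not real output from either model.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: iterable of (similarity_score, is_true_match) tuples."""
    scored_pairs = list(scored_pairs)
    tp = sum(1 for score, label in scored_pairs if score >= threshold and label)
    fp = sum(1 for score, label in scored_pairs if score >= threshold and not label)
    fn = sum(1 for score, label in scored_pairs if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# PLACEHOLDER numbers for illustration only, not measured from any model.
# Each tuple is (similarity score, whether a human judged the pair a match).
old_model_scores = [(0.82, True), (0.75, True), (0.40, False), (0.35, False)]
new_model_scores = [(0.90, True), (0.85, True), (0.78, False), (0.50, False)]

threshold = 0.7  # hypothetical cutoff tuned against the old model
old_result = precision_recall(old_model_scores, threshold)
new_result = precision_recall(new_model_scores, threshold)
print("old:", old_result)
print("new:", new_result)
```

If a cutoff tuned on one set of vectors lets through extra false matches on the other, the threshold needs to be re-tuned per model rather than carried over.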

1 reply

bb150000 Feb 10, 2023
Author

Thank you!

honnibal · 2023-02-09T23:23:14Z

Maintainer

It's possible to use the old vectors with v3.5, if that helps you. You can load in the old version, save out the vectors, run the new version, load a pipeline, change the vectors, and save the result.

However, in the *_lg pipelines, the vectors are used as features in the component models. This means that if you run those models with vectors other than what the models were trained with, the performance will be really bad (perhaps no better than chance). If you use the *_sm models, or don't need to use the pipeline components, it should all work though.
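
A rough sketch of the steps above, split into two helpers because the old and new model versions need different spaCy installs. The function names and paths are hypothetical, and it assumes the vectors saved by the older version load cleanly in the newer one, which is worth verifying. Per the caveat above, the target here would be a pipeline whose components don't use the vectors as features, such as en_core_web_sm.

```python
from pathlib import Path


def export_vectors(model_name: str, vectors_dir: str) -> None:
    """Run in the environment that has the OLD model version installed."""
    import spacy  # imported lazily: each helper runs in a different environment

    nlp = spacy.load(model_name)
    out = Path(vectors_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Save only the vector table, not the rest of the pipeline.
    nlp.vocab.vectors.to_disk(out)


def swap_vectors(pipeline_name: str, vectors_dir: str, output_dir: str) -> None:
    """Run in the environment that has the NEW spaCy version installed."""
    import spacy
    from spacy.vectors import Vectors

    nlp = spacy.load(pipeline_name)
    # Assumption: the on-disk vectors format is readable across these versions.
    old_vectors = Vectors().from_disk(vectors_dir)
    nlp.vocab.vectors = old_vectors
    # The saved directory can later be loaded with spacy.load(output_dir).
    nlp.to_disk(output_dir)


# Example usage (hypothetical paths), one call per environment:
#   export_vectors("en_core_web_lg", "old_vectors")                 # old install
#   swap_vectors("en_core_web_sm", "old_vectors", "sm_old_vectors")  # new install
```

After the swap, similarity calls on the saved pipeline would use the old vectors, so it makes sense to spot-check a few pairs against scores from the original 3.1.0 setup before running the full job.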

1 reply

bb150000 Feb 10, 2023
Author

Thank you! I'll futz around with it and see what I can get.
