Similarity scores quite a bit different between en_core_web_lg-3.1.0 and en_core_web_lg-3.4.1. Why? #12266
TLDR: We are running a very large number (tens of millions) of entity.similarity checks between entities and search terms. The person who designed the scoring algorithm was using en_core_web_lg-3.1.0, whereas I (the person tasked with loading the results into a database) am using en_core_web_lg-3.4.1. The scores from the entity.similarity checks are wildly different between the two versions. Am I doing something wrong?

The organization that I work for has many (1000+) official documents that govern policy, standards, etc. We believe that there is a lot of duplication/redundancy in these documents and are trying to merge similar documents or remove redundant ones in a quest for efficiency.
The English word vectors were updated between the v3.3.x and v3.4.x models, and the new vectors are completely different from and unrelated to the old ones, so similarity scores computed with one set are not comparable to scores computed with the other. The new vectors are slightly better for use in the trained pipelines (or we wouldn't have updated them!), but you'd have to check which set works better for your task. The old vectors weren't case-sensitive, if that matters for your task.
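To make the point concrete, here is a minimal sketch (not anything from the spaCy codebase). `cosine` mirrors the cosine similarity that spaCy's default `.similarity` is based on, and the two toy tables are made-up 3-dimensional vectors standing in for "old" and "new" vector sets. `describe_vectors` is a hypothetical helper for checking which pipeline version and vector table an environment actually loads; it assumes `en_core_web_lg` is installed there.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare for an all-zero vector
    return dot / (norm_u * norm_v)


def describe_vectors(model_name="en_core_web_lg"):
    """Hypothetical helper: report which pipeline and vectors are loaded."""
    import spacy  # imported lazily so the toy example below runs without spaCy

    nlp = spacy.load(model_name)
    return {
        "pipeline_version": nlp.meta["version"],
        "vectors_meta": nlp.meta.get("vectors"),
        "vectors_shape": nlp.vocab.vectors.shape,
    }


# Made-up vectors for the same two words in two unrelated tables.
old_table = {"policy": [1.0, 0.0, 0.0], "standard": [1.0, 1.0, 0.0]}
new_table = {"policy": [0.0, 1.0, 0.0], "standard": [3.0, 4.0, 0.0]}

old_score = cosine(old_table["policy"], old_table["standard"])
new_score = cosine(new_table["policy"], new_table["standard"])
print(f"old table: {old_score:.3f}  new table: {new_score:.3f}")
```

The same word pair gets a different score from each table, so any threshold tuned against one table would need to be re-tuned against the other. Running `describe_vectors()` in both environments is one way to confirm that the two setups really do load different vectors.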
It's possible to use the old vectors with v3.5, if that helps you. You can load the old version, save out the vectors, switch to the new version, load a pipeline, replace its vectors, and save the result. However, in the ...
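The steps above could be sketched roughly as follows. This is an untested outline, not an official recipe. It assumes two separate Python environments (one with the old spaCy and `en_core_web_lg` 3.1.0, one with the new spaCy and a newer `en_core_web_lg`), since both model versions share one package name. The function names and directory paths are hypothetical; the spaCy calls used are `Vectors.to_disk`, `Vectors.from_disk`, and `Language.to_disk`.

```python
from pathlib import Path


def export_vectors(model_name, out_dir):
    """Run in the OLD environment: save the pipeline's vector table to disk."""
    import spacy  # lazy import so this module loads even without spaCy

    nlp = spacy.load(model_name)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    nlp.vocab.vectors.to_disk(out)
    return out


def build_pipeline_with_old_vectors(model_name, vectors_dir, out_dir):
    """Run in the NEW environment: swap in the saved vectors and save the result."""
    import spacy

    nlp = spacy.load(model_name)
    # Overwrite the new vector table in place with the exported old one.
    nlp.vocab.vectors.from_disk(vectors_dir)
    nlp.to_disk(out_dir)
    return Path(out_dir)


# Hypothetical usage, one call per environment:
#   old env: export_vectors("en_core_web_lg", "old_lg_vectors")
#   new env: build_pipeline_with_old_vectors(
#                "en_core_web_lg", "old_lg_vectors", "en_lg_with_old_vectors")
#   later:   nlp = spacy.load("en_lg_with_old_vectors")
```

After building the combined pipeline, it would be worth spot-checking a handful of entity/search-term pairs against scores from the original 3.1.0 setup before re-running the full job.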