How to remove similar items in big datasets? #12014
Is there a way to find and remove similar items in datasets? Thanks
Replies: 1 comment 1 reply
Finding similar documents is a whole field of research, so it's hard to give a brief summary. As a starting point, you could use one of the spaCy pipelines that includes word vectors together with a nearest neighbor search library like Annoy. That should be much faster than comparing every pair of items in a nested for loop. You would have to define yourself how similar an item has to be before it needs to be removed.
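To make that a bit more concrete, here is a minimal sketch of the idea, not a tuned or definitive solution. The helper names (`embed_with_spacy`, `build_annoy_lookup`, `drop_near_duplicates`), the `en_core_web_md` pipeline, the 0.95 cosine threshold, and the `n_trees`/`k` values are all assumptions for illustration. Items are visited in order, and an item is dropped if one of its candidate neighbors that was already kept is at least as similar as the threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(vectors, candidates_for, threshold=0.95):
    """Return the indices to keep, visiting items in order.

    `candidates_for(i)` yields indices worth comparing against item i,
    for example the approximate nearest neighbors from Annoy.
    The threshold is an assumption; pick it for your own data.
    """
    kept = []
    kept_set = set()
    for i, vec in enumerate(vectors):
        is_duplicate = any(
            j in kept_set and cosine_similarity(vec, vectors[j]) >= threshold
            for j in candidates_for(i)
        )
        if not is_duplicate:
            kept.append(i)
            kept_set.add(i)
    return kept


def embed_with_spacy(texts, model="en_core_web_md"):
    """Embed texts with a spaCy pipeline that ships word vectors (assumed model name)."""
    import spacy  # third-party, imported lazily

    nlp = spacy.load(model)
    return [doc.vector.tolist() for doc in nlp.pipe(texts)]


def build_annoy_lookup(vectors, n_trees=10, k=10):
    """Build an Annoy index and return a function giving the k nearest items to item i."""
    from annoy import AnnoyIndex  # third-party, imported lazily

    index = AnnoyIndex(len(vectors[0]), "angular")
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(n_trees)
    return lambda i: index.get_nns_by_item(i, k)


# Possible usage (requires spaCy, a vectors model, and Annoy to be installed):
#   vectors = embed_with_spacy(texts)
#   keep = drop_near_duplicates(vectors, build_annoy_lookup(vectors))
#   deduplicated = [texts[i] for i in keep]
```

Because Annoy is approximate and only returns `k` neighbors per item, a group with many more than `k` near-identical items may not be reduced to a single representative, so `k` likely needs to be larger than the biggest cluster of duplicates you expect. Averaged word vectors are also a fairly coarse notion of similarity, so it is worth checking a sample of the removed items before settling on a threshold.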