How to remove similar items in big datasets? #12014

info2000 · 2022-12-21T23:08:41Z

Is there a way to find and remove similar items in a dataset?
I want my eval datasets to be as diverse and lightweight as possible by finding and removing similar items, without looping over each element against all the other elements.

Thanks

Answered by polm

polm · 2022-12-22T02:55:13Z

Finding similar documents is a whole field of research so it's hard to give a brief summary, but you could use one of the spaCy pipelines with word vectors and a nearest neighbor search library like Annoy as a starting point - it should be much faster than a nested for loop. You would have to define yourself how similar an item has to be before it needs to be removed.
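
A minimal sketch of the approach described above, for illustration only. It assumes the `en_core_web_md` pipeline (any spaCy pipeline that ships word vectors should work) and Annoy's `"angular"` metric. The function names `embed_texts` and `build_index` and the `n_trees=10` setting are hypothetical choices, not something stated in this thread.

```python
from typing import Iterable, List, Sequence


def embed_texts(texts: Iterable[str], model_name: str = "en_core_web_md") -> List[Sequence[float]]:
    """Return one averaged word vector per text.

    Assumption: `model_name` is a spaCy pipeline that includes word vectors.
    """
    import spacy  # imported here so the module loads without spaCy installed

    nlp = spacy.load(model_name)
    # doc.vector is the average of the token vectors in the document
    return [doc.vector for doc in nlp.pipe(texts)]


def build_index(vectors: Sequence[Sequence[float]], n_trees: int = 10):
    """Build an Annoy index over the given vectors.

    More trees generally give better recall at the cost of a larger index.
    """
    from annoy import AnnoyIndex  # imported here so the module loads without Annoy

    dim = len(vectors[0])
    index = AnnoyIndex(dim, "angular")
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(n_trees)
    return index
```

With an index built this way, `index.get_nns_by_item(i, k, include_distances=True)` returns the `k` approximate nearest neighbors of item `i` together with their distances, which avoids comparing every item against every other item.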

1 reply

info2000 Dec 28, 2022
Author

Thanks, using Annoy is much faster.
In case someone is in this situation: I filter out all the samples with a distance below 0.15.
A dataset with 1 million examples takes minutes, versus hours when looping over the items to compare similarity.
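
The thread does not show how the filter was implemented, so the following is only one possible sketch: a greedy pass that keeps an item and drops any later item whose distance to it is below the threshold. The 0.15 cutoff comes from the comment above. The names `filter_near_duplicates` and `dedupe_with_annoy`, the neighbor count `k=50`, and the use of Annoy's angular metric are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# A callable that maps an item id to (neighbor ids, neighbor distances)
Neighbors = Callable[[int], Tuple[Sequence[int], Sequence[float]]]


def filter_near_duplicates(n_items: int, neighbors: Neighbors, threshold: float = 0.15) -> List[int]:
    """Greedily keep items, dropping later items closer than `threshold` to a kept one."""
    removed = set()
    kept = []
    for i in range(n_items):
        if i in removed:
            continue
        kept.append(i)
        ids, dists = neighbors(i)
        for j, d in zip(ids, dists):
            # Items before i were already decided, so only mark later ones
            if j > i and d < threshold:
                removed.add(j)
    return kept


def dedupe_with_annoy(index, k: int = 50, threshold: float = 0.15) -> List[int]:
    """Apply the greedy filter using an already built AnnoyIndex.

    Only the k nearest neighbors of each item are checked, so the result
    is approximate (k is a hypothetical setting to tune).
    """
    return filter_near_duplicates(
        index.get_n_items(),
        lambda i: index.get_nns_by_item(i, k, include_distances=True),
        threshold,
    )


def angular_to_cosine(distance: float) -> float:
    """Convert an Annoy angular distance, sqrt(2 * (1 - cos)), to cosine similarity."""
    return 1.0 - distance * distance / 2.0
```

If the angular metric is what was used, a distance of 0.15 corresponds to a cosine similarity of about 0.989, so only very close items are dropped. The right threshold depends on the data and on how similar an item has to be before it should be removed.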
