Finding similar documents is a whole field of research, so it's hard to give a brief summary. As a starting point, you could use one of the spaCy pipelines that ships with word vectors together with a nearest-neighbor search library like Annoy; it should be much faster than a nested for loop. You would have to decide for yourself how similar an item has to be before it gets removed.
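To make that a bit more concrete, here is a rough sketch of how the two pieces could fit together. It is only one possible approach, not an official recipe: the function names (`cosine_to_angular`, `drop_near_duplicates`, `dedupe_texts`), the `en_core_web_md` model, the 0.95 similarity cutoff, and the `k`/`n_trees` values are all assumptions for illustration. It relies on the fact that Annoy's `"angular"` distance equals `sqrt(2 * (1 - cosine similarity))`, and it keeps an item only if none of its nearest neighbors has already been kept.

```python
import math


def cosine_to_angular(similarity):
    # Annoy's "angular" distance equals sqrt(2 * (1 - cosine_similarity)).
    return math.sqrt(2.0 * (1.0 - similarity))


def drop_near_duplicates(n_items, neighbours_of, max_distance):
    """Greedy pass: keep an item unless it is close to one already kept.

    neighbours_of(i) must return (indices, distances) for item i.
    """
    kept = []
    kept_set = set()
    for i in range(n_items):
        ids, dists = neighbours_of(i)
        if any(j in kept_set and d <= max_distance for j, d in zip(ids, dists)):
            continue  # too close to something we already kept
        kept.append(i)
        kept_set.add(i)
    return kept


def dedupe_texts(texts, model="en_core_web_md", min_similarity=0.95, k=10, n_trees=10):
    # Imported here so the helpers above work without these packages installed.
    import spacy
    from annoy import AnnoyIndex

    nlp = spacy.load(model)  # assumed model; it must ship with word vectors
    docs = list(nlp.pipe(texts))

    index = AnnoyIndex(nlp.vocab.vectors_length, "angular")
    for i, doc in enumerate(docs):
        index.add_item(i, doc.vector)  # doc.vector averages the token vectors
    index.build(n_trees)

    def neighbours_of(i):
        # Approximate k nearest neighbours of item i, with distances.
        return index.get_nns_by_item(i, k, include_distances=True)

    keep = drop_near_duplicates(len(texts), neighbours_of, cosine_to_angular(min_similarity))
    return [texts[i] for i in keep]
```

Calling `dedupe_texts(list_of_strings)` would return the strings that survive the filter. Annoy's search is approximate and only looks at `k` neighbors per item, so a very large cluster of near-identical texts may need a larger `k`, and the cutoff itself will need tuning on your own data.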

Answer selected by info2000
Labels
feat / vectors Feature: Word vectors and similarity