Open
Description
The current script builds clusters of duplicates, but there are cases where it might yield unwanted results:
When doc B is clustered under doc A's name, another doc C can also be clustered under B's name (A ~ B, B ~ C, but C !~ A). As a result, when we delete the non-"extreme" docs from each cluster, both A and B can end up being kept in the results.
A better way to delete duplicates is to find communities within each connected component of the duplicate graph. This approach is used in https://github.com/src-d/gemini.
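A minimal sketch of the first half of that idea, assuming the duplicate pairs are available as a list of `(doc, doc)` tuples (the `pairs` data and the `connected_components` helper are hypothetical, not part of the current script). It groups docs into connected components with union-find, so the A ~ B, B ~ C chain ends up in a single group. Community detection inside each component is not shown here; it would run as a second step on each returned group.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group docs linked by duplicate pairs, using union-find."""
    parent = {}

    def find(x):
        # Register unseen docs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    return list(groups.values())


# Hypothetical input: A ~ B and B ~ C, but A and C were never matched directly.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(pairs)
```

With this grouping, A, B and C land in one component regardless of which name each pair was originally clustered under, so a per-component (or per-community) deletion step sees all three together.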