[Near Deduplication] Post processing #9

The current script builds clusters of duplicates, but in some cases it can yield unwanted results:

When doc B is clustered under doc A's name, another doc C can also be clustered under B's name (A~B, B~C, C!~A). As a result, when we delete the non-"extreme" docs from each cluster, we could end up keeping both A and B in the results.

A better way to delete duplicates is to find communities within each connected component. This approach is used in https://github.com/src-d/gemini.
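
As a rough illustration of the first half of that idea, here is a minimal sketch of grouping documents into connected components with union-find. It assumes the near-duplicate matches are available as a list of (doc, doc) pairs; the function name `connected_components` and the sample pairs are hypothetical and are not taken from the current script or from gemini. Community detection inside each component is not shown.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group documents linked by near-duplicate pairs (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen docs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    return list(groups.values())


# Assumed input: A~B and B~C are matches, C!~A is not; D~E is unrelated.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(pairs)
```

Here A, B, and C land in the same component even though C!~A, so A and B can no longer both survive as separate cluster heads. Whether C should be kept alongside A is the question that community detection within the component would then answer.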
