[Near Deduplication] Post processing #9

@ChenghaoMou

Description

The current script builds clusters of duplicates, but there are cases where it might yield unwanted results:

When doc B is clustered under doc A's name, another doc C can in turn be clustered under B's name (A~B, B~C, but C!~A). So when we delete the non-"extreme" docs from each cluster, we could end up keeping both A and B in the results, even though they are near-duplicates of each other.

A better way to delete duplicates is to find communities within each connected component. This approach is used in https://github.com/src-d/gemini.
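As a rough illustration of the first half of that idea, here is a minimal sketch that groups docs into connected components with a union-find and keeps one doc per component. It is not the current script and not gemini's implementation: `connected_components`, `docs_to_remove`, the `keep` rule, and the `pairs` input format (a list of near-duplicate doc-id pairs) are all hypothetical names chosen for the example.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group doc ids into connected components with a union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def docs_to_remove(pairs, keep=min):
    """Keep one doc per component (chosen by `keep`, an assumed rule)."""
    removed = set()
    for component in connected_components(pairs):
        removed |= component - {keep(component)}
    return removed


# Hypothetical near-duplicate pairs: A~B and B~C, but no direct A~C edge.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(pairs)
removed = docs_to_remove(pairs)
```

With these pairs, A, B and C land in one component, so only one of them survives and the "both A and B kept" case cannot happen. The trade-off is that C is dropped even though C!~A; running community detection inside each component, as suggested above, would be the step that splits such loosely linked chains instead of collapsing them.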
