Skip to content

jaccard() on hash vectors is inaccurate #37

@inkybutton

Description

@inkybutton

Describe the bug
Hey there, thank you for creating this library! I like the work you've put into the documentation. I'm a newbie when it comes to Julia and LSH so apologies if I get things wrong.
I tried running jaccard() on hash vectors after using MinHash and the default base.hash hashing method, and received puzzling results, sometimes with similarity value above 1, which should be impossible. I think this is due to summing large hash values causing an overflow.
This is probably not the intended use of the function - collision_probability seems like the right function. I got drawn to do this because explanations of Minhash I've seen say that the Jaccard index for hashes should approximate the value for the original shingles. I suspect other learners may be led to do this too.

To Reproduce

hashfn = LSHFunctions.MinHash(15)
A = Set(["ab", "bc", "cd"]);
B = Set(["xy", "yz", "za"])
hashes_A = hashfn(A)
hashes_B = hashfn(B)
jaccard(hashes_A, hashes_B)

This results in similarity >0, sometimes >1.

Expected behavior
Because there is no overlap between the sets, the Jaccard index for their minhashes should always be 0. A naïve implementation that does not use arithmetic operations seems to get it right, e.g.:

function jaccard(x::AbstractVector, y::AbstractVector)::Float64
    length(A ∩ B) / length(A ∪ B)
end

Or perhaps something that might clarify which function to use!

Screenshots
image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions