-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Describe the bug
Hey there, thank you for creating this library! I like the work you've put into the documentation. I'm a newbie when it comes to Julia and LSH so apologies if I get things wrong.
I tried running jaccard() on hash vectors after using MinHash and the default base.hash hashing method, and received puzzling results, sometimes with similarity value above 1, which should be impossible. I think this is due to summing large hash values causing an overflow.
This is probably not the intended use of the function - collision_probability
seems like the right function. I got drawn to do this because explanations of Minhash I've seen say that the Jaccard index for hashes should approximate the value for the original shingles. I suspect other learners may be led to do this too.
To Reproduce
hashfn = LSHFunctions.MinHash(15)
A = Set(["ab", "bc", "cd"]);
B = Set(["xy", "yz", "za"])
hashes_A = hashfn(A)
hashes_B = hashfn(B)
jaccard(hashes_A, hashes_B)
This results in similarity >0, sometimes >1.
Expected behavior
Because there is no overlap between the sets, the Jaccard index for their minhashes should always be 0. A naïve implementation that does not use arithmetic operations seems to get it right, e.g.:
function jaccard(x::AbstractVector, y::AbstractVector)::Float64
length(A ∩ B) / length(A ∪ B)
end
Or perhaps something that might clarify which function to use!