Hello
Here are two texts I would like to check for near-duplicates using the SimHash algorithm (jiebaR package):
library(jiebaR)
coder <- "Simhash detects near duplicates and not exact duplicates"
codel <- "SimHash is a technique for quickly detect near duplicates"
I created a worker called "simhasher":
simhasher = worker("simhash", topn = 5)
simhasher <= codel
Then I computed the distance:
distance(codel, coder, simhasher)
Here is the result:
$distance
[1] 22
$lhs
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
$rhs
23.4784 11.7392 11.7392 11.7392
"duplicates" "Simhash" "detects" "exact"
I need your help with three things:
- The distance is 22. The larger the distance, the more the two texts differ. Here the texts seem REALLY close, so I was expecting the distance to be smaller. Can you please explain this result?
- What are the numbers above the words in lhs and rhs (e.g. 11.7392, 23.4784)?
- I also checked the worker I created:
simhasher <= codel
And here is the result I discovered:
$simhash
[1] "12382334418040220206"
$keyword
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
What is the simhash here, and why do I need to create it before running the distance function? This part is not really clear to me and is not really explained in the package documentation.
Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.
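For reference, here is how I currently picture the distance. This is an assumption on my part: I am guessing that it is a Hamming distance between two 64-bit fingerprints like the $simhash value above, which would make it range from 0 to 64. The sketch below is plain base R; dec_to_bits and hamming are hypothetical helpers I wrote for illustration, not jiebaR functions, and the inputs are toy values rather than real fingerprints.

```r
# Assumption: a fingerprint is an unsigned 64-bit integer printed as a decimal
# string. Such values do not fit exactly in an R double, so the conversion to
# bits is done by long division on the decimal digits.

# Hypothetical helper: decimal string -> vector of bits, least significant first.
dec_to_bits <- function(dec, n_bits = 64) {
  digits <- as.integer(strsplit(dec, "")[[1]])
  bits <- integer(n_bits)
  for (i in seq_len(n_bits)) {
    # Divide the whole digit vector by 2; the final remainder is the next bit.
    carry <- 0L
    for (j in seq_along(digits)) {
      cur <- carry * 10L + digits[j]
      digits[j] <- cur %/% 2L
      carry <- cur %% 2L
    }
    bits[i] <- carry
  }
  bits
}

# Hypothetical helper: number of bit positions where two fingerprints differ.
hamming <- function(a, b) {
  sum(dec_to_bits(a) != dec_to_bits(b))
}

# Toy values, not real SimHash output.
print(hamming("5", "6"))                     # 101 vs 110 differ in 2 bits
print(hamming("0", "18446744073709551615"))  # all 64 bits differ
```

If that picture is right, 22 would mean that 22 of the 64 bits differ between the two fingerprints, but I would appreciate confirmation.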