Using jiebaR package (SimHash algorithm)

Hello

Here are two texts that I would like to check for near-duplication using the SimHash algorithm (jiebaR package):

     library(jiebaR)
     coder <- "Simhash detects near duplicates and not exact duplicates"
     codel <- "SimHash is a technique for quickly detect near duplicates"

I created a worker called "simhasher":

     simhasher = worker("simhash", topn = 5)
     simhasher <= codel

Then I computed the distance:

     distance(codel, coder, simhasher)

Here is the result:

     $distance
     [1] 22

     $lhs
     11.7392      11.7392      11.7392      11.7392      11.7392 
     "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

     $rhs
     23.4784      11.7392      11.7392      11.7392 
     "duplicates"    "Simhash"    "detects"      "exact" 

I need your help with 3 things:

1. The distance is 22. The larger the distance, the more the two texts differ. These texts seem really close, so I was expecting the distance to be smaller. Could you please explain this result?

2. What are the numbers above the words in lhs and rhs (e.g. 11.7392, 23.4784)?

3. I also inspected the worker I created:

     simhasher <= codel

And here is the result I got:

     $simhash
     [1] "12382334418040220206"

     $keyword
     11.7392      11.7392      11.7392      11.7392      11.7392 
     "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

What is the simhash here, and why do I need to create it before running the distance function? This part is not really clear to me, and it is not really explained in the package documentation.
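To make the questions above concrete, here is how I currently picture SimHash, written as a toy sketch in base R. It is only my understanding of the general technique (each keyword is hashed to a bit pattern, every bit gets a weighted vote, and the distance is the number of differing bits between two fingerprints), not jiebaR's actual code. The names toy_hash_bits, toy_simhash and hamming are made up for this illustration, and the hash is a simple stand-in, so the numbers it produces will not match the package output.

```r
# Toy SimHash sketch in base R. NOT jiebaR's implementation: the hash
# function and all helper names are hypothetical, for illustration only.

# Stand-in hash: turn a word into a reproducible vector of nbits 0/1 values.
toy_hash_bits <- function(word, nbits = 64L) {
  state <- 7
  for (code in utf8ToInt(word)) {
    state <- (state * 31 + code) %% 2147483647
  }
  bits <- integer(nbits)
  for (i in seq_len(nbits)) {
    state <- (state * 48271) %% 2147483647   # simple multiplicative step
    bits[i] <- as.integer((state %/% 65536) %% 2)
  }
  bits
}

# Each keyword votes +weight (bit is 1) or -weight (bit is 0) on every
# position; the sign of the total decides the fingerprint bit.
toy_simhash <- function(keywords, weights, nbits = 64L) {
  totals <- numeric(nbits)
  for (k in seq_along(keywords)) {
    bits <- toy_hash_bits(keywords[k], nbits)
    totals <- totals + weights[k] * (2 * bits - 1)
  }
  as.integer(totals > 0)
}

# Hamming distance: number of positions where two fingerprints differ.
hamming <- function(a, b) sum(a != b)

# Keywords and weights copied from the $lhs and $rhs output above.
lhs_words   <- c("duplicates", "technique", "SimHash", "detect", "quickly")
lhs_weights <- rep(11.7392, 5)
rhs_words   <- c("duplicates", "Simhash", "detects", "exact")
rhs_weights <- c(23.4784, 11.7392, 11.7392, 11.7392)

lhs_bits <- toy_simhash(lhs_words, lhs_weights)
rhs_bits <- toy_simhash(rhs_words, rhs_weights)
toy_distance <- hamming(lhs_bits, rhs_bits)
cat("toy distance:", toy_distance, "out of", length(lhs_bits), "bits\n")
```

If this picture is right, the two keyword lists share only one exact string ("duplicates"), because "SimHash" and "Simhash", and "detect" and "detects", hash as different words. That might explain a larger distance than I expected, but I would appreciate confirmation.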

Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.

