Using jiebaR package (SimHash algorithm) #66

@remibacha

Hello

Here are two texts that I would like to check for near-duplication using the SimHash algorithm (jiebaR package):

 library(jiebaR)
 coder <- "Simhash detects near duplicates and not exact duplicates"
 codel <- "SimHash is a technique for quickly detect near duplicates"

I created a worker called "simhasher":

 simhasher = worker("simhash", topn = 5)
 simhasher <= codel

Then I computed the distance:

 distance(codel, coder, simhasher)

Here is the result:

 $distance
 [1] 22

 $lhs
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

 $rhs
 23.4784      11.7392      11.7392      11.7392 
 "duplicates"    "Simhash"    "detects"      "exact" 
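
A note on the scores printed above each keyword: in this output, 23.4784 is exactly twice 11.7392, and "duplicates" occurs twice in `coder`. That is consistent with a score equal to a term count times a constant per-word weight, although the package documentation should be checked to confirm it. The base R sketch below reproduces only that arithmetic. The whitespace tokeniser and the name `base_weight` are assumptions made for illustration; they are not part of jiebaR, and the words that jiebaR leaves out of its output ("near", "and", "not") are not modelled.

```r
coder <- "Simhash detects near duplicates and not exact duplicates"

# Naive whitespace tokenisation (assumption: jiebaR's own segmenter is not reproduced here)
tokens <- strsplit(coder, " ", fixed = TRUE)[[1]]

# Count how often each token occurs
counts <- table(tokens)

# Value copied from the output above; assumed to be a constant per-word weight
base_weight <- 11.7392

# Hypothetical score: term count times the constant weight
scores <- as.numeric(counts) * base_weight
names(scores) <- names(counts)

scores[["duplicates"]]  # 23.4784 (occurs twice)
scores[["exact"]]       # 11.7392 (occurs once)
```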

I need your help with three things:

  1. The distance is 22. The larger the distance, the more the two texts differ. These texts seem REALLY close, so I expected the distance to be smaller. Can you please explain this result?

  2. What are the figures above the words in lhs and rhs (e.g. 11.7392, 23.4784)?

  3. I also checked the worker I created:

    simhasher <= codel

Here is the result:

 $simhash
 [1] "12382334418040220206"

 $keyword
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

What is the simhash here, and why do I need to create it before running the distance function? This part is not clear to me and is not really explained in the package documentation.
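
For context, a SimHash fingerprint is generally built by letting each weighted keyword vote on every bit of a fixed-width hash, and the distance between two texts is then the number of bits in which their fingerprints differ (a Hamming distance). The base R sketch below shows that general technique on toy data. It assumes 8-bit token hashes with made-up values (real fingerprints are typically 64 bits), and `token_bits` and `fingerprint` are hypothetical names, not jiebaR functions; whether jiebaR follows exactly these steps is not confirmed here.

```r
# Hypothetical 8-bit hashes for four tokens (made-up values, for illustration only)
token_bits <- rbind(
  duplicates = c(1, 0, 1, 1, 0, 0, 1, 0),
  simhash    = c(0, 1, 1, 0, 1, 0, 1, 1),
  exact      = c(1, 1, 0, 0, 0, 1, 0, 1),
  technique  = c(0, 1, 1, 0, 0, 1, 1, 0)
)

# Build a fingerprint from a named vector of keyword weights
fingerprint <- function(weights) {
  bits <- token_bits[names(weights), , drop = FALSE]
  # Map bits 1/0 to votes +1/-1, scale each token's row by its weight, sum per bit
  votes <- colSums((2 * bits - 1) * weights)
  # A bit is set when its weighted vote is positive
  as.integer(votes > 0)
}

fp_a <- fingerprint(c(duplicates = 2, simhash = 1, exact = 1))
fp_b <- fingerprint(c(duplicates = 1, simhash = 1, technique = 1))

# Hamming distance: number of bit positions where the fingerprints differ
hamming <- sum(fp_a != fp_b)
hamming  # 2
```

With only a handful of keywords per text, swapping one or two of them can flip many bits, so short texts that look similar to a reader may still end up with a fairly large distance.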

Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.
