Commit 656d3cf

Merge pull request #2 from dapper91/dev
- shingle hash implemented.
- README added.
- github workflows added.
2 parents 391c63a + 4e84067 commit 656d3cf

9 files changed: 450 additions, 0 deletions
.github/workflows/release.yml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
name: release

on:
  release:
    types:
      - released

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install Rust
        run: rustup update stable
      - name: Build and publish
        run: |
          cargo login ${{ secrets.CRATESIO_TOKEN }}
          cargo publish

.github/workflows/test.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
name: test

on:
  pull_request:
    branches:
      - dev
      - master
  push:
    branches:
      - master

jobs:
  test:
    name: Test crate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install Rust
        run: rustup update stable
      - name: Run tests
        run: cargo test

Cargo.toml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
[package]
name = "schindel"
version = "0.1.0"
edition = "2021"

license = "Unlicense"
description = "rust min-shingle hashing"
readme = "README.md"

homepage = "https://github.com/dapper91/schindel"
documentation = "https://docs.rs/schindel/"
repository = "https://github.com/dapper91/schindel"

categories = ["algorithms", "text-processing"]
keywords = ["shingles", "minshingle", "ngrams", "fuzzy", "hashing"]

[dependencies]
murmurhash3 = "0.0.5"

README.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
[![Crates.io][crates-badge]][crates-url]
[![License][licence-badge]][licence-url]
[![Test Status][test-badge]][test-url]
[![Documentation][doc-badge]][doc-url]

[crates-badge]: https://img.shields.io/crates/v/schindel.svg
[crates-url]: https://crates.io/crates/schindel
[licence-badge]: https://img.shields.io/badge/license-Unlicense-blue.svg
[licence-url]: https://github.com/dapper91/schindel/blob/master/LICENSE
[test-badge]: https://github.com/dapper91/schindel/actions/workflows/test.yml/badge.svg?branch=master
[test-url]: https://github.com/dapper91/schindel/actions/workflows/test.yml
[doc-badge]: https://docs.rs/schindel/badge.svg
[doc-url]: https://docs.rs/schindel


# Rust min-shingle hashing implementation

This crate implements a simple min-shingle hashing algorithm.
For more information see [W-shingling](https://en.wikipedia.org/wiki/W-shingling).


# Algorithm

A shingle hash (or w-shingling) is the set of n-grams of an input sequence, where each n-gram consists of
n contiguous tokens and consecutive n-grams are shifted by one element. For example, the document:

`to be or not to be that is the question`

has the following set of 2-grams (shingles):

`(to, be)`, `(be, or)`, `(or, not)`, `(not, to)`, `(be, that)`, `(that, is)`, `(is, the)`, `(the, question)`

*note*: the 2-gram `(to, be)` occurs twice in the document but appears only once in the set.

The 2-gram set is the document's shingle hash.
That hash can be used to measure the resemblance of two documents using the Jaccard coefficient:

`R(doc1, doc2) = |H(doc1) ⋂ H(doc2)| / |H(doc1) ⋃ H(doc2)|`

where:
- `R` - resemblance
- `H` - shingle hash
- `|S|` - number of elements in the set `S`
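
As a minimal sketch (not part of the crate), the shingle set and the Jaccard resemblance described above could be computed as follows. The helper names `bigrams` and `resemblance` are hypothetical, and treating two empty sets as identical is an assumption made only for this example.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Collects the set of word 2-grams (shingles) of a document.
/// Hypothetical helper, not part of the `schindel` API.
fn bigrams(doc: &str) -> HashSet<(&str, &str)> {
    let words: Vec<&str> = doc.split_whitespace().collect();
    // `windows(2)` yields every pair of adjacent words; the set drops repeats.
    words.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Jaccard coefficient: size of the intersection divided by size of the union.
fn resemblance<T: Eq + Hash>(a: &HashSet<T>, b: &HashSet<T>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        // Assumption: two empty documents are treated as identical.
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

fn main() {
    let a = bigrams("to be or not to be that is the question");
    let b = bigrams("to be or not to be that was the question");
    println!("distinct shingles in the first document: {}", a.len());
    println!("resemblance: {:.2}", resemblance(&a, &b));
}
```

The two documents differ in one word, which changes two of the eight distinct shingles, so the sets share 6 shingles out of 10 in the union.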

This approach does not scale to large documents because the n-gram set can grow very fast.
For example, if 3-grams are used and the input alphabet has 255 symbols, the set can contain up to
`255 ^ 3` or `~16.6 * 10 ^ 6` elements in the worst case, which consumes a lot of memory.

To solve that problem the min-shingle algorithm is used. It relies on the following optimisation technique:
instead of storing all n-grams of the sequence, a hash of each n-gram is calculated and only the minimal value is kept.
Because the minimal value of a data stream can be calculated on the fly (without saving all the values),
memory consumption is drastically reduced. Repeating that process with several hash functions
(or several seeds of one hash function) produces the min-shingle hash.
Like the shingle hash, the min-shingle hash can be used to measure the distance (or resemblance) between documents.
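
The min-shingle idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: it uses `DefaultHasher` instead of MurmurHash3, derives the hash functions from integer seeds, and the function names `seeded_hash`, `min_shingle_hash` and `compare` are chosen for this example only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes one n-gram with a seed (the seed is mixed in by hashing it first).
fn seeded_hash(seed: u64, ngram: &[char]) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    ngram.hash(&mut hasher);
    hasher.finish()
}

/// For each of `hash_len` seeds, keeps only the minimal n-gram hash.
fn min_shingle_hash(text: &str, ngram_len: usize, hash_len: usize) -> Vec<u64> {
    assert!(ngram_len > 0, "n-gram length must be positive");
    // The characters are buffered here for simplicity; the n-gram hashes
    // themselves are never stored, only the running minimum per seed.
    let chars: Vec<char> = text.chars().collect();
    (0..hash_len as u64)
        .map(|seed| {
            chars
                .windows(ngram_len)
                .map(|ngram| seeded_hash(seed, ngram))
                .min()
                .unwrap_or(u64::MAX) // input shorter than one n-gram
        })
        .collect()
}

/// Fraction of positions at which two equally long signatures agree.
fn compare(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have the same length");
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len().max(1) as f64
}

fn main() {
    let a = min_shingle_hash("to be or not to be that is the question", 5, 100);
    let b = min_shingle_hash("to be or not to be that was the question", 5, 100);
    println!("similarity: {:.2}", compare(&a, &b));
}
```

Each signature has a fixed length (`hash_len` values) regardless of document size, which is the memory saving the paragraph above describes.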

# Basic example

Add the `schindel` dependency to `Cargo.toml`:

```toml
[dependencies]
schindel = "^0.1.0"
```

Add the following code to your `main.rs`:

```rust
use schindel::shingles::{MinShingleHash, Murmur3Hasher};

fn main() {
    let original = "\
        “My sight is failing,” she said finally. “Even when I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be, \
        Benjamin?” For once Benjamin consented to break his rule, and he read out to her what was written on the wall. \
        There was nothing there now except a single Commandment. It ran: \
        ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let plagiarism = "\
        “My sight is failing,” she said finally. “When I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be” \
        Benjamin read out to her what was written. There was nothing there now except a single Commandment. \
        It ran: ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let other = "\
        Throughout the spring and summer they worked a sixty-hour week, and in August Napoleon announced that there \
        would be work on Sunday afternoons as well. This work was strictly voluntary, but any animal who absented \
        himself from it would have his rations reduced by half. Even so, it was found necessary to leave certain \
        tasks undone. The harvest was a little less successful than in the previous year, and two fields which \
        should have been sown with roots in the early summer were not sown because the ploughing had not been \
        completed early enough. It was possible to foresee that the coming winter would be a hard one.";

    const HASH_LEN: usize = 100;
    const NGRAM_LEN: usize = 5;

    let original_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(original.chars());

    let plagiarism_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(plagiarism.chars());
    println!("plagiarism similarity: {}", original_hash.compare(&plagiarism_hash));

    let other_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(other.chars());
    println!("other text similarity: {}", original_hash.compare(&other_hash));
}
```

examples/quickstart.rs

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
use schindel::shingles::{MinShingleHash, Murmur3Hasher};

fn main() {
    let original = "\
        “My sight is failing,” she said finally. “Even when I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be, \
        Benjamin?” For once Benjamin consented to break his rule, and he read out to her what was written on the wall. \
        There was nothing there now except a single Commandment. It ran: \
        ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let plagiarism = "\
        “My sight is failing,” she said finally. “When I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be” \
        Benjamin read out to her what was written. There was nothing there now except a single Commandment. \
        It ran: ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let other = "\
        Throughout the spring and summer they worked a sixty-hour week, and in August Napoleon announced that there \
        would be work on Sunday afternoons as well. This work was strictly voluntary, but any animal who absented \
        himself from it would have his rations reduced by half. Even so, it was found necessary to leave certain \
        tasks undone. The harvest was a little less successful than in the previous year, and two fields which \
        should have been sown with roots in the early summer were not sown because the ploughing had not been \
        completed early enough. It was possible to foresee that the coming winter would be a hard one.";

    const HASH_LEN: usize = 100;
    const NGRAM_LEN: usize = 5;

    let original_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(original.chars());

    let plagiarism_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(plagiarism.chars());
    println!("plagiarism similarity: {}", original_hash.compare(&plagiarism_hash));

    let other_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(other.chars());
    println!("other text similarity: {}", original_hash.compare(&other_hash));
}

rustfmt.toml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
edition = "2018"
max_width = 120

src/hasher.rs

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
//! Shingle hasher implementation.

use std::hash::Hasher;

use murmurhash3::murmurhash3_x86_32;

/// A trait for hasher builder with custom seed value.
pub trait SeedHasher {
    type HasherType: Hasher;

    /// Creates a hasher with provided seed value.
    fn with_seed(seed: u32) -> Self::HasherType;
}

/// MurmurHash3 (x86, 32-bit) hasher that buffers written bytes until `finish` is called.
pub struct Murmur3Hasher {
    seed: u32,
    bytes: Vec<u8>,
}

impl Hasher for Murmur3Hasher {
    /// Hashes all bytes written so far with the stored seed; the 32-bit result is widened to `u64`.
    fn finish(&self) -> u64 {
        murmurhash3_x86_32(&self.bytes, self.seed) as u64
    }

    /// Appends `bytes` to the internal buffer.
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend(bytes);
    }
}

impl SeedHasher for Murmur3Hasher {
    type HasherType = Murmur3Hasher;

    fn with_seed(seed: u32) -> Self::HasherType {
        Murmur3Hasher { seed, bytes: vec![] }
    }
}
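
The seeded-hasher pattern in `src/hasher.rs` can be tried out without the `murmurhash3` dependency. The sketch below copies the shape of the `SeedHasher` trait so that it compiles on its own and plugs in a stand-in FNV-1a hasher. `Fnv1aHasher` and `hash_with` are hypothetical names that do not exist in the crate, and XOR-ing the seed into the FNV offset basis is an assumption made only for this illustration.

```rust
use std::hash::Hasher;

/// Same shape as the `SeedHasher` trait in the diff, repeated so the sketch is self-contained.
trait SeedHasher {
    type HasherType: Hasher;

    fn with_seed(seed: u32) -> Self::HasherType;
}

/// Hypothetical stand-in hasher (64-bit FNV-1a); not part of the crate.
struct Fnv1aHasher {
    state: u64,
}

impl Hasher for Fnv1aHasher {
    fn finish(&self) -> u64 {
        self.state
    }

    /// Unlike the buffering hasher in the diff, this one folds bytes into the state as they arrive.
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.state ^= u64::from(byte);
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

impl SeedHasher for Fnv1aHasher {
    type HasherType = Fnv1aHasher;

    fn with_seed(seed: u32) -> Self::HasherType {
        // Assumption: the seed perturbs the standard FNV offset basis.
        Fnv1aHasher { state: 0xcbf2_9ce4_8422_2325 ^ u64::from(seed) }
    }
}

/// Hashes `data` with any seedable hasher type.
fn hash_with<H: SeedHasher>(seed: u32, data: &[u8]) -> u64 {
    let mut hasher = H::with_seed(seed);
    hasher.write(data);
    hasher.finish()
}

fn main() {
    let a = hash_with::<Fnv1aHasher>(0, b"shingle");
    let b = hash_with::<Fnv1aHasher>(1, b"shingle");
    println!("seed 0: {a:016x}, seed 1: {b:016x}");
}
```

Because the hasher type is a generic parameter, code that computes a min-shingle signature can call `with_seed` once per seed and obtain a family of hash functions from a single algorithm.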

src/lib.rs

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
//! Rust min-shingle hashing implementation.
//! This crate implements a simple min-shingle hashing algorithm.
//! For more information see [W-shingling](https://en.wikipedia.org/wiki/W-shingling).
//!
//! # Algorithm
//!
//! A shingle hash (or w-shingling) is the set of n-grams of an input sequence, where each n-gram consists of
//! n contiguous tokens and consecutive n-grams are shifted by one element. For example, the document:
//!
//! `to be or not to be that is the question`
//!
//! has the following set of 2-grams (shingles):
//!
//! `(to, be)`, `(be, or)`, `(or, not)`, `(not, to)`, `(be, that)`, `(that, is)`, `(is, the)`, `(the, question)`
//!
//! *note*: the 2-gram `(to, be)` occurs twice in the document but appears only once in the set.
//!
//! The 2-gram set is the document's shingle hash.
//! That hash can be used to measure the resemblance of two documents using the Jaccard coefficient:
//!
//! `R(doc1, doc2) = |H(doc1) ⋂ H(doc2)| / |H(doc1) ⋃ H(doc2)|`
//!
//! where:
//! - `R` - resemblance
//! - `H` - shingle hash
//! - `|S|` - number of elements in the set `S`
//!
//! This approach does not scale to large documents because the n-gram set can grow very fast.
//! For example, if 3-grams are used and the input alphabet has 255 symbols, the set can contain up to
//! `255 ^ 3` or `~16.6 * 10 ^ 6` elements in the worst case, which consumes a lot of memory.
//!
//! To solve that problem the min-shingle algorithm is used. It relies on the following optimisation technique:
//! instead of storing all n-grams of the sequence, a hash of each n-gram is calculated and only the minimal
//! value is kept. Because the minimal value of a data stream can be calculated on the fly (without saving
//! all the values), memory consumption is drastically reduced. Repeating that process with several hash functions
//! (or several seeds of one hash function) produces the min-shingle hash.
//! Like the shingle hash, the min-shingle hash can be used to measure the distance (or resemblance) between documents.
//!
//! # Examples:
//!
//! ```
//! use schindel::shingles::{MinShingleHash, Murmur3Hasher};
//!
//! fn main() {
//!     let original = "\
//!         “My sight is failing,” she said finally. “Even when I was young I could not have read what was written there. \
//!         But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be, \
//!         Benjamin?” For once Benjamin consented to break his rule, and he read out to her what was written on the wall. \
//!         There was nothing there now except a single Commandment. It ran: \
//!         ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";
//!
//!     let plagiarism = "\
//!         “My sight is failing,” she said finally. “When I was young I could not have read what was written there. \
//!         But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be” \
//!         Benjamin read out to her what was written. There was nothing there now except a single Commandment. \
//!         It ran: ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";
//!
//!     let other = "\
//!         Throughout the spring and summer they worked a sixty-hour week, and in August Napoleon announced that there \
//!         would be work on Sunday afternoons as well. This work was strictly voluntary, but any animal who absented \
//!         himself from it would have his rations reduced by half. Even so, it was found necessary to leave certain \
//!         tasks undone. The harvest was a little less successful than in the previous year, and two fields which \
//!         should have been sown with roots in the early summer were not sown because the ploughing had not been \
//!         completed early enough. It was possible to foresee that the coming winter would be a hard one.";
//!
//!     const HASH_LEN: usize = 100;
//!     const NGRAM_LEN: usize = 5;
//!
//!     let original_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(original.chars());
//!
//!     let plagiarism_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(plagiarism.chars());
//!     println!("plagiarism similarity: {}", original_hash.compare(&plagiarism_hash));
//!
//!     let other_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(other.chars());
//!     println!("other text similarity: {}", original_hash.compare(&other_hash));
//! }
//! ```

pub mod hasher;
pub mod shingles;
