[Near Deduplication] Benchmark 

Provide benchmark results on a large dataset for different near-deduplication methods:

1. MinHash + LSH
2. SimHash
3. any other relevant methods
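
For reference, the first method can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation to be benchmarked: the names `shingles`, `minhash`, `candidate_pairs`, and `estimated_jaccard`, as well as the values of `NUM_PERM`, `BANDS`, and `NGRAM`, are assumptions chosen for the example, and a real run would likely use an optimized library instead.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # assumed signature length (number of hash functions)
BANDS = 16     # assumed number of LSH bands, i.e. 4 rows per band
NGRAM = 3      # assumed word n-gram (shingle) size


def shingles(text, n=NGRAM):
    """Tokenize on whitespace (lowercased) and return the set of word n-grams."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value per seeded hash function."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_perm))


def candidate_pairs(signatures, bands=BANDS):
    """LSH banding: documents sharing any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Candidate pairs returned by the banding step would normally be filtered with `estimated_jaccard` against a chosen threshold before being reported as near duplicates.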

Details to be included:
- tokenization method
- method parameters
- hardware
- memory usage
- time
- deduplication results, with examples of detected duplicates
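
One possible way to collect these details in a uniform record per method is sketched below. It is an assumption-laden illustration: `run_benchmark` and `exact_dedup` are hypothetical names, `exact_dedup` is only a placeholder standing in for a real near-deduplication method, and `tracemalloc` reports memory allocated by Python only, so native-library usage would need a different measurement.

```python
import os
import platform
import time
import tracemalloc


def run_benchmark(name, dedup_fn, docs, **params):
    """Run one method and record parameters, hardware, peak memory, time, results."""
    tracemalloc.start()
    start = time.perf_counter()
    duplicates = dedup_fn(docs, **params)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return {
        "method": name,
        "params": params,  # should also name the tokenization method used
        "hardware": f"{platform.machine()}, {os.cpu_count()} logical CPUs",
        "peak_memory_mib": peak / 2**20,  # Python allocations only
        "time_s": elapsed,
        "num_docs": len(docs),
        "num_duplicates": len(duplicates),
        "examples": list(duplicates)[:3],
    }


def exact_dedup(docs, lowercase=True):
    """Placeholder baseline (hypothetical): pairs of indices with identical text."""
    seen, dupes = {}, []
    for i, doc in enumerate(docs):
        key = doc.lower() if lowercase else doc
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes


report = run_benchmark("exact", exact_dedup, ["A b", "a b", "c"], lowercase=True)
print(report)
```

Calling `run_benchmark` once per method with the same `docs` keeps the reported numbers comparable across methods.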