Provide results on a large dataset using different near-deduplication methods:

1. MinHash + LSH
2. SimHash
3. Any other relevant methods

Details to include:

- tokenization method
- method parameters
- hardware
- memory usage
- time
- duplication results, with examples
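
A possible starting point for the comparison is sketched below using only the Python standard library. Everything in it is an assumption made for illustration rather than a prescribed setup: the word 3-gram tokenization, the BLAKE2b-based `hash64` helper, the parameters (`num_perm=128`, `bands=32`, `threshold=0.7`, `max_distance=3`), and the hypothetical `measure` helper, which reports wall-clock time and the peak memory seen by `tracemalloc` (Python allocations only, so it will under-report native memory). A real benchmark on a large dataset would likely swap in an optimized library and an indexed SimHash lookup instead of the quadratic all-pairs scan used here.

```python
import hashlib
import re
import time
import tracemalloc
from collections import defaultdict
from itertools import combinations


def tokenize(text, n=3):
    """Lowercased word n-gram shingles (an assumed tokenization choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hash64(token, seed=0):
    """Seeded 64-bit hash built on BLAKE2b from the standard library."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_perm=128):
    """Keep the minimum hash per seed; matching slots estimate Jaccard similarity."""
    if not shingles:
        return tuple([0] * num_perm)
    return tuple(min(hash64(s, seed) for s in shingles) for seed in range(num_perm))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_lsh_duplicates(docs, num_perm=128, bands=32, threshold=0.7):
    """Bucket signatures band by band, then verify candidates with exact Jaccard.

    num_perm is assumed to be divisible by bands.
    """
    rows = num_perm // bands
    shingles = {key: tokenize(text) for key, text in docs.items()}
    buckets = defaultdict(set)
    for key, sh in shingles.items():
        sig = minhash_signature(sh, num_perm)
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(key)
    candidates = set()
    for keys in buckets.values():
        candidates.update(combinations(sorted(keys), 2))
    return sorted(p for p in candidates
                  if jaccard(shingles[p[0]], shingles[p[1]]) >= threshold)


def simhash(text, bits=64):
    """Each shingle votes +1/-1 per bit; the sign of each tally gives the fingerprint bit."""
    counts = [0] * bits
    for token in tokenize(text):
        h = hash64(token)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def simhash_duplicates(docs, max_distance=3):
    """Naive all-pairs Hamming comparison; quadratic, so only suitable for a sketch."""
    prints = {key: simhash(text) for key, text in docs.items()}
    return sorted((a, b) for a, b in combinations(sorted(prints), 2)
                  if hamming(prints[a], prints[b]) <= max_distance)


def measure(fn, *args, **kwargs):
    """Return (result, seconds, peak_bytes) for one call of fn."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    sample = {
        "a": "The quick brown fox jumps over the lazy dog near the river bank today",
        "b": "The quick brown fox jumps over the lazy dog near the river bank tonight",
        "c": "Notes on training large language models with deduplicated source code",
    }
    for name, fn in [("minhash+lsh", minhash_lsh_duplicates), ("simhash", simhash_duplicates)]:
        pairs, seconds, peak = measure(fn, sample)
        print(f"{name}: pairs={pairs} time={seconds:.4f}s peak={peak / 1024:.1f} KiB")
```

Reporting each method through the same `measure` call keeps the time and memory numbers comparable across methods; the hardware description and example duplicate pairs still need to be recorded separately alongside the parameters used.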