[Near Deduplication] Tokenization

As we extend deduplication to a wide range of languages, what tokenization method to use will have an impact on the final results.

The current script uses a simple regex and uni-gram to perform minhash calculation. What are the consequences using a different configuration?