Near-deduplication (#7) only operates at the file level. It is also possible for a file to be:
- a substring of another file, while the MinHash/SimHash fingerprints are wildly different
- composed of multiple snippets from different sources
Should we do something about these cases, knowing that they contain large chunks of repeated snippets?
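
To illustrate the substring case, here is a minimal sketch (not what #7 implements) comparing Jaccard similarity, which MinHash approximates, with a containment score over line shingles. The function names `line_shingles`, `jaccard`, and `containment`, and the shingle size of 3 lines, are assumptions made up for this example.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def line_shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of n-line windows in `text` (blank lines skipped)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return set()
    if len(lines) < n:
        # Files shorter than one window become a single shingle.
        return {tuple(lines)}
    return {tuple(lines[i:i + n]) for i in range(len(lines) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """|A & B| / |A | B|: the quantity MinHash estimates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Fraction of a's shingles that also occur in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical data: a 10-line file that is a prefix of a 100-line file.
small = "\n".join(f"line {i}" for i in range(10))
large = "\n".join(f"line {i}" for i in range(100))

s, l = line_shingles(small), line_shingles(large)
print(f"jaccard     = {jaccard(s, l):.3f}")      # low: file-level near-dedup misses it
print(f"containment = {containment(s, l):.3f}")  # high: small is fully inside large
```

The small file is entirely contained in the large one, yet the Jaccard score stays low because the union is dominated by the large file's shingles. A containment-style score (or exact substring matching) would flag this pair, at the cost of being asymmetric and more expensive to compute at scale.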