We extend our gratitude to the authors of this repository! Your documentation and code have greatly benefited the community.
We have used this repo in building the data processing pipeline tool SailCraft.
SailCraft consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning.
Many thanks for your contribution to open research! We welcome developers to try SailCraft.