Description
Is your feature request related to a problem? Please describe.
Currently, duplicate identification (exact & fuzzy) prefers larger batches of data (1-2 GB) to better saturate the GPU for both I/O and downstream operations such as shuffle, while removal might benefit from smaller blocks (256-512 MiB) to reduce memory overhead on CPU cores.
In fuzzy dedup, all stages after minhash use large block sizes, so minhash is the only stage whose performance might suffer from smaller blocks.
In exact dedup, the shuffle stage suffers significantly from smaller blocks.
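To illustrate why the shuffle stage is so sensitive to block size, here is a rough back-of-the-envelope sketch (the 1 TiB dataset size and the `n_in * n_out` split model are assumptions for illustration, not measurements from this project): an all-to-all shuffle produces roughly one intermediate split per (input block, output block) pair, so shrinking the block size grows the split count quadratically.

```python
GIB = 1024**3

def shuffle_splits(dataset_bytes: int, block_bytes: int) -> int:
    """Approximate intermediate split count for an all-to-all shuffle:
    one split per (input block, output block) pair."""
    n_blocks = -(-dataset_bytes // block_bytes)  # ceiling division
    return n_blocks * n_blocks

dataset = 1024 * GIB                             # assumed 1 TiB dataset
large = shuffle_splits(dataset, 2 * GIB)         # 2 GiB blocks
small = shuffle_splits(dataset, 256 * 1024**2)   # 256 MiB blocks
print(large, small, small // large)              # 64x more splits with small blocks
```

Under these assumptions, moving from 2 GiB to 256 MiB blocks multiplies the number of shuffle splits by 64, which is consistent with the shuffle stage being the part that suffers most.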
Describe the solution you'd like
A hybrid approach where exact dedup supports larger batches (a configurable number of input tasks). It would still allow performing I/O on smaller blocks, at some cost in GPU saturation, but collecting multiple blocks and inserting them into the shuffler at once, using the list of input tasks, might help significantly with shuffle performance.
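The hybrid idea above could be sketched roughly as follows. All names here are hypothetical stand-ins, not the project's real APIs: `read_block` represents small-block I/O and `ListShuffler` represents the shuffler; the point is only the shape of the loop, which reads blocks individually but hands them to the shuffler in larger groups.

```python
from typing import Iterable, List

def read_block(path: str) -> str:
    """Stand-in for reading one small (e.g. 256 MiB) block from a task."""
    return f"data:{path}"

class ListShuffler:
    """Toy shuffler that records each batch it receives."""
    def __init__(self) -> None:
        self.inserts: List[List[str]] = []

    def insert_batch(self, blocks: List[str]) -> None:
        self.inserts.append(list(blocks))

def hybrid_insert(task_paths: Iterable[str], batch_size: int,
                  shuffler: ListShuffler) -> None:
    """Read blocks one at a time (small-block I/O), but insert them
    into the shuffler `batch_size` at a time (large shuffle inserts)."""
    batch: List[str] = []
    for path in task_paths:
        batch.append(read_block(path))     # small-block read
        if len(batch) == batch_size:
            shuffler.insert_batch(batch)   # one large insert per N blocks
            batch = []
    if batch:                              # flush any remainder
        shuffler.insert_batch(batch)

sh = ListShuffler()
hybrid_insert([f"part-{i}" for i in range(10)], batch_size=4, shuffler=sh)
print([len(b) for b in sh.inserts])  # 10 blocks arrive as 3 inserts: [4, 4, 2]
```

With 10 input tasks and a batch size of 4, the shuffler sees only 3 inserts instead of 10, which is the mechanism the proposal hopes will recover shuffle performance while keeping per-block memory overhead low.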