Description
Is your feature request related to a problem? Please describe.
Currently, duplicate identification (exact & fuzzy) prefers larger batches of data (1-2 GB) to better saturate the GPU for both I/O and downstream operations such as shuffle, while removal might benefit from smaller blocks (256-512 MiB) to reduce memory overhead on CPU cores.
In fuzzy dedup, all stages after minhash use large block sizes, so minhash is the only stage whose performance might suffer from smaller blocks.
In exact dedup, the shuffle stage suffers significantly from smaller blocks.
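To illustrate why the shuffle stage is so sensitive to block size, here is a rough back-of-the-envelope sketch (the 1 TiB dataset size and the `n_in * n_out` split model are assumptions for illustration, not measurements from this project): an all-to-all shuffle produces roughly one intermediate split per (input block, output block) pair, so shrinking the block size grows the split count quadratically.

```python
GIB = 1024**3

def shuffle_splits(dataset_bytes: int, block_bytes: int) -> int:
    """Approximate intermediate split count for an all-to-all shuffle:
    one split per (input block, output block) pair."""
    n_blocks = -(-dataset_bytes // block_bytes)  # ceiling division
    return n_blocks * n_blocks

dataset = 1024 * GIB                             # assumed 1 TiB dataset
large = shuffle_splits(dataset, 2 * GIB)         # 2 GiB blocks
small = shuffle_splits(dataset, 256 * 1024**2)   # 256 MiB blocks
print(large, small, small // large)              # 64x more splits with small blocks
```

Under these assumptions, moving from 2 GiB to 256 MiB blocks multiplies the number of shuffle splits by 64, which is consistent with the shuffle stage being the part that suffers most.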
Describe the solution you'd like
A hybrid approach where exact dedup supports larger batches (a configurable number of input tasks). It would still allow performing I/O on smaller blocks, at some cost in GPU saturation, but collecting multiple blocks and inserting them into the shuffler at once, using the list of input tasks, might help significantly with shuffle performance.
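The hybrid idea above could be sketched roughly as follows. All names here are hypothetical stand-ins, not the project's real APIs: `read_block` represents small-block I/O and `ListShuffler` represents the shuffler; the point is only the shape of the loop, which reads blocks individually but hands them to the shuffler in larger groups.

```python
from typing import Iterable, List

def read_block(path: str) -> str:
    """Stand-in for reading one small (e.g. 256 MiB) block from a task."""
    return f"data:{path}"

class ListShuffler:
    """Toy shuffler that records each batch it receives."""
    def __init__(self) -> None:
        self.inserts: List[List[str]] = []

    def insert_batch(self, blocks: List[str]) -> None:
        self.inserts.append(list(blocks))

def hybrid_insert(task_paths: Iterable[str], batch_size: int,
                  shuffler: ListShuffler) -> None:
    """Read blocks one at a time (small-block I/O), but insert them
    into the shuffler `batch_size` at a time (large shuffle inserts)."""
    batch: List[str] = []
    for path in task_paths:
        batch.append(read_block(path))     # small-block read
        if len(batch) == batch_size:
            shuffler.insert_batch(batch)   # one large insert per N blocks
            batch = []
    if batch:                              # flush any remainder
        shuffler.insert_batch(batch)

sh = ListShuffler()
hybrid_insert([f"part-{i}" for i in range(10)], batch_size=4, shuffler=sh)
print([len(b) for b in sh.inserts])  # 10 blocks arrive as 3 inserts: [4, 4, 2]
```

With 10 input tasks and a batch size of 4, the shuffler sees only 3 inserts instead of 10, which is the mechanism the proposal hopes will recover shuffle performance while keeping per-block memory overhead low.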