
Support batch sizes for shuffle stages/dedup stages. #1332

@ayushdg

Description


Is your feature request related to a problem? Please describe.
Currently, duplicate identification (exact & fuzzy) prefers larger batches of data (1-2 GB) to better saturate the GPU for both I/O and downstream operations like shuffling, while removal might benefit from smaller blocks (256-512 MiB) to reduce memory overhead on CPU cores.

In fuzzy dedup, all stages after minhash use large block sizes, so minhash is the only stage where performance might suffer from smaller blocks.
In exact dedup, the shuffle stage suffers a lot from smaller blocks.
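One way to read the tension above is that a single block size currently drives stages with opposing preferences. A purely illustrative configuration, with invented parameter names (not an actual NeMo Curator API), might decouple them:

```python
# Hypothetical per-stage block sizes (names invented for illustration):
# identification stages saturate the GPU better with large blocks,
# while removal keeps per-CPU-core memory overhead low with small ones.
dedup_config = {
    "identification_blocksize": "1GiB",   # minhash, LSH, shuffle, ...
    "removal_blocksize": "256MiB",        # duplicate removal
}
```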

Describe the solution you'd like
A hybrid approach where exact dedup supports larger batches (groups of input tasks). I/O could still be performed on smaller blocks, at some cost in GPU saturation, but collecting multiple blocks and inserting them into the shuffler in one call, using the list of input tasks, might help significantly with shuffle performance.
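A minimal sketch of the batching idea, assuming a hypothetical `shuffler` object with `insert`/`extract` methods and a `read_fn` that loads one input task; none of these names come from the codebase:

```python
from typing import Any, Callable, Iterable, List

def batched(tasks: List[Any], batch_size: int) -> Iterable[List[Any]]:
    """Yield consecutive groups of up to `batch_size` input tasks."""
    for i in range(0, len(tasks), batch_size):
        yield tasks[i : i + batch_size]

def shuffle_in_batches(
    tasks: List[Any],
    read_fn: Callable[[Any], Any],
    shuffler: Any,
    batch_size: int = 8,
) -> Any:
    """Read small blocks one task at a time, but feed the shuffler in bulk.

    `shuffler`, its `insert`/`extract` methods, and `read_fn` are
    hypothetical placeholders, not an actual NeMo Curator API.
    """
    for group in batched(tasks, batch_size):
        # I/O still happens on small (e.g. 256-512 MiB) blocks...
        frames = [read_fn(task) for task in group]
        # ...but the shuffler receives several blocks per insert, which
        # is where the expected shuffle-performance win would come from.
        shuffler.insert(frames)
    return shuffler.extract()
```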

