Skip to content

Reduce overhead of Partitioned hash join #19789

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge?

Currently, a high part of the cost of the Partitioned join is repartitioning the entire build and probe side of the join - which currently copies all of the columns twice(!) during take and coalesce, this makes this hash join slower if this side is large/wide or both. We could avoid one copy in RepartitionExec (using arrow-rs coalesce API when fully implemented), but not two.

CollectLeft avoids this cost at the right side, but the building phase is single threaded, which greatly limits the parallelism in the query - the query/partitions needs to wait until the entire build side is finished, or some buffering/eager evaluation as implemented here #19761 might yield some

Describe the solution you'd like

We should be able to do the hash % partition of the probe side during the join, avoiding the need for a RepartitionExec. We only pass the indices of the matching partitions to the correct partition in the left - which should greatly reduce the overhead of the repartitioning.

When this is implemented, we might want to look at hash_join_single_partition_threshold and hash_join_single_partition_threshold_rows again which could be reduced to make most joins run fully in parallel./

Originally posted by @Dandandan in #19761 (comment)

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions