-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
Currently, a high part of the cost of the Partitioned join is repartitioning the entire build and probe side of the join - which currently copies all of the columns twice(!) during take and coalesce, this makes this hash join slower if this side is large/wide or both. We could avoid one copy in RepartitionExec (using arrow-rs coalesce API when fully implemented), but not two.
CollectLeft avoids this cost at the right side, but the building phase is single threaded, which greatly limits the parallelism in the query - the query/partitions needs to wait until the entire build side is finished, or some buffering/eager evaluation as implemented here #19761 might yield some
Describe the solution you'd like
We should be able to do the hash % partition of the probe side during the join, avoiding the need for a RepartitionExec. We only pass the indices of the matching partitions to the correct partition in the left - which should greatly reduce the overhead of the repartitioning.
When this is implemented, we might want to look at hash_join_single_partition_threshold and hash_join_single_partition_threshold_rows again which could be reduced to make most joins run fully in parallel./
Originally posted by @Dandandan in #19761 (comment)