Commit 5419ff5
feat: hash partitioning satisfies subset (#19304)
## Which issue does this PR close?
- Closes #19269.
## Rationale for this change
See issue #19269 for deeper rationale.
DF did not have the notion that data hash-partitioned on a subset of the
required partitioning keys already satisfies the requirement (for example,
data partitioned on Hash(a) satisfies a required Hash(a, b)). Having this
logic eliminates unnecessary repartitions and, in turn, other operators
such as partial aggregations.
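The property can be checked with a small standalone sketch. It is not DataFusion code: the function names are hypothetical, rows are plain `(a, b)` integer pairs, and the standard library's `DefaultHasher` stands in for the engine's hash function. It distributes rows by hashing `a` alone and verifies that every distinct `(a, b)` key still ends up in a single partition, which is what a requirement of Hash(a, b) needs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Assigns a row to a partition by hashing only column `a`.
/// (Illustrative helper; DataFusion uses its own hashing.)
fn partition_by_a(a: i64, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    a.hash(&mut hasher);
    hasher.finish() % num_partitions
}

/// Returns true if every distinct `(a, b)` key lands in exactly one
/// partition when rows are distributed by `Hash(a)` alone.
fn hash_a_colocates_a_b(rows: &[(i64, i64)], num_partitions: u64) -> bool {
    let mut seen: HashMap<(i64, i64), u64> = HashMap::new();
    for &(a, b) in rows {
        let p = partition_by_a(a, num_partitions);
        // The first partition recorded for a key must be the only one.
        if *seen.entry((a, b)).or_insert(p) != p {
            return false;
        }
    }
    true
}
```

Rows that agree on `(a, b)` necessarily agree on `a`, so they hash to the same partition; the check therefore holds for any input and any partition count.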
I introduced this behavior with the `repartition_subset_satisfactions`
flag (default false) as there are some cases where repartitioning may
still be wanted when we satisfy partitioning via this subset property.
In particular, if when partitioned via Hash(a) there is data skew but
when partitioned on Hash(a, b) there is better distribution, a user may
want to turn this optimization off.
I also made it so that, even when the required partitioning is satisfied
via a subset, we still repartition if the current number of partitions is
less than target_partitions, to maintain and increase parallelism in the
system when possible.
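The rules described so far can be summarized as a small decision function. This is a simplified sketch, not the optimizer's actual code: the `Satisfaction` enum and `needs_repartition` are hypothetical names, and the sketch assumes that setting `repartition_subset_satisfactions` to true forces a repartition on a subset match, which is how the description above reads.

```rust
/// How the existing partitioning relates to the required one.
/// Hypothetical names for illustration, not DataFusion's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Satisfaction {
    Exact,
    Subset,
    NotSatisfied,
}

/// Decides whether a repartition should still be inserted.
/// Assumption: `repartition_subset_satisfactions == true` means
/// "repartition even when a subset match would be enough".
fn needs_repartition(
    satisfaction: Satisfaction,
    current_partitions: usize,
    target_partitions: usize,
    repartition_subset_satisfactions: bool,
) -> bool {
    match satisfaction {
        // Simplification: an exact match never needs a repartition here.
        Satisfaction::Exact => false,
        Satisfaction::NotSatisfied => true,
        // A subset match is accepted only if the flag is off and
        // parallelism would not drop below the target.
        Satisfaction::Subset => {
            repartition_subset_satisfactions || current_partitions < target_partitions
        }
    }
}
```

With the flag at its default of false, a subset match on 8 partitions with a target of 8 skips the repartition, while the same match on 4 partitions still repartitions.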
## What changes are included in this PR?
- Modified `satisfy()` logic to check for subsets and return an enum
describing the type of match: exact, subset, or none
- In `EnforceDistribution`, where `satisfy()` is called, do not allow
subset logic for partitioned join operators: the partitioning on each side
must match exactly, so a repartition is still needed when the match is only
a subset
- Created unit tests and sqllogictests
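The first two changes can be illustrated with a minimal sketch. The names below (`HashMatch`, this `satisfy` signature, `is_satisfied`) are hypothetical and operate on column names rather than physical expressions; the real `satisfy()` also has to account for things like partition counts and equivalence properties, which are omitted here.

```rust
/// Kind of match between existing and required hash keys.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashMatch {
    Exact,
    Subset,
    NoMatch,
}

/// Compares the hash keys the data is already partitioned on (`existing`)
/// with the hash keys an operator requires (`required`).
fn satisfy(existing: &[&str], required: &[&str]) -> HashMatch {
    if existing == required {
        HashMatch::Exact
    } else if !existing.is_empty() && existing.iter().all(|k| required.contains(k)) {
        // Every existing key is among the required keys, e.g.
        // Hash(a) against a required Hash(a, b).
        HashMatch::Subset
    } else {
        HashMatch::NoMatch
    }
}

/// Partitioned joins need both inputs partitioned identically, so only an
/// exact match is accepted; other operators may also accept a subset match.
fn is_satisfied(m: HashMatch, is_partitioned_join: bool) -> bool {
    match m {
        HashMatch::Exact => true,
        HashMatch::Subset => !is_partitioned_join,
        HashMatch::NoMatch => false,
    }
}
```

Returning a three-valued result instead of a boolean lets the caller decide per operator how strict to be, which is what the join case requires.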
## Are these changes tested?
- Unit test
- sqllogictest
- tpch correctness
### Benchmarks
I did not see any drastic changes in benches, but the shuffle
eliminations will be great improvements for distributed DF.
<img width="628" height="762" alt="Screenshot 2025-12-12 at 8 28 15 PM"
src="https://github.com/user-attachments/assets/4b42945f-34e0-46c9-a4ce-e7ccdd0c0603"
/>
<img width="490" height="746" alt="Screenshot 2025-12-12 at 8 30 15 PM"
src="https://github.com/user-attachments/assets/846aef1b-8c5d-462d-83e7-7fa1e2a9372e"
/>
## Are there any user-facing changes?
Yes, users will now have the `repartition_subset_satisfactions` option
as described in this PR.
---------
Co-authored-by: Andrew Lamb <[email protected]>
1 parent b3d2cb6 commit 5419ff5
File tree
11 files changed (+1352, -234 lines)
- datafusion
- common/src
- physical-expr/src
- physical-optimizer/src
- sqllogictest/test_files
- tpch/plans
- docs/source/user-guide