[SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join #54330
peter-toth wants to merge 6 commits into apache:master
    case h: HashPartitioningLike => expandOutputPartitioning(h)
    case c: PartitioningCollection => expandOutputPartitioning(c)
    case other => other
    val expandedPartitioning = expandOutputPartitioning(streamedPlan.outputPartitioning)
BroadcastHashJoinExec related changes are extracted to a separate PR: #54335, since those changes are valid on their own.
This PR requires and contains the changes of #54335. Once that PR is merged I will rebase this one.
Why don't you merge #54335, @peter-toth? You already got the required community approval on your PR.
szehon-ho left a comment:
I think the CoalescedRDD is a good idea, but it feels a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.
    * In addition, its length must be the same as the number of Spark partitions (and thus is a 1-1
    * mapping), and each row in `partitionValues` must be unique.
    * Represents a partitioning where rows are split across partitions based on transforms defined by
    * `expressions`. `partitionKeys`, should contain value of partition key(s) in ascending order,
nit: seems we lost 'if defined', so the comma doesn't make sense anymore
    * `partitionKeys` unique.
    *
    * The `originalPartitionValues`, on the other hand, are partition values from the original input
    * The `originalPartitionKeys`, on the other hand, are partition values from the original input
nit: looks like 'originalPartitionKeys' is a copy of 'partitionKeys' before any grouping is applied. Can we clarify the comment, as it's not clear from it currently?
I asked Claude to rewrite and clarify the documentation of KeyedPartitioning: d58cafe. I think its version sounds even better, but let me know.
    lazy val firstKeyedPartitioning = {
      child.outputPartitioning.asInstanceOf[Partitioning with Expression].collectFirst {
        case k: KeyedPartitioning => k
      }.get
nit: how about adding `.getOrElse(throw new SparkException("requires child with KeyedPartitioning"))` to make it clearer when the error happens?
      readerStateThreadLocal.remove()
      }
    } else {
      readerState.metrics.foreach(reader.initMetricsValues)
Are we missing a close, maybe here, for the old readerState.reader?
cc @viirya, who I believe fixed a memory leak here, to also take a look at the new approach.
Indeed. I refined ReaderState and now close readerState.reader properly in 5ca18c3.
Overall I like the GroupPartitionsExec idea, but it would definitely be good to have some of @sunchao, @viirya, @chirag-s-db, @cloud-fan also take a look.
Sorry @dongjoon-hyun, I didn't notice your approval yesterday. Thanks for your review! @viirya requested a small change just now; once CI completes I will merge that PR and rebase this one.
Initially I wanted to add a new RDD for
### What changes were proposed in this pull request?
This is a minor refactor of `BroadcastHashJoinExec.outputPartitioning` to:
- simplify the logic and
- make it future proof by using `Partitioning with Expression` instead of `HashPartitioningLike`.

### Why are the changes needed?
Code cleanup and to add support for future partitionings that implement `Expression` but not `HashPartitioningLike` (like `KeyedPartitioning` in #54330).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #54335 from peter-toth/SPARK-55551-improve-broadcasthashjoinexec-output-partitioning.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
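For illustration only, a minimal sketch of the kind of simplification described above (not the actual patch; the helper body is elided and assumed to live inside `BroadcastHashJoinExec`, where `Partitioning`, `Expression` and `streamedPlan` are in scope):

```scala
// Sketch only: instead of matching HashPartitioningLike and PartitioningCollection
// separately at the call site, a single helper handles any `Partitioning with Expression`,
// so future partitionings that implement Expression (e.g. KeyedPartitioning) are covered.
private def expandOutputPartitioning(partitioning: Partitioning): Partitioning =
  partitioning match {
    case p: Partitioning with Expression => p // placeholder for the key-expansion logic
    case other => other
  }

val expandedPartitioning = expandOutputPartitioning(streamedPlan.outputPartitioning)
```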
#54335 is merged and I've rebased this PR on the latest master.
Thanks for the refactor. I was actually wondering if this approach works (using Cursor generation, please double check if it makes sense). It's more localized to the SPJ case, and then that can be used in GroupPartitionsExec.
The current code is definitely more Spark-native, reusing CoalescedRDD, but my doubt is the ThreadLocal and the chance for a memory leak like the one fixed by @viirya in #51503. But I'll defer to others if people like this approach more.
edit: I guess it's what you are saying you considered in your previous comment.
BTW, I didn't get this: are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?
As far as I see, you assume that the child is a … Also, even if there is a …
Not necessarily a leak, but there are some issues with custom metrics reporting and with when the readers get closed. Consider the following plan (without this PR): we have only 1 task in the stage due to coalesce(1), and that task calls `DataSourceRDD.compute()` multiple times.
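For illustration, a hypothetical query that produces a plan of the shape described (the source name is made up; any DSv2 source with several input partitions and an active `spark` session are assumed):

```scala
// Hypothetical repro shape, not the original snippet: a multi-partition DSv2 scan
// immediately followed by coalesce(1). The coalesced stage has a single task, which
// computes several DataSourceRDD partitions, so DataSourceRDD.compute() runs more
// than once in that task and each run registers its own completion listener.
val df = spark.read
  .format("com.example.MultiPartitionSource") // hypothetical DSv2 source
  .load()
  .coalesce(1)

df.collect()
```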
I see, sorry, I forgot about that case. Interesting, so you mean we are losing metrics. Should we at least add a test? It may make sense to do in a separate PR, but that depends on the final approach. The approach does make sense; I am just a bit unsure if ThreadLocal is the best/safest approach, considering the risk of introducing a memory leak. As you can see it's a bit tricky, but I am not so familiar with the DataSourceRDD code.
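For context, a minimal sketch of the ThreadLocal pattern being debated (hypothetical names, not the actual DataSourceRDD code): the first reader seen in a task registers a single completion listener, and later compute() calls in the same task close and replace the previous reader.

```scala
import java.io.Closeable

import org.apache.spark.TaskContext

// Hypothetical sketch of the discussed pattern, not the actual DataSourceRDD change.
object ReaderTracker {
  private val currentReader = new ThreadLocal[Closeable]()

  def setCurrentReader(reader: Closeable): Unit = {
    val previous = currentReader.get()
    if (previous == null) {
      // First reader seen in this task: install exactly one cleanup listener.
      TaskContext.get().addTaskCompletionListener[Unit] { _ =>
        Option(currentReader.get()).foreach(_.close())
        currentReader.remove() // avoid leaking state into reused executor threads
      }
    } else {
      // Later compute() call in the same task: close the old reader before replacing it.
      previous.close()
    }
    currentReader.set(reader)
  }
}
```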
Sure, let me add a test tomorrow, and maybe someone can come up with a better idea to fix it.
I extracted the metrics reporting bug / fix to SPARK-55619 / #54396 and added a new test.
Thank you!
### What changes were proposed in this pull request?
This PR extracts partition grouping logic from `BatchScanExec` to a new `GroupPartitionsExec` operator and replaces `KeyGroupedPartitioning` with `KeyedPartitioning`.
- `KeyedPartitioning` represents a partitioning where partition keys are known. It can be grouped (clustered) by partition keys or not.
- When grouping is required, the new operator can be inserted into a plan at any place (similarly to how exchanges are inserted to satisfy expected distributions), creating the necessary grouped/replicated partitions by keys.
- `GroupPartitionsExec` uses the already existing `CoalescedRDD` with a new `GroupedPartitionCoalescer` (see the sketch after this list).
- `DataSourceRDD` is simplified back towards its pre-SPJ form, but the change revealed a bug with custom metrics reporting. It turned out that if there is a coalesce operation right after a scan (which is the basis of how partition grouping in this implementation works, but can also happen without this PR when the coalesce is explicitly added to the plan), then multiple partitions (i.e. multiple partition groups before this PR) of the scan can belong to a given Spark task. This is because the number of tasks is defined by the coalesce operation. This means that `DataSourceRDD.compute()` can be called from a task multiple times and each call adds a new `TaskCompletionListener`, which causes reporting of conflicting metrics (one set of metrics per scan partition). This PR fixes the bug using a `ThreadLocal` to track the current scan partition reader and to install only one listener per task.
- `PartitionKey` is used instead of the previous `PartitionValues`, to be in sync with the DSv2 `HasPartitionKey` interface.
- `StoragePartitionJoinParams` is not required in `BatchScanExec`; its fields are now part of the new `GroupPartitionsExec` operator.
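As an illustration of the `CoalescedRDD`-based grouping mentioned in the list above, a minimal custom `PartitionCoalescer` that maps precomputed groups of parent partition indices to output partitions (a simplified stand-in; the real `GroupedPartitionCoalescer` may differ):

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Simplified stand-in for the idea: `groups` lists, per output partition, the indices of
// the parent (scan) partitions that share a partition key and must be read together.
class IndexGroupCoalescer(groups: Seq[Seq[Int]]) extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    groups.map { indices =>
      val group = new PartitionGroup()
      indices.foreach(i => group.partitions += parent.partitions(i))
      group
    }.toArray
  }
}

// Usage: grouping stays a narrow dependency, so no shuffle is introduced.
// scanRdd.coalesce(groups.length, shuffle = false, Some(new IndexGroupCoalescer(groups)))
```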
### Why are the changes needed?
To solve the issue of unnecessary partition grouping (#53859) and to simplify the KGP/SPJ implementation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs adjusted and new UTs from #53859.
### Was this patch authored or co-authored using generative AI tooling?
Yes, documentation and some helpers were added by Claude.