[SPARK-54593][SQL] Expand DPP for LogicalRelation and LogicalRDD #53324

dwsmith1983 · 2025-12-04T13:09:50Z

What changes were proposed in this pull request?

This PR extends Dynamic Partition Pruning (DPP) support to include LocalRelation and LogicalRDD as selective predicates in the PartitionPruning optimizer rule.

- Modified hasSelectivePredicate() to treat LocalRelation and LogicalRDD as selective predicates.
- Modified calculatePlanOverhead() to handle LocalRelation and LogicalRDD with statistics as cached data sources with zero overhead.
- Added helper method isLogicalRDDWithStats() to distinguish LogicalRDDs with materialized statistics from those with default estimates.
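
For readers who want a concrete picture of the three changes listed above, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Catalyst: the plan node classes are toy stand-ins, and `DefaultSizeInBytes` and the `originStats: Option[Long]` field are assumptions made purely for illustration. Only the method names mirror the ones described above.

```scala
// Toy model of the logic described in this PR. All classes here are
// hypothetical stand-ins, NOT Spark's real Catalyst plan nodes.
object DppSketch {
  sealed trait Plan { def children: Seq[Plan] }
  final case class Scan(table: String, sizeInBytes: Long) extends Plan {
    def children: Seq[Plan] = Nil
  }
  final case class Filter(selective: Boolean, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }
  final case class LocalRelation(rows: Int) extends Plan {
    def children: Seq[Plan] = Nil
  }
  // originStats is Some(size) when statistics were materialized (assumption).
  final case class LogicalRDD(originStats: Option[Long]) extends Plan {
    def children: Seq[Plan] = Nil
  }

  // Assumed fallback estimate for a leaf without materialized statistics.
  val DefaultSizeInBytes: Long = Long.MaxValue / 2

  // True if any node in the tree satisfies p.
  private def exists(plan: Plan)(p: Plan => Boolean): Boolean =
    p(plan) || plan.children.exists(c => exists(c)(p))

  private def leaves(plan: Plan): Seq[Plan] =
    if (plan.children.isEmpty) Seq(plan) else plan.children.flatMap(leaves)

  // Distinguishes a LogicalRDD with materialized stats from one without.
  def isLogicalRDDWithStats(plan: Plan): Boolean = plan match {
    case LogicalRDD(stats) => stats.isDefined
    case _ => false
  }

  // LocalRelation and LogicalRDD count as selective, alongside Filter.
  def hasSelectivePredicate(plan: Plan): Boolean = exists(plan) {
    case Filter(selective, _) => selective
    case _: LocalRelation => true
    case _: LogicalRDD => true
    case _ => false
  }

  // LocalRelation and LogicalRDD-with-stats leaves cost nothing to re-read;
  // other leaves contribute their (estimated) scan size.
  def calculatePlanOverhead(plan: Plan): Long = leaves(plan).map {
    case _: LocalRelation => 0L
    case rdd: LogicalRDD if isLogicalRDDWithStats(rdd) => 0L
    case _: LogicalRDD => DefaultSizeInBytes
    case Scan(_, size) => size
    case _ => 0L
  }.sum
}
```

In this model, a `LogicalRDD(None)` still counts as selective but carries a large assumed overhead, which is one plausible way the pruning benefit check could reject it; the actual mechanism in the patch may differ.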

https://issues.apache.org/jira/browse/SPARK-54593

Why are the changes needed?

This builds on the previous change and Jira ticket: #53263 and https://issues.apache.org/jira/browse/SPARK-54554

LocalRelation (from VALUES clauses) and LogicalRDD (from checkpoint or createDataFrame with statistics) represent small, materialized datasets that are ideal candidates for DPP optimization. However, the current implementation recognizes only Filter nodes as selective predicates, not these node types, and so misses optimization opportunities in broadcast joins.

By enabling DPP for these cases, queries joining partitioned tables with small in-memory datasets can benefit from runtime partition pruning, reducing data scanning and improving query performance.

Does this PR introduce any user-facing change?

No. This is a pure optimizer enhancement. Users may observe improved query performance for joins between partitioned tables and small datasets created via VALUES clauses or checkpoint operations, but there are no API or behavioral changes.

How was this patch tested?

Added 5 comprehensive tests to DynamicPartitionPruningSuite:

- DPP with LocalRelation in broadcast join: verifies DPP triggers for a VALUES clause.
- DPP with LogicalRDD from cached DataFrame: verifies DPP triggers for createDataFrame with an RDD.
- DPP with empty LocalRelation: ensures empty datasets don't cause failures.
- DPP should not trigger for LogicalRDD without originStats: negative test verifying that a LogicalRDD without statistics doesn't trigger DPP.
- DPP with large LocalRelation: verifies DPP works with multiple values.

All tests explicitly verify DynamicPruningSubquery appears (or doesn't appear) in the optimized logical plan and use exact result verification with checkAnswer. All existing tests continue to pass.

Was this patch authored or co-authored using generative AI tooling?

No

expanded DPP for logical relation and RDD

49e290b

github-actions bot added the SQL label Dec 4, 2025

dwsmith1983 marked this pull request as draft December 5, 2025 02:04
