@dwsmith1983
Contributor

What changes were proposed in this pull request?

This PR extends Dynamic Partition Pruning (DPP) support to include LocalRelation and LogicalRDD as selective predicates in the PartitionPruning optimizer rule.

  1. Modified hasSelectivePredicate() to treat LocalRelation and LogicalRDD as selective predicates
  2. Modified calculatePlanOverhead() to handle LocalRelation and LogicalRDD with statistics as cached data sources with zero overhead
  3. Added helper method isLogicalRDDWithStats() to distinguish LogicalRDDs with materialized statistics from those with default estimates
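The three changes above can be sketched with a toy model. This is only an illustration, not the code in the patch: the `Plan` types below are simplified stand-ins for Catalyst nodes, `AssumedDefaultSize` is a hypothetical constant, and the sketch treats every `Filter` as selective, whereas the real rule inspects the filter condition. Only the method names `hasSelectivePredicate`, `calculatePlanOverhead`, and `isLogicalRDDWithStats` come from this PR.

```scala
// Toy stand-ins for Catalyst plan nodes. These are NOT Spark's real classes.
sealed trait Plan { def children: Seq[Plan] }
final case class Scan(table: String, sizeInBytes: Long) extends Plan {
  def children: Seq[Plan] = Nil
}
final case class Filter(condition: String, child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}
final case class Project(columns: Seq[String], child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}
final case class LocalRelation(rows: Seq[Seq[Any]]) extends Plan {
  def children: Seq[Plan] = Nil
}
// originStats models "materialized statistics are available" (size in bytes).
final case class LogicalRDD(name: String, originStats: Option[Long]) extends Plan {
  def children: Seq[Plan] = Nil
}

object PruningSketch {
  // Hypothetical fallback size for an RDD without materialized statistics.
  val AssumedDefaultSize: Long = 8L * 1024 * 1024 * 1024

  // True only for a LogicalRDD that carries materialized statistics.
  def isLogicalRDDWithStats(plan: Plan): Boolean = plan match {
    case LogicalRDD(_, Some(_)) => true
    case _                      => false
  }

  // A plan is "selective" if it contains a Filter, a LocalRelation,
  // or a LogicalRDD with statistics anywhere in its tree.
  def hasSelectivePredicate(plan: Plan): Boolean = plan match {
    case _: Filter        => true // simplification: real rule checks the condition
    case _: LocalRelation => true
    case rdd: LogicalRDD  => isLogicalRDDWithStats(rdd)
    case other            => other.children.exists(hasSelectivePredicate)
  }

  // In-memory inputs cost nothing to re-read; scans cost their size.
  def calculatePlanOverhead(plan: Plan): Long = plan match {
    case _: LocalRelation                              => 0L
    case rdd: LogicalRDD if isLogicalRDDWithStats(rdd) => 0L
    case _: LogicalRDD                                 => AssumedDefaultSize
    case Scan(_, size)                                 => size
    case other => other.children.map(calculatePlanOverhead).sum
  }
}
```

In this model, a `Project` over a `LocalRelation` counts as selective with zero overhead, while a `LogicalRDD` whose `originStats` is `None` does not count as selective and is charged the assumed default size.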

https://issues.apache.org/jira/browse/SPARK-54593

Why are the changes needed?

This expands on previous work: #53263 and https://issues.apache.org/jira/browse/SPARK-54554

LocalRelation (from VALUES clauses) and LogicalRDD (from checkpoint or createDataFrame with statistics) represent small, materialized datasets that are ideal candidates for DPP optimization. However, the current implementation recognizes only Filter nodes as selective predicates, not these node types, so it misses optimization opportunities in broadcast joins.

By enabling DPP for these cases, queries joining partitioned tables with small in-memory datasets can benefit from runtime partition pruning, reducing data scanning and improving query performance.

Does this PR introduce any user-facing change?

No. This is a pure optimizer enhancement. Users may observe improved query performance for joins between partitioned tables and small datasets created via VALUES clauses or checkpoint operations, but there are no API or behavioral changes.

How was this patch tested?

Added 5 comprehensive tests to DynamicPartitionPruningSuite:

  1. DPP with LocalRelation in broadcast join: Verifies DPP triggers for VALUES clause
  2. DPP with LogicalRDD from cached DataFrame: Verifies DPP triggers for createDataFrame with RDD
  3. DPP with empty LocalRelation: Ensures empty datasets don't cause failures
  4. DPP should not trigger for LogicalRDD without originStats: Negative test verifying LogicalRDD without statistics doesn't trigger DPP
  5. DPP with large LocalRelation: Verifies DPP works with multiple values

All tests explicitly verify DynamicPruningSubquery appears (or doesn't appear) in the optimized logical plan and use exact result verification with checkAnswer. All existing tests continue to pass.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 4, 2025
@dwsmith1983 dwsmith1983 marked this pull request as draft December 5, 2025 02:04