Commit ab44eb9

Extract correlated filters from the bottom of a hybrid tree into a join condition (#404)
Series of optimizations/adjustments to hybrid trees and relational trees, largely to improve the final SQL:
- Runs a pass to extract as many correlated filter conditions as possible into the join keys/condition of the subtree. The pass runs right before de-correlation.
- Overrides `unnest_subqueries.py` from SQLGlot with a customized version that avoids SQLGlot's de-correlation procedure when it would produce `ARRAY_AGG` + `ARRAY_ANY` solutions (caused by a ripple effect of the new pass).
- Adjusts de-correlation to preserve the join condition/keys from the now-replaced connection logic by adding them into the filters of the new subtree. This ensures that any formerly correlated conditions that were pulled into the join are ejected back into the subtree as non-correlated filters if the subtree still gets de-correlated.
- Adds the ability to remove redundant filters atop a join if they are already present inside the join.
- Adds the ability to push filters atop a join into the join condition, unless the filters contain window functions.
- Adjusts how cardinality works with join filters to improve when joins are/aren't pruned, including accounting for filters that get pushed into a join's condition.
- Adds two new correl tests to cover some of the newer patterns (specifically with hybrid correlation extraction), adds all correl tests as SQL tests (only using SQLite), and deletes dead testing files.
- Splits up some of the logic of `pull_project_into_aggregate` so that it runs even if the input to an aggregate is not a project (specifically the parts that simplify aggregate calls). Also replaces casting booleans to integers in that helper with `IFF(x, 1, 0)`.
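The first item above describes moving correlated filter conditions out of a subtree and into the join condition. The following is a toy sketch of that partitioning idea in plain Python. It is not PyDough's actual pass, and it ignores the restriction to filters at the bottom of the hybrid tree. The `Cond` class, the `"parent"`/`"child"` side labels, and `split_correlated` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cond:
    """Hypothetical filter condition: its text plus the sides it references."""

    text: str
    sides: frozenset  # subset of {"parent", "child"}


def split_correlated(filters):
    """Partition subtree filters into (join_conditions, local_filters).

    Assumption for this sketch: a filter counts as correlated when it
    mentions the parent side. Correlated filters move into the join
    condition; the rest stay behind as ordinary subtree filters.
    """
    join_conds = [f for f in filters if "parent" in f.sides]
    local = [f for f in filters if "parent" not in f.sides]
    return join_conds, local


filters = [
    Cond("child.region = parent.region", frozenset({"parent", "child"})),
    Cond("child.status = 'OPEN'", frozenset({"child"})),
]
join_conds, local = split_correlated(filters)
```

In this toy model, the region comparison ends up in `join_conds` and the status check stays in `local`, so the subtree no longer references its parent.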
1 parent 199a753 commit ab44eb9

270 files changed, with 3431 additions and 1745 deletions.

pydough/conversion/filter_pushdown.py

Lines changed: 26 additions & 0 deletions
@@ -5,8 +5,10 @@
 __all__ = ["push_filters"]


+import pydough.pydough_operators as pydop
 from pydough.relational import (
     Aggregate,
+    CallExpression,
     ColumnReference,
     EmptySingleton,
     Filter,
@@ -240,6 +242,30 @@ def visit_join(self, join: Join) -> RelationalNode:
             self.filters = pushable_filters
             new_inputs.append(child.accept_shuttle(self))

+        # If there are no window functions in the filters, push any remaining
+        # filters into the join condition if it is an inner join.
+        if (
+            join_type == JoinType.INNER
+            and len(remaining_filters) > 0
+            and not any(contains_window(expr) for expr in remaining_filters)
+        ):
+            transposer.keep_input_names = True
+            new_conjunction: set[RelationalExpression] = set()
+            for expr in remaining_filters:
+                new_conjunction.add(expr.accept_shuttle(transposer))
+            if (
+                isinstance(join._condition, CallExpression)
+                and join._condition.op == pydop.BAN
+            ):
+                new_conjunction.update(join._condition.inputs)
+            else:
+                new_conjunction.add(join._condition)
+            cardinality = join.cardinality.add_filter()
+            join._condition = RelationalExpression.form_conjunction(
+                sorted(new_conjunction, key=repr)
+            )
+            remaining_filters = set()
+
         # Materialize all of the remaining filters on top of a new join with
         # the new inputs.
         new_node = join.copy(inputs=new_inputs)
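The hunk above folds the remaining filters into the join condition, flattening an existing top-level conjunction and sorting by `repr` so the result is deterministic. A minimal standalone sketch of that merging step might look like the following. Expressions are modeled as plain strings and a conjunction as the tuple `("AND", [...])`. Both are illustrative stand-ins rather than PyDough's `RelationalExpression` classes, and `merge_into_condition` is a hypothetical helper name.

```python
def merge_into_condition(condition, remaining_filters):
    """Fold extra filters into a join condition, flattening a top-level AND.

    Assumptions for this sketch: expressions are strings, and a
    conjunction is the tuple ("AND", [operands]).
    """
    conjuncts = set(remaining_filters)
    if isinstance(condition, tuple) and condition[0] == "AND":
        # Existing conjunction: reuse its operands instead of nesting ANDs.
        conjuncts.update(condition[1])
    else:
        conjuncts.add(condition)
    # Sorting by repr gives a stable order regardless of set iteration order.
    ordered = sorted(conjuncts, key=repr)
    return ordered[0] if len(ordered) == 1 else ("AND", ordered)


merged = merge_into_condition(("AND", ["a = b", "c = d"]), {"a = b", "e > 5"})
```

Because the conjuncts are collected in a set, a filter that already appears in the join condition (`"a = b"` here) is not duplicated in the merged result.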

pydough/conversion/hybrid_connection.py

Lines changed: 43 additions & 0 deletions
@@ -316,11 +316,54 @@ class HybridConnection:
     """

     parent: "HybridTree"
+    """
+    The HybridTree that the connection exists within.
+    """
+
     subtree: "HybridTree"
+    """
+    The HybridTree corresponding to the child itself, starting from the bottom.
+    """
+
     connection_type: ConnectionType
+    """
+    The type of connection that the child subtree is being accessed with
+    relative to the parent hybrid tree.
+    """
+
     min_steps: int
+    """
+    The step in the parent pipeline that this connection MUST be defined after.
+    """
+
     max_steps: int
+    """
+    The step in the parent pipeline that this connection MUST be defined before.
+    """
+
     aggs: dict[str, HybridFunctionExpr]
+    """
+    A dictionary storing information about how to aggregate the child before
+    joining it to the parent. The keys are the names of the aggregation calls
+    within the child subtree, and the values are the corresponding aggregation
+    expressions defined relative to the child subtree.
+    """
+
+    always_exists: bool | None = None
+    """
+    Whether the connection is guaranteed to have at least one matching
+    record for every parent record. If None, this is unknown and must be
+    inferred from the subtree (can be stored and then modified later).
+    """
+
+    def get_always_exists(self) -> bool:
+        """
+        Returns whether the connection is guaranteed to have at least one
+        matching record for every parent record.
+        """
+        if self.always_exists is None:
+            self.always_exists = self.subtree.always_exists()
+        return self.always_exists

     def __eq__(self, other):
         return (
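The new `always_exists` field is a tri-state flag: `None` means "not yet known", and `get_always_exists` infers and caches the answer on first use. The sketch below isolates that pattern in a self-contained form. The `Connection` class and its `infer` callable are hypothetical stand-ins for `HybridConnection` and `subtree.always_exists()`, not the real PyDough classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Connection:
    """Hypothetical stand-in for a connection with a lazily inferred flag."""

    infer: Callable[[], bool]  # assumption: plays the role of subtree.always_exists()
    always_exists: Optional[bool] = None

    def get_always_exists(self) -> bool:
        # Infer only when the value is unknown, then reuse the stored answer.
        if self.always_exists is None:
            self.always_exists = self.infer()
        return self.always_exists


calls = []
conn = Connection(infer=lambda: calls.append(1) or True)
first = conn.get_always_exists()
second = conn.get_always_exists()

# An explicitly stored value takes precedence over inference.
overridden = Connection(infer=lambda: True, always_exists=False)
```

One consequence of this design is that a later pass can overwrite the stored value (for example after a filter is moved), and subsequent calls return the overwritten value rather than re-inferring it.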
