@leonardgeissler
Collaborator

Adds split-up junctions and sorting for percentile predicates, and changes the transformer type.

I will add the regression tests in the coming days.
This PR is not needed for SIGMOD because the performance gains are only noticeable for very complex queries in exact mode.

However, I wanted to open it now because it has been ready for a while and was only waiting for the other PR to be merged.

fix

add pp result size estimation

add num_workers constants

change costs

change transformer type

fix: change stop points

fix

add pp result size estimation

change threading

add num_workers constants

update constants

change costs

add optimizer

change transformer type
@leonardgeissler leonardgeissler requested review from Copilot and lbhm June 10, 2025 14:52

Copilot AI left a comment

Pull Request Overview

This PR introduces a new split-up junction rule alongside improved cost estimation for percentile predicates, refactors filtering limits to account for multiple workers, and updates executors to propagate a num_workers parameter.

  • Optimizer: adds split_up_junctions flag and rule; extends CostSorter with a regression model for percentile_op.
  • Executors: injects num_workers into threaded and prefiltering executors; centralizes filtering thresholds via get_filtering_stop_point.
  • Tests: updated test_optimizer.py and test assets to exercise the new junction splitting behavior.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| backend/backend/engine/optimizer.py | Added split_up_junctions flag/rule and percentile regression in CostSorter |
| backend/backend/engine/constants.py | Refactored FILTERING_STOP_POINTS into nested mapping; added get_filtering_stop_point |
| backend/backend/engine/execution/common.py | Switched exceeds_filtering_limit to use get_filtering_stop_point and ids.size |
| backend/backend/engine/execution/threaded_prefiltering_executor.py | Introduced num_workers parameter and threaded prefiltering logic updates |
| backend/backend/engine/execution/threaded_executor.py | Simplified thread-pool task submission and removed unused _thread_results |
| backend/backend/engine/execution/simple_executor.py | Switched from Transformer to Transformer_NonRecursive |
| backend/backend/engine/execution/prefiltering_executor.py | Added num_workers support; updated to non-recursive transformer |
| backend/tests/test_optimizer.py | Updated optimizer instantiations to pass split_up_junctions |
| backend/tests/assets/test_cases_optimizer.py | Adjusted nested-junction test case to reflect split-up behavior |

Comments suppressed due to low confidence (2)

backend/backend/engine/execution/threaded_prefiltering_executor.py:42

  • Adding a required num_workers parameter in the constructor without a default breaks backward compatibility. Consider providing a default value (e.g., 1) or making num_workers keyword-only to avoid API breakage.
def __init__(self, write_group: int, fainder_mode: FainderMode, num_workers: int,
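The backward-compatible variant suggested above could look roughly like the following. This is a minimal sketch, not the project's executor: the class name `ThreadedPrefilteringExecutorSketch`, the placeholder `FainderMode` enum with its single `EXACT` member, and the validation check are all assumptions made for illustration. Only the parameter names `write_group`, `fainder_mode`, and `num_workers` come from the PR.

```python
from enum import Enum


class FainderMode(Enum):
    """Placeholder enum; the real FainderMode members are not shown in this PR."""

    EXACT = "exact"


class ThreadedPrefilteringExecutorSketch:
    """Hypothetical stand-in that only demonstrates the constructor signature."""

    def __init__(
        self,
        write_group: int,
        fainder_mode: FainderMode,
        *,  # everything after this marker must be passed by keyword
        num_workers: int = 1,  # default keeps existing two-argument call sites working
    ) -> None:
        if num_workers < 1:
            raise ValueError("num_workers must be at least 1")
        self.write_group = write_group
        self.fainder_mode = fainder_mode
        self.num_workers = num_workers


# Existing call sites keep working, new ones opt in explicitly.
legacy = ThreadedPrefilteringExecutorSketch(0, FainderMode.EXACT)
parallel = ThreadedPrefilteringExecutorSketch(0, FainderMode.EXACT, num_workers=4)
```

Making the parameter keyword-only also prevents a caller from accidentally passing a worker count in a positional slot that a later refactoring might reorder.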

backend/tests/test_optimizer.py:14

  • There aren't any standalone tests that verify the split-up junction behavior in isolation. Consider adding a focused test that uses split_up_junctions=True on a simple multi-term junction to validate correct binary splitting.
optimizer = Optimizer(cost_sorting=True, keyword_merging=False, split_up_junctions=False)

if comparison in {"gt", "ge"}:
    percentile = 1 - percentile  # Invert percentile for gt/ge comparisons

# Formular for the regression model for le

Copilot AI Jun 10, 2025

Typo in the comment: "Formular" should be "Formula".

Suggested change
- # Formular for the regression model for le
+ # Formula for the regression model for le
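For context on the snippet this comment is attached to: the code under review appears to estimate costs with a model fitted for `le` and to handle `gt`/`ge` by mirroring the percentile. A minimal sketch of that normalization step follows, assuming percentiles lie in [0, 1]. The function name `normalize_percentile` and the range check are hypothetical and not taken from the PR.

```python
def normalize_percentile(comparison: str, percentile: float) -> float:
    """Map a percentile onto the scale of a model fitted for `le` (hypothetical helper)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile is assumed to lie in [0, 1]")
    if comparison in {"gt", "ge"}:
        # Same inversion as in the reviewed snippet.
        return 1 - percentile
    return percentile
```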

Comment on lines +119 to +120
if len(tree.children) > 2:  # noqa: PLR2004
    # Split the disjunction into multiple rules

Copilot AI Jun 10, 2025

The split-up logic only runs once and may still leave more than two children, but the visitor won’t revisit those new nodes. Consider looping or recursively applying the split until each junction has at most two terms.

Suggested change
- if len(tree.children) > 2: # noqa: PLR2004
- # Split the disjunction into multiple rules
+ while any(isinstance(child, Tree) and child.data == "disjunction" and len(child.children) > 2 for child in tree.children): # noqa: PLR2004
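One way to guarantee that no junction keeps more than two children is to split bottom-up and fold the remaining terms into a left-nested chain. The sketch below illustrates that idea under stated assumptions: `Node` is a hypothetical stand-in for the parse-tree class used in the PR, the rule name `"conjunction"` is assumed alongside the `"disjunction"` name that appears in the suggestion, and the left-nested shape is one possible choice rather than the PR's actual output.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node (not the project's Tree class)."""

    data: str
    children: list = field(default_factory=list)


# "disjunction" appears in the review; "conjunction" is an assumed rule name.
JUNCTIONS = {"conjunction", "disjunction"}


def split_junctions(node):
    """Return a copy of `node` in which every junction has at most two children."""
    if not isinstance(node, Node):
        return node  # leaves (e.g. plain terms) are returned unchanged
    # Process children first so nested junctions are already binary.
    children = [split_junctions(child) for child in node.children]
    if node.data in JUNCTIONS and len(children) > 2:
        # Fold left: [a, b, c, d] becomes (((a, b), c), d).
        acc = Node(node.data, children[:2])
        for child in children[2:]:
            acc = Node(node.data, [acc, child])
        return acc
    return Node(node.data, children)


def max_junction_arity(node) -> int:
    """Largest number of children found on any junction in the tree."""
    if not isinstance(node, Node):
        return 0
    own = len(node.children) if node.data in JUNCTIONS else 0
    return max([own] + [max_junction_arity(child) for child in node.children])


tree = Node("disjunction", ["a", "b", Node("conjunction", ["c", "d", "e"]), "f"])
binary = split_junctions(tree)
```

Because the function recurses into children before folding, a single call handles junctions at every depth, so no revisiting loop is needed. A check like `max_junction_arity(binary) <= 2` could also serve as the assertion in a focused unit test.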

optimizer = Optimizer(cost_sorting=True, keyword_merging=True, split_up_junctions=True)
plan = deepcopy(test_case["input_tree"])

assert test_case["all_rules"] == optimizer.optimize(plan)

Copilot AI Jun 10, 2025

[nitpick] For consistency with the other tests, you might capture the result of optimizer.optimize(plan) in a variable (e.g., optimized_plan) before asserting, which improves readability.

Suggested change
- assert test_case["all_rules"] == optimizer.optimize(plan)
+ optimized_plan = optimizer.optimize(plan)
+ assert test_case["all_rules"] == optimized_plan
