Skip to content

New Optimizer

Martin Traverso edited this page Sep 2, 2015 · 15 revisions

(a place to jot down notes about the new plan representation & optimizer)

  • Query plan is modeled as a "program" using intermediate representation comprised by function calls and assignments. The logical type of each expression is some form of relation/collection/stream-of-rows.

  • For each relational expression, we can derive:

    • Logical properties such as predicate, uniqueness, type (schema), functional dependencies between fields.
    • Physical properties such as global partitioning, local ordering & grouping
  • Functions can be logical and/or physical (i.e., if they can be directly executed: join vs hash-join)

  • Possibly multiple optimizer implementations: heuristics/rewrites, cost-based, etc (TBD)

  • Cost-based optimizer

    • Cascades-style
    • Components:
      • Rules
        • Pattern + named arguments + required properties
        • Can produce multiple expressions
        • Types: logical transformation (e.g., push filter through project), implementation (join -> hash join), enforcement (sort before merge). The may not need to be explicitly identified as such.
      • Memo
        • Holds equivalence classes (name + list of expressions)
        • Memoizes optimization goals (i.e., best expression for a given equivalence class and physical requirements)
      • Cost
    • Optimization loop pseudo-code:
start:
- break up expression into single-assignment expression
- add each assignment to the memo in a separate equivalence class
- optimize(root class, unbounded cost, no physical reqs)

optimize(equivalence class, cost bound, requirements):
- initialize exploration queue (rule + top operator in equivalence class)
- find potential match candidates and add them to queue
- while queue is not empty
    - enumerate bindings for each named argument (by iterating over all expressions in each equivalence class that's part of the match)
    - if binding + physical requirements can be handled by rule
        - apply rule
        - for each expression generated by rule
            - add to memo
            - if top function is physical
                - determine cost bound for children
                - for each input
                    - derive required physical properties & cost upper bound
                    - optimize corresponding equivalence class with required properties and upper bound
                    - update max bound for remaining children 
            - find additional potential matches and enqueue

Open issues:

  • how to prioritize exploration candidates
  • memoize rule application to prevent re-exploration in case of repeated optimization calls (with different physical requirements)
  • we may need a way for a rule to short-circuit other exploration tasks for a given group (e.g., after constant folding)
  • we may need a way for a rule to prevent application of the same rule on expressions produced by the first application (e.g., join commutativity)
Clone this wiki locally