From 1967c048176f195eb50ab122f100ae133d2642a0 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 11:58:16 -0400 Subject: [PATCH 01/16] Create documentation for irs --- documentation/ir.md | 86 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 documentation/ir.md diff --git a/documentation/ir.md b/documentation/ir.md new file mode 100644 index 000000000..b45111f0a --- /dev/null +++ b/documentation/ir.md @@ -0,0 +1,86 @@ +# PyDough Internal Representations Guide + +This document describes the various IRs used by PyDough to convert raw PyDough code into SQL text, as well as the conversion processes between them. + + + +- [Overview](#overview) +- [Unqualified Nodes](#unqualified-nodes) + - [Unqualified Transform](#unqualified-transform) +- [QDAG Nodes](#qdag-nodes) + - [Qualification](#qualification) +- [Hybrid Tree](#hybrid-tree) + - [Hybrid Conversion](#hybrid-conversion) + - [Hybrid Decorrelation](#hybrid-decorrelation) +- [Relational Tree](#relational-tree) + - [Relational Conversion](#relational-conversion) + - [Relational Optimization](#relational-conversion) +- [SQLGlot AST](#sqlglot-ast) + - [SQLGlot Conversion](#sqlglot-conversion) + + + + +## Overview + +TODO + + +## Unqualified Nodes + +TODO + + +### Unqualified Transform + +TODO + + +## QDAG Nodes + +TODO + + +### Qualification + +TODO + + +## Hybrid Tree + +TODO + + +### Hybrid Conversion + +TODO + + +### Hybrid Decorrelation + +TODO + + +## Relational Tree + +TODO + + +### Relational Conversion + +TODO + + +### Relational Optimization + +TODO + + +## SQLGlot AST + +TODO + + +### SQLGlot Conversion + +TODO From e41642d3ec6896764e133673407c0f8a1d151306 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 12:47:38 -0400 Subject: [PATCH 02/16] Adding more sections --- README.md | 2 ++ documentation/ir.md | 17 ++++++++++++++++- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 
e62794af4..e27d0711b 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,8 @@ testing, the `tpch.db` file must be located in the `tests` directory. Additionally, the [`setup_defog.sh`](https://github.com/bodo-ai/PyDough/blob/main/tests/setup_defog.sh) script must be run so that the `defog.db` file is located in the `tests` directory. +To learn more about the implementation of PyDough in order to modify it, read about the various [IRs used within PyDough](https://github.com/bodo-ai/PyDough/blob/main/documentation/ir.md). + ## Running CI Tests To run our CI tests on your PR, you must include the flag `[run CI]` in latest diff --git a/documentation/ir.md b/documentation/ir.md index b45111f0a..af45dcf54 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -17,13 +17,23 @@ This document describes the various IRs used by PyDough to convert raw PyDough c - [Relational Optimization](#relational-conversion) - [SQLGlot AST](#sqlglot-ast) - [SQLGlot Conversion](#sqlglot-conversion) + - [SQLGlot Optimization](#sqlglot-optimization) ## Overview -TODO +The overarching pipeline for converting PyDough Python text into SQL text is as follows: +1. The Python text is intercepted and re-written in a [transformation](#unqualified-transform) that replaces undefined variable names with certain objects, ensuring that when the Python code is executed the result is an [unqualified node](#unqualified-nodes). These unqualified nodes are very minimal in terms of information stored. +2. When `to_sql` or `to_df` is called on an unqualified node, it is first sent through a process called [qualification](#qualification) which converts unqualified nodes into qualified DAG nodes, or [QDAG nodes](#qdag-nodes) for short. These QDAG nodes utilize the metadata in order to correctly associate every aspect of the PyDough logic with the data being analyzed/transformed, and qualification is also where a great deal of the verification of the PyDough code's validity happens. +3.
Next, the QDAG nodes are run through a process called [hybrid conversion](#hybrid-conversion) which restructures the logic into the data structure known as the [hybrid tree](#hybrid-tree) in order to better organize the ways different subtrees of the data are linked together. +4. The hybrid tree is further transformed by the [decorrelation procedure](#hybrid-decorrelation) to remove correlated references created by the hybrid conversion process. +5. The transformed hybrid tree is converted into a [relational tree](#relational-tree) highly reminiscent of the data structure used to represent relational algebra in frameworks such as Apache Calcite. The conversion to this data structure is called [relational conversion](#relational-conversion). +6. Several [optimizations](#relational-optimization) are performed on the relational tree to combine/delete/split/transpose relational nodes, resulting in plans that are better for performance and/or visual quality when converted to SQL. +7. The relational tree is [converted](#sqlglot-conversion) into the [internal AST](#sqlglot-ast) used by the open-source Python library SQLGlot. This library is used for transpiling between different SQL dialects, so it is trivial to convert the SQLGlot AST into SQL text of many different dialects. +8. The SQLGlot AST is [simplified & optimized](#sqlglot-optimization) less so to improve the performance of the SQL when executed, and moreso to improve the visual quality when it is converted into text. +9. A simple method call on the final SQLGlot AST converts it into SQL text of the desired dialect, which can then be executed via a database connector API.
## Unqualified Nodes @@ -84,3 +94,8 @@ TODO ### SQLGlot Conversion TODO + + +### SQLGlot Optimization + +TODO From 2d1b01b2e81e7d624d2a22c8afc06cce99a674e3 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 13:14:41 -0400 Subject: [PATCH 03/16] Adding first mermaid graph --- documentation/ir.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/documentation/ir.md b/documentation/ir.md index af45dcf54..672b4a2fd 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -35,6 +35,33 @@ The overarching pipeline for converting PyDough Python text into SQL text is as 8. The SQLGlot AST is [simplified & optimized](#sqlglot-optimization) less so to improve the performance of the SQL when executed, and moreso to improve the visual quality when it is converted into text. 9. A simple method call on the final SQLGlot AST converts it into SQL text of the desired dialect, which can then be executed via a database connector API. +To recap, the overall pipeline is as follows (if viewing with VSCode preview, must install the mermaid markdown extension): +```mermaid +flowchart LR + A[Python + Text] -->|"Unqualified + Transform"| B(Unqualified + Node) + B -->|Qualification| C[QDAG + Node] + C -->|"Hybrid + Conversion"| D[Hybrid Tree] + D <--> D'{{Hybrid + Decorrelation}} + D --> E[Relational + Tree] + E <--> E'{{Relational + Optimiation}} + F <--> F'{{SQLGlot + Optimization}} + E -->|SQLGlot + Conversion| F[SQLGlot + AST] + F --> G[SQL + Text] +``` + + ## Unqualified Nodes From 9480f6fd171832c7fa7df4bcb3ebff4990f6de9a Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 13:17:49 -0400 Subject: [PATCH 04/16] Reorienting flowchart --- documentation/ir.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 672b4a2fd..aee33d3ed 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -35,9 +35,9 @@ The overarching pipeline for converting PyDough Python 
text into SQL text is as 8. The SQLGlot AST is [simplified & optimized](#sqlglot-optimization) less so to improve the performance of the SQL when executed, and moreso to improve the visual quality when it is converted into text. 9. A simple method call on the final SQLGlot AST converts it into SQL text of the desired dialect, which can then be executed via a database connector API. -To recap, the overall pipeline is as follows (if viewing with VSCode preview, must install the mermaid markdown extension): +To recap, the overall pipeline is as follows (if viewing with VSCode preview, you must install the mermaid markdown extension): ```mermaid -flowchart LR +flowchart TD A[Python Text] -->|"Unqualified Transform"| B(Unqualified From 0b40edbd3bb28e2b820dfa8563ce65e9ea0e1847 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 13:58:23 -0400 Subject: [PATCH 05/16] Adding unqualified node diagram --- documentation/ir.md | 47 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) diff --git a/documentation/ir.md b/documentation/ir.md index aee33d3ed..fbcf6ffba 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -44,6 +44,7 @@ flowchart TD Node) B -->|Qualification| C[QDAG Node] + B'[Metadata] -->C C -->|"Hybrid Conversion"| D[Hybrid Tree] D <--> D'{{Hybrid @@ -67,6 +68,52 @@ flowchart TD TODO +Below is an example of the structure of unqualified nodes for the following PyDough expression: +```py +nation_info = nations.calculate( + region_name=region.name, + nation_name=name, + n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0))) +.ORDER_BY(nation_name.ASC()) +``` + +```mermaid +flowchart RL + A[OrderBy] --> B[Calculate] + B --> C[Access: + 'nations'] + C --> D[ROOT] + B -.-> B1A[Access: + 'name'] + B1A --> B1B[Access: + 'region'] + B1B --> B1C[ROOT] + B -.-> B2A[Access: + 'name'] + B2A --> B2B[ROOT] + B -.-> B3A[Call: + 'COUNT'] + B3A --> B3B[Where] + B3B --> B3C[Access: + 'customers'] + B3C --> B3D[ROOT] + B3B -.-> 
P1[Call: + '<'] + P1 --> P2[Access: + 'acctbal'] + P2 --> P3[ROOT] + P1 --> P4[Literal: + 0] + A -.-> A1[Collation: + Ascending] + A1 --> A2[Access: + 'nation_name' + ] + A2 --> A3[ROOT] +``` + +In the example above, `nation_info` refers to the `OrderBy` node on the right side of the diagram. + ### Unqualified Transform From f9f4a74b76ce4bc18556f73c1c9661d882cbfe37 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Mon, 21 Apr 2025 15:45:38 -0400 Subject: [PATCH 06/16] Hybrid diagrams WIP --- documentation/ir.md | 144 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 141 insertions(+), 3 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index fbcf6ffba..c6b31d59a 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -70,11 +70,11 @@ TODO Below is an example of the structure of unqualified nodes for the following PyDough expression: ```py -nation_info = nations.calculate( +nation_info = nations.CALCULATE( region_name=region.name, nation_name=name, - n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0))) -.ORDER_BY(nation_name.ASC()) + n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) +).ORDER_BY(nation_name.ASC()) ``` ```mermaid @@ -119,6 +119,31 @@ In the example above, `nation_info` refers to the `OrderBy` node on the right si TODO +For an example of the QDAG structure, consider the `nation_info` example earlier from the unqualified nodes: + +```py +nation_info = nations.CALCULATE( + region_name=region.name, + nation_name=name, + n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) +).ORDER_BY(nation_name.ASC()) +``` + +This has the following structure as QDAG nodes: + +``` +──┬─ TPCH + ├─── TableCollection[nations] + ├─┬─ Calculate[region_name=$1.name, nation_name=name, n_customers_in_debt=COUNT($2)] + │ ├─┬─ AccessChild + │ │ └─── SubCollection[region] + │ └─┬─ AccessChild + │ ├─── SubCollection[customers] + │ └─── Where[acctbal < 0] + └─── OrderBy[nation_name.ASC(na_pos='first')] +``` + + ## QDAG Nodes @@ -139,6 
+164,119 @@ TODO TODO +For an example of the Hybrid Tree structure, consider the `nation_info` example earlier from the unqualified nodes: + +```py +nation_info = nations.CALCULATE( + region_name=region.name, + nation_name=name, + n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) +).ORDER_BY(nation_name.ASC()) +``` + +```mermaid +flowchart TD + subgraph L1["(H1)"] + A[Root] + end + subgraph L2["(H2)"] + direction LR + B[Sub-Collection: + nations] + C["Calculate + region_name: $0.name + n_customers_in_debt: DEFAULT_TO($1.agg_0, 0) + nation_name: name + "] + D["OrderBy + nation_name (ascending)"] + B --> C --> D + end + subgraph L4c["($1)"] + subgraph L4["(H4)"] + direction LR + R[Sub-Collection: + customers] + S[Where: + acctbal < 0] + R --> S + end + T["Info: + join_keys: [(key, nation_key)] + agg_keys: [nation_key] + aggs: {'agg_0': COUNT()} + "] + end + subgraph L3c["($0)"] + subgraph L3["(H3)"] + Q[Sub-Colection: + region] + end + U["Info: + join_keys: [(region_key, key)] + "] + end + L1 --> L2 + L2 -.->|Singular| L3c + L2 -.->|Aggregation| L4c +``` + +In this example, the main hybrid tree has two levels H1 and `H2` (`H2` is the value pointed to when the QDAG is converted to a hybrid tree). +- `H1`/`H2` have a parent/successor relationship where H1 is the parent and `H2` is the successor. +- `H1` has no children and its pipeline has a single operation denoting the root level. +- `H2` has two children (`$0` and `$1`) and its pipeline has 3 operations: + - An access to the nations collection (how the step-down from `H1` to `H2` begins) + - A calculate that defines `region_name`, `nation_name`, and `n_customers_in_debt` + - An order-by that sorts by `nation_name` in ascending order. +- `$0` is a singular access, meaning the data from `H2` and `$0` can be directly joined without needing to worry about changes in cardinality. 
They are joined on the condition that the `region_key` term from `H2` equals the `key` term from the bottom subtree of `$0` (which is `H3`). + - The contents of `$0` is just a single tree level `H3` which does not have any children and has a pipeline containing only an access to the `regions` collection. +- `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`. The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H4`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. + - The contents of `$1` is just a single tree level `H4` which does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. + + +Now consider a slightly more complex example below. This PyDough code finds the 5 nations with the largest number of orders made by customers in that nation in the building market segment where the total price of the order is at least double the average of the total prices of **all** orders. 
+```py +global_info = TPCH.CALCULATE(avg_price=AVG(Orders.total_price)) +building_customers = customers.WHERE(mktsegment=="BUILDING") +selected_orders = building_customers.WHERE(total_price >= 2.0 * avg_price) +nation_info = Nations.CALCULATE( + name=name, + n_expensive_building_orders=COUNT(selected_orders), +).TOP_K(5, by=n_expensive_orders.DESC()) +``` + +Below is the Hybrid Tree structure for this query: + +```mermaid +flowchart TD + subgraph L1["(H1)"] + direction LR + A["Root"] + end + + subgraph C1["($0)"] + subgraph L3["(H3)"] + direction LR + B["orders"] + end + end + + subgraph L2["(H2)"] + direction LR + end + + subgraph C2["($0)"] + subgraph L4["(H4)"] + direction LR + C["customers"] + end + subgraph L5["(H5)"] + direction LR + D["orders"] + end + end +``` + ### Hybrid Decorrelation From 7b7c3e571b99d49978bb92982225f83256b86ca1 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Tue, 22 Apr 2025 11:51:33 -0400 Subject: [PATCH 07/16] Working on second hybrid doc --- documentation/ir.md | 45 +++++++++++++++++++++++++++++++++++++++------ 1 file changed, 39 insertions(+), 6 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index c6b31d59a..3f24240fc 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -181,7 +181,7 @@ flowchart TD end subgraph L2["(H2)"] direction LR - B[Sub-Collection: + B[Collection: nations] C["Calculate region_name: $0.name @@ -201,7 +201,7 @@ flowchart TD acctbal < 0] R --> S end - T["Info: + T["Child Info: join_keys: [(key, nation_key)] agg_keys: [nation_key] aggs: {'agg_0': COUNT()} @@ -212,7 +212,7 @@ flowchart TD Q[Sub-Colection: region] end - U["Info: + U["Child Info: join_keys: [(region_key, key)] "] end @@ -252,29 +252,62 @@ flowchart TD subgraph L1["(H1)"] direction LR A["Root"] + A'["Calculate: + avg_price=$0.agg_0"] + A --> A' end subgraph C1["($0)"] subgraph L3["(H3)"] direction LR - B["orders"] + B["Collection: + orders"] end + data0["Child Info: + join_keys: [] + agg_keys: [] + aggs: {'agg_0': 
AVG(total_price)}"] end + L1 -..->|Aggregation| C1 subgraph L2["(H2)"] direction LR + E["Collection: + nations"] + F["Calculate + name: name + n_expensive_orders: DEFAULT_TO($0.agg_0, 0)"] + G["Limit (5) + n_expensive_orders (descending)"] + E --> F --> G end subgraph C2["($0)"] subgraph L4["(H4)"] direction LR - C["customers"] + C["Sub-Collection: + customers"] + H["Where + mktsgment == 'Building'"] + C --> H end subgraph L5["(H5)"] direction LR - D["orders"] + D["Sub-Collection: + orders"] + D'["Where: + total_price >= 2.0 * CORREL(avg_price)"] + D --> D' end + data1["Child Info: + join_keys: [(key, BACK(1).nation_key)] + agg_keys: [BACK(1).nation_key] + aggs: {'agg_0': COUNT()}"] end + + L1 --> L2 + L2 -.->|Aggregation| C2 + L4 --> L5 ``` From 7a45407527840453a062a3d88ac37c812671fc94 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Thu, 24 Apr 2025 13:26:00 -0400 Subject: [PATCH 08/16] Updated examples and added decorrelation --- documentation/ir.md | 372 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 313 insertions(+), 59 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 3f24240fc..038185f7b 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -66,14 +66,26 @@ flowchart TD ## Unqualified Nodes -TODO +The first intermediary representation of PyDough are the unqualified nodes, which carry minimal semantic information and instead carry the syntactic description of the PyDough query. Every unqualified node has a field `_parcel` which is a tuple of values, where the type signature of the tuple depends on which kind of unqualified node it is. Nodes that build on top of another node contain that other node object in their `_parcel`. + +The reason this `_parcel` name is special (and RESERVED) is because whenever getattr is used on an UnqualifiedNode, except for very few names, the logic will instead return a new node referring to an access of a term from the previous node. E.g. 
if `foo` is an unqualified node and I write `foo.bar`, this returns `UnqualifiedAccess(foo, "bar")` which means "access term `bar` from `foo`". + +The same idea of Python functionality building new nodes on top of existing nodes also goes for PyDough operations & Python magic methods. For example: +- `foo.CALCULATE(...)` returns an unqualified node for a calculate operation that points to `foo` in its parcel as the thing it is building on top of. +- `x == y` returns an unqualified node for a function call where the operator is `==` and the operands are `x` and `y`. + +> [!NOTE] +> As a consequence of the `==` behavior, unqualified nodes should never be used as dictionary keys, cached via `@cache`, or compared with `==`, because this will always return a new unqualified node instead of a boolean indicating whether or not they are equal. Instead, it is better to convert the unqualified nodes to strings then check if the strings are equal. All unqualified nodes have a repr implementation that dumps their full structure (so `str` and `repr` should NOT be used as casting functions in PyDough). + +Most magic methods have this sort of behavior, with some notable exceptions. For example, `__len__` is not allowed because any implementation of that magic method in Python must return an integer, so it cannot return a call to the `LENGTH` function in PyDough (unlike `__abs__`, which is implemented so when `abs` is called on an unqualified node it invokes the `ABS` PyDough function). 
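The node-building behavior described above can be illustrated with a toy model. This is a minimal sketch, not PyDough's actual implementation: the class `ToyNode`, the tags `"ROOT"`/`"ACCESS"`/`"CALL"`, and the layout of the tuple are all hypothetical. Only the `_parcel` field name and the general behavior (attribute access, `==`, and `abs` each returning a new node) are taken from the description.

```python
class ToyNode:
    """Hypothetical stand-in for an unqualified node (not PyDough's real class)."""

    def __init__(self, *parcel):
        # `_parcel` is the tuple of values describing this node.
        self._parcel = parcel

    def __getattr__(self, name):
        # Only invoked when normal lookup fails, so `_parcel` itself is safe.
        if name.startswith("__"):
            raise AttributeError(name)
        # Any other name becomes "access term `name` from this node".
        return ToyNode("ACCESS", self, name)

    def __eq__(self, other):
        # Returns a node describing the comparison, NOT a boolean.
        # (Python also makes this toy class unhashable, since `__eq__`
        # is defined without `__hash__`.)
        return ToyNode("CALL", "==", self, other)

    def __abs__(self):
        # `abs(node)` maps to a call of a function named ABS.
        return ToyNode("CALL", "ABS", self)

    def __repr__(self):
        kind, *args = self._parcel
        if kind == "ROOT":
            return "ROOT"
        if kind == "ACCESS":
            parent, name = args
            return f"{parent!r}.{name}"
        op, *operands = args
        return f"{op}({', '.join(map(repr, operands))})"


root = ToyNode("ROOT")
access = root.foo.bar            # builds two nested ACCESS nodes
comparison = root.x == root.y    # builds a CALL node, not True/False
print(repr(access))
print(repr(comparison))
print(repr(abs(root.x)))
# To test structural equality, compare the string dumps instead of using `==`.
print(repr(root.x) == repr(root.x))
```

The last line mirrors the advice in the note: since `==` builds a node, comparing the `repr` strings is the way to check whether two nodes have the same structure.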
+ + Below is an example of the structure of unqualified nodes for the following PyDough expression: ```py nation_info = nations.CALCULATE( region_name=region.name, nation_name=name, - n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) + n_orders_from_debt_customers=COUNT(customers.WHERE(acctbal < 0).orders) ).ORDER_BY(nation_name.ASC()) ``` @@ -93,7 +105,9 @@ flowchart RL B2A --> B2B[ROOT] B -.-> B3A[Call: 'COUNT'] - B3A --> B3B[Where] + B3A --> B3A1[Access: + 'orders'] + B3A1 --> B3B[Where] B3B --> B3C[Access: 'customers'] B3C --> B3D[ROOT] @@ -112,11 +126,30 @@ flowchart RL A2 --> A3[ROOT] ``` -In the example above, `nation_info` refers to the `OrderBy` node on the right side of the diagram. +In the example above, the value of `nation_info` is the `OrderBy` node on the right side of the diagram. ### Unqualified Transform +The unqualified nodes are created by executing Python code after a transformation is applied to modify the text. This modification creates a variable at the top of the code block called `_ROOT` which is an unqualified root, then replaces all undefined variables `x` with `_ROOT.x`. For example, the `nation_info` example earlier would be rewritten into the following, which then gets executed with `graph` passed into the environment. + +```py +_ROOT = UnqualifiedRoot(graph) +nation_info = _ROOT.nations.CALCULATE( + region_name=_ROOT.region.name, + nation_name=_ROOT.name, + n_orders_from_debt_customers=_ROOT.COUNT(_ROOT.customers.WHERE(_ROOT.acctbal < 0).orders) +).ORDER_BY(_ROOT.nation_name.ASC()) +``` + +All of these `_ROOT.x` terms become `UnqualifiedAccess` nodes accessing term `x` and pointing to `_ROOT` as the thing they are accessing from, except for `_ROOT.COUNT`, which becomes a function call because the transformation recognizes this as a function name. + +There are several variations of the logic that invoke this rewrite in different contexts (e.g.
unit tests vs jupyter notebooks) but they all occur in [unqualified_transform.py](../PyDough/unqualified/unqualified_transform.py). + + + +## QDAG Nodes + TODO For an example of the QDAG structure, consider the `nation_info` example earlier from the unqualified nodes: @@ -125,7 +158,7 @@ For an example of the QDAG structure, consider the `nation_info` example earlier nation_info = nations.CALCULATE( region_name=region.name, nation_name=name, - n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) + n_orders_from_debt_customers=COUNT(customers.WHERE(acctbal < 0).orders) ).ORDER_BY(nation_name.ASC()) ``` @@ -139,16 +172,11 @@ This has the following structure as QDAG nodes: │ │ └─── SubCollection[region] │ └─┬─ AccessChild │ ├─── SubCollection[customers] - │ └─── Where[acctbal < 0] + │ └─┬─ Where[acctbal < 0] + │ └─── SubCollection[orders] └─── OrderBy[nation_name.ASC(na_pos='first')] ``` - - -## QDAG Nodes - -TODO - ### Qualification @@ -170,7 +198,7 @@ For an example of the Hybrid Tree structure, consider the `nation_info` example nation_info = nations.CALCULATE( region_name=region.name, nation_name=name, - n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0)) + n_customers_in_debt=COUNT(customers.WHERE(acctbal < 0).orders) ).ORDER_BY(nation_name.ASC()) ``` @@ -194,12 +222,17 @@ flowchart TD end subgraph L4c["($1)"] subgraph L4["(H4)"] - direction LR - R[Sub-Collection: - customers] - S[Where: - acctbal < 0] - R --> S + direction LR + R[Sub-Collection: + customers] + S[Where: + acctbal < 0] + R --> S + end + subgraph L5["(H5)"] + direction LR + V[Sub-Collection: + orders] end T["Child Info: join_keys: [(key, nation_key)] @@ -219,6 +252,7 @@ flowchart TD L1 --> L2 L2 -.->|Singular| L3c L2 -.->|Aggregation| L4c + L4 --> L5 ``` In this example, the main hybrid tree has two levels H1 and `H2` (`H2` is the value pointed to when the QDAG is converted to a hybrid tree). 
@@ -230,19 +264,24 @@ In this example, the main hybrid tree has two levels H1 and `H2` (`H2` is the va - An order-by that sorts by `nation_name` in ascending order. - `$0` is a singular access, meaning the data from `H2` and `$0` can be directly joined without needing to worry about changes in cardinality. They are joined on the condition that the `region_key` term from `H2` equals the `key` term from the bottom subtree of `$0` (which is `H3`). - The contents of `$0` is just a single tree level `H3` which does not have any children and has a pipeline containing only an access to the `regions` collection. -- `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`. The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H4`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. - - The contents of `$1` is just a single tree level `H4` which does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. +- `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`. The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H5`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. + - The contents of `$1` is two tree levels `H4` and `H5`. `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. 
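The shape described in the bullets above (levels with a pipeline, optional children, and an optional successor) can be modeled roughly as follows. This is a hypothetical sketch for illustration only: the class names `Level` and `Child`, their fields, and the string form of each pipeline operation are assumptions, not PyDough's real hybrid tree classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Level:
    """One level of a hybrid tree (hypothetical model)."""
    name: str
    pipeline: list                      # operations applied in order
    children: list = field(default_factory=list)
    successor: Optional["Level"] = None


@dataclass
class Child:
    """Connection from a level to the top level of a child subtree."""
    kind: str                           # "singular" or "aggregation"
    top: Level
    join_keys: list
    aggs: dict = field(default_factory=dict)


# Rebuild the `nation_info` example: H1 -> H2, with children $0 and $1 on H2.
h3 = Level("H3", ["access region"])
h5 = Level("H5", ["access orders"])
h4 = Level("H4", ["access customers", "where acctbal < 0"], successor=h5)
h2 = Level(
    "H2",
    ["access nations", "calculate", "order by nation_name"],
    children=[
        Child("singular", h3, [("region_key", "key")]),
        Child("aggregation", h4, [("key", "nation_key")], {"agg_0": "COUNT()"}),
    ],
)
h1 = Level("H1", ["root"], successor=h2)


def describe(level, indent=0):
    """Return an indented outline of a level, its children, and its successors."""
    pad = "  " * indent
    lines = [f"{pad}{level.name}: {' -> '.join(level.pipeline)}"]
    for i, child in enumerate(level.children):
        lines.append(f"{pad}  ${i} ({child.kind}) joined on {child.join_keys}")
        lines.extend(describe(child.top, indent + 2))
    if level.successor is not None:
        lines.extend(describe(level.successor, indent))
    return lines


outline = describe(h1)
print("\n".join(outline))
```

Keeping the successor link separate from the child connections reflects the distinction drawn above: a successor steps down within the same tree, while a child is a separate subtree that is joined (and possibly aggregated) back onto the level that owns it.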
+ + + +### Hybrid Decorrelation +To understand why de-correlation matters, first consider the slightly more complex PyDough code example below. This PyDough code finds, for each nation in Europe, the total purchase quantity made by customers in that nation from suppliers in the same nation. -Now consider a slightly more complex example below. This PyDough code finds the 5 nations with the largest number of orders made by customers in that nation in the building market segment where the total price of the order is at least double the average of the total prices of **all** orders. ```py -global_info = TPCH.CALCULATE(avg_price=AVG(Orders.total_price)) -building_customers = customers.WHERE(mktsegment=="BUILDING") -selected_orders = building_customers.WHERE(total_price >= 2.0 * avg_price) -nation_info = Nations.CALCULATE( - name=name, - n_expensive_building_orders=COUNT(selected_orders), -).TOP_K(5, by=n_expensive_orders.DESC()) +selected_nations = regions.WHERE( + name == "EUROPE" +).nations.CALCULATE(nation_name=name) +domestic_lines = customers.orders.lines.WHERE(supplier.nation.name == nation_name) +nation_domestic_purchase_info = selected_nations.CALCULATE( + nation_name, + domestic_quantity=SUM(domestic_lines.quantity), +) ``` Below is the Hybrid Tree structure for this query: @@ -252,68 +291,283 @@ flowchart TD subgraph L1["(H1)"] direction LR A["Root"] - A'["Calculate: - avg_price=$0.agg_0"] - A --> A' end - subgraph C1["($0)"] - subgraph L3["(H3)"] + subgraph R1["(H2)"] + direction LR + R["Collection: + regions"] + S["Where + name == 'EUROPE'"] + R --> S + end + + subgraph L2["(H3)"] + direction LR + E["Sub-collection: + nations"] + F["Calculate + nation_name: name"] + G["Calculate + nation_name + domestic_quantity = DEFAULT_TO($0.agg_0, 0)"] + E --> F
--> G + end + + subgraph C2["($0)"] + subgraph L4["(H4)"] + direction LR + C["Sub-Collection: + customers"] + end + subgraph L5["(H5)"] direction LR - B["Collection: + D["Sub-Collection: orders"] end - data0["Child Info: - join_keys: [] - agg_keys: [] - aggs: {'agg_0': AVG(total_price)}"] + subgraph L6["(H6)"] + direction LR + V1["Sub-Collection: + lines"] + V2["Where + $0.name == CORREL(nation_name)"] + V1 --> V2 + end + subgraph C4["($0)"] + direction TB + subgraph H7["(H7)"] + Q1["Sub-Collection: + supplier"] + end + subgraph H8["(H8)"] + Q2["Sub-Collection: + nation"] + end + data2["Child Info: + join_keys: [(supplier_key, BACK(1).key)]"] + end + data1["Child Info: + join_keys: [(key, BACK(1).nation_key)] + agg_keys: [BACK(1).nation_key] + aggs: {'agg_0': SUM(quantity)}"] end - L1 -..->|Aggregation| C1 - subgraph L2["(H2)"] + L1 --> R1 + R1 --> L2 + L2 -.->|Aggregation| C2 + L4 --> L5 + L5 --> L6 + L6 -.->|Singular| C4 + Q1 --> Q2 +``` + +Notice that `H6` contains a filter condition `$0.name == CORREL(nation_name)`. This means that even though `H6` is inside child `$0` of `H3` (not to be confused with child `$0` of `H6`), it still references the `nation_name` field from `H3`. This is called a correlated reference, because it means that child `$0` of `H3` requires information from `H3` in order to be derived, but the whole point is that the data from child `$0` is calculated then joined onto the data from `H3`, so there is a catch-22. + +To fix this, we next run the de-correlation procedure. This will recursively traverse the entire tree and look for hybrid nodes that have a correlated child (where the child type is NOT semi/anti join, since those two do allow correlated references). This procedure will reach tree `H3` and see that its child `$0` is correlated and is an `AGGREGATION` connection, and will therefore de-correlate it.
It does so by copying the entire hybrid tree so far (`H1`, `H2`, and the first two operators from `H3`) and attaching them to the top of child `$0`, above `H4`. This now means that the `CORREL(nation_name)` term in `H6` can be rephrased into a back-reference to the `nation_name` field in the copied version of `H3` (which is 3 levels above `H6`). This does change the join/aggregation keys when connecting `H3` to child `$0`, since now we must connect each unique record of `H3` to the corresponding prepended section of `$0`. This is done by changing the join keys into the uniqueness keys of `H1`/`H2`/`H3`, namely `key` from `regions` (`BACK(1).key` from the perspective of `H3` and `BACK(4).key` from the perspective of `H6`) and `key` from `nations` (`key` from the perspective of `H3` and `BACK(3).key` from the perspective of `H6`). + +The result is the following: + + +```mermaid +flowchart TD + subgraph L1["(H1)"] + direction LR + A["Root"] + end + + subgraph R1["(H2)"] + direction LR + R["Collection: + regions"] + S["Where + name == 'EUROPE'"] + R --> S + end + + subgraph L2["(H3)"] direction LR - E["Collection: + E["Sub-collection: nations"] F["Calculate - name: name - n_expensive_orders: DEFAULT_TO($0.agg_0, 0)"] - G["Limit (5) - n_expensive_orders (descending)"] + nation_name: name"] + G["Calculate + nation_name + domestic_quantity = DEFAULT_TO($0.agg_0, 0)"] E --> F --> G end subgraph C2["($0)"] + subgraph L1'["(H1')"] + direction LR + A'["Root"] + end + + subgraph R1'["(H2')"] + direction LR + R'["Collection: + regions"] + S'["Where + name == 'EUROPE'"] + R' --> S' + end + + subgraph L2'["(H3')"] + direction LR + E'["Sub-collection: + nations"] + F'["Calculate + nation_name: name"] + E' --> F' + end subgraph L4["(H4)"] direction LR C["Sub-Collection: customers"] - H["Where - mktsgment == 'Building'"] - C --> H end subgraph L5["(H5)"] direction LR D["Sub-Collection: orders"] - D'["Where: - total_price >= 2.0 * CORREL(avg_price)"] - D --> D' + end + subgraph
L6["(H6)"] + direction LR + V1["Sub-Collection: + lines"] + V2["Where + $0.name == BACK(3).nation_name"] + V1 --> V2 + end + subgraph C4["($0)"] + direction TB + subgraph H7["(H7)"] + Q1["Sub-Collection: + supplier"] + end + subgraph H8["(H8)"] + Q2["Sub-Collection: + nation"] + end + data2["Child Info: + join_keys: [(supplier_key, BACK(1).key)]"] end data1["Child Info: - join_keys: [(key, BACK(1).nation_key)] - agg_keys: [BACK(1).nation_key] - aggs: {'agg_0': COUNT()}"] + join_keys: [(key, BACK(3).key), (BACK(1).key, BACK(4).key)] + agg_keys: [BACK(3).key, BACK(4).key] + aggs: {'agg_0': SUM(quantity)}"] end - L1 --> L2 + L1 --> R1 + R1 --> L2 + L1' --> R1' + R1' --> L2' + L2' -->L4 L2 -.->|Aggregation| C2 L4 --> L5 + L5 --> L6 + L6 -.->|Singular| C4 + Q1 --> Q2 ``` - -### Hybrid Decorrelation +However, we can go one step further if we make an observation about this hybrid tree. Notice how now, we are computing the logic of `H1`, `H2`, and most of `H3` twice. This must happen if we intend to keep *every* nation, including the ones without any entries of `$0`. However, if we don't intend to keep nations where there are no records of `$0` to join onto, we can modify the original PyDough code as follows: -TODO +```py +selected_nations = regions.WHERE( + name == "EUROPE" +).nations.CALCULATE(nation_name=name) +domestic_lines = customers.orders.lines.WHERE(supplier.nation.name == nation_name) +nation_domestic_purchase_info = selected_nations +.WHERE(HAS(domestic_lines)).CALCULATE( + nation_name, + domestic_quantity=SUM(domestic_lines.quantity), +) +``` + +With this modification, the original hybrid tree changes so the access to child `$0` of `H3` is now an `AGGREGATION_ONLY_MATCH` access, meaning we do an inner join instead of a left join. 
The de-correlation procedure will notice this and, after transforming child `$0` into its de-correlated form, realize it can do an optimization to delete the original `H1`, `H2` and prefix of `H3` since all of the data required from them is accessible inside `$0`. This is done via the hybrid "pull up" node, which specifies that the data comes from the specified child, instead of having existing data that gets joined with data from the child. This will look as follows: + + +```mermaid +flowchart TD + subgraph L2["(H3)"] + direction LR + F["Pull Up ($0) + nation_name = $0.agg_1"] + G["Calculate + nation_name + domestic_quantity = DEFAULT_TO($0.agg_0, 0)"] + F --> G + end + + subgraph C2["($0)"] + subgraph L1'["(H1')"] + direction LR + A'["Root"] + end + + subgraph R1'["(H2')"] + direction LR + R'["Collection: + regions"] + S'["Where + name == 'EUROPE'"] + R' --> S' + end + + subgraph L2'["(H3')"] + direction LR + E'["Sub-collection: + nations"] + F'["Calculate + nation_name: name"] + E' --> F' + end + subgraph L4["(H4)"] + direction LR + C["Sub-Collection: + customers"] + end + subgraph L5["(H5)"] + direction LR + D["Sub-Collection: + orders"] + end + subgraph L6["(H6)"] + direction LR + V1["Sub-Collection: + lines"] + V2["Where + $0.name == BACK(3).nation_name"] + V1 --> V2 + end + subgraph C4["($0)"] + direction TB + subgraph H7["(H7)"] + Q1["Sub-Collection: + supplier"] + end + subgraph H8["(H8)"] + Q2["Sub-Collection: + nation"] + end + data2["Child Info: + join_keys: [(supplier_key, BACK(1).key)]"] + end + data1["Child Info: + agg_keys: [BACK(3).key, BACK(4).key] + aggs: {'agg_0': SUM(quantity) + 'agg_1': ANYTHING(BACK(3).nation_name)}"] + end + + L1' --> R1' + R1' --> L2' + L2' -->L4 + L2 -.->|Aggregation| C2 + L4 --> L5 + L5 --> L6 + L6 -.->|Singular| C4 + Q1 --> Q2 +``` + +The way to interpret this is that the entirety of child `$0` of H3 is evaluated, from `H1'` to `H6`, then grouped on `BACK(3).key` and `BACK(4).key` to calculate `agg_0` and `agg_1` 
(renaming the latter to `nation_name`), and that result is passed along from the PullUp node to the calculate in `H3` which builds on it. ## Relational Tree From f41499482d3bf3a31304c648c3c98ec2dea21c9e Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Thu, 24 Apr 2025 13:26:58 -0400 Subject: [PATCH 09/16] Fixing typo --- documentation/ir.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/documentation/ir.md b/documentation/ir.md index 038185f7b..5a264178d 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -560,7 +560,7 @@ flowchart TD L1' --> R1' R1' --> L2' L2' -->L4 - L2 -.->|Aggregation| C2 + L2 -.->|AggregationOnlyMatch| C2 L4 --> L5 L5 --> L6 L6 -.->|Singular| C4 From 3b23d0f4b733f55dde03017b73683de832746b7f Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Fri, 25 Apr 2025 12:53:05 -0400 Subject: [PATCH 10/16] Adding first relational example --- documentation/ir.md | 62 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) diff --git a/documentation/ir.md b/documentation/ir.md index 5a264178d..ac991d1a7 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -574,6 +574,68 @@ The way to interpret this is that the entirety of child `$0` of H3 is evaluated, TODO +For an example of the relational tree, consider the `nation_info` example from earlier. 
+```py +nation_info = nations.CALCULATE( + region_name=region.name, + nation_name=name, + n_orders_from_debt_customers=COUNT(customers.WHERE(acctbal < 0).orders) +).ORDER_BY(nation_name.ASC()) +``` + +Using the hybrid tree from earlier, the following is how the relational tree created by relational conversion could look: + +```mermaid +flowchart BT + S1["Scan + NATION"] + S2["Scan + REGION"] + J1["Join (left) + CONDITION: t0.region_key == t1.key + name = t0.name + key = t0.key + region_name = t1.name + "] + S3["Scan + CUSTOMER"] + F1["Filter + CONDITION: acctbal < 0"] + S4["Scan + ORDERS"] + J2["Join (inner) + CONDITION: t0.key == t1.cust_key + nation_key = t0.nation_key"] + A1["Aggregate + KEYS: {nation_key} + agg_0 = COUNT(*) + "] + J3["Join (left) + CONDITION: t0.key == t1.nation_key + region_name = t0.region_name + name = t0.name + agg_0 = t1.agg_0"] + P1["Project + region_name = region_name + nation_name = name + n_orders_from_debt_customers = DEFAULT_TO(agg_0, 0)"] + R1["Root + COLUMNS: [region_name, nation_name, n_orders_from_debt_customers] + ORDER: nation_name (ASC) + "] + + S1 --> J1 + S2 --> J1 + S3 --> F1 + F1 --> J2 + S4 --> J2 + J2 --> A1 + J1 --> J3 + A1 --> J3 + J3 --> P1 + P1 --> R1 +``` + ### Relational Conversion From 89cf690c712aa56321f077539a540d0eb5c938f2 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Sun, 27 Apr 2025 23:45:52 -0400 Subject: [PATCH 11/16] Filling out more sections to the IR guide [RUN CI] --- documentation/ir.md | 38 +++++++++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index ac991d1a7..16e71a51c 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -5,6 +5,7 @@ This document describes the various IRs used by PyDough to convert raw PyDough c - [Overview](#overview) +- [PyDough Operators](#pydough-operators) - [Unqualified Nodes](#unqualified-nodes) - [Unqualified Transform](#unqualified-transform) - [QDAG Nodes](#qdag-nodes) @@ -63,6 +64,14 @@ flowchart TD
``` + +## PyDough Operators + +Before jumping into the workflow, it is important to understand how the PyDough operators work since these are daisy-chained throughout the workflow and play a role in nearly every IR. + +TODO: FINISH THIS SECTION. + + ## Unqualified Nodes @@ -77,6 +86,13 @@ The same idea of Python functionality building new nodes on top of existing node > [!NOTE] > As a consequence of the `==` behavior, unqualified nodes should never be used as dictionary keys, cached via `@cache`, or compared with `==`, because this will always return a new unqualified node instead of a boolean indicating whether or not they are equal. Instead, it is better to convert the unqualified nodes to strings then check if the strings are equal. All unqualified nodes have a repr implementation that dumps their full structure (so `str` and `repr` should NOT be used as casting functions in PyDough). +Some examples of how the various unqualified nodes work: +- `UnqualifiedCalculate`: created by calling `x.CALCULATE(...)`. The `_parcel` contains two items: the unqualified node `x`, and a list of `(name, expr)` terms for the arguments to the `CALCULATE` where `name` is the name given to the term and `expr` is the unqualified node for the expression inside the `CALCULATE`. When `x.CALCULATE(...)` is called, every expression that is not passed in via a keyword argument is given a dummy name (e.g. `expr_0`) before so it can be passed in to `UnqualifiedCalculate` with a name. +- `UnqualifiedAccess`: created by accessing a field of any other unqualified node, e.g. `x.y`. The `_parcel` contains two items: the unqualified node `x` and the string `"y"` denoting the field to access from `x`. This represents an access of some property of `x` with a specific name, which could be an expression, collection, or an invalid access. 
+ - Accessing a collection of PyDough is always done in the form `UnqualifiedAccess(root, "collection_name")` where `root` is an `UnqualifiedRoot` object, and `"collection_name"` is the name of one of the collections in the graph. + - Terms inside of a `CALCULATE` or similar expression are similarly phrased. For example, if doing `nations.CALCULATE(nation_name=name, region_name=region.name)`, the term for `nation_name` is `UnqualifiedAccess(root, "name")`, and the term for `region_name` is `UnqualifiedAccess(UnqualifiedAccess(root, "region"), "name")`. +- `UnqualifiedOperator`: created to represent a function operation. When a function is invoked in root level, e.g. `root.COUNT`, instead of creating an `UnqualifiedAccess`, PyDough can detect that this child of the root is a function name so it will return `UnqualifiedOperator("COUNT")`. This is possible because the `UnqualifiedRoot` internally stores information about the available function names. When this `UnqualifiedOperator` object is called as a function, e.g. `UnqualifiedOperator("COUNT")(...)`, it is transformed into a function call (either `UnqualifiedOperation` or `UnqualifiedWindow` depending on what kind). + Most magic methods have this sort of behavior, with some notable exceptions. For example, `__len__` is not allowed because any implementation of that magic method in Python must return an integer, so it cannot return a call to the `LENGTH` function in PyDough (unlike `__abs__`, which is implemented so when `abs` is called on an unqualified node it invokes the `ABS` PyDough function). @@ -111,7 +127,7 @@ flowchart RL B3B --> B3C[Access: 'customers'] B3C --> B3D[ROOT] - B3B -.-> P1[Call: + B3B -.-> P1[BinOp: '<'] P1 --> P2[Access: 'acctbal'] @@ -150,7 +166,7 @@ There are several variations of the logic that invoke this rewrite in different ## QDAG Nodes -TODO +TODO: FINISH THIS SECTION. 
For an example of the QDAG structure, consider the `nation_info` example earlier from the unqualified nodes: @@ -180,17 +196,17 @@ This has the following structure as QDAG nodes: ### Qualification -TODO +TODO: FINISH THIS SECTION. ## Hybrid Tree -TODO +TODO: FINISH THIS SECTION. ### Hybrid Conversion -TODO +TODO: FINISH THIS SECTION. For an example of the Hybrid Tree structure, consider the `nation_info` example earlier from the unqualified nodes: @@ -572,7 +588,7 @@ The way to interpret this is that the entirety of child `$0` of H3 is evaluated, ## Relational Tree -TODO +TODO: FINISH THIS SECTION. For an example of the relational tree, consider the `nation_info` example from earlier. ```py @@ -639,24 +655,24 @@ flowchart BT ### Relational Conversion -TODO +TODO: FINISH THIS SECTION. ### Relational Optimization -TODO +TODO: FINISH THIS SECTION. ## SQLGlot AST -TODO +TODO: FINISH THIS SECTION. ### SQLGlot Conversion -TODO +TODO: FINISH THIS SECTION. ### SQLGlot Optimization -TODO +TODO: FINISH THIS SECTION. From 5ae61fa07d7bab6125ed803c30c8a0133a2bd801 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Sun, 27 Apr 2025 23:50:33 -0400 Subject: [PATCH 12/16] Adding big important boxes around the TODO sections --- documentation/ir.md | 38 +++++++++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 16e71a51c..784d8b54c 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -69,7 +69,8 @@ flowchart TD Before jumping into the workflow, it is important to understand how the PyDough operators work since these are daisy-chained throughout the workflow and play a role in nearly every IR. -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. @@ -166,7 +167,8 @@ There are several variations of the logic that invoke this rewrite in different ## QDAG Nodes -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. 
For an example of the QDAG structure, consider the `nation_info` example earlier from the unqualified nodes: @@ -196,17 +198,20 @@ This has the following structure as QDAG nodes: ### Qualification -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ## Hybrid Tree -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ### Hybrid Conversion -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. For an example of the Hybrid Tree structure, consider the `nation_info` example earlier from the unqualified nodes: @@ -588,7 +593,8 @@ The way to interpret this is that the entirety of child `$0` of H3 is evaluated, ## Relational Tree -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. For an example of the relational tree, consider the `nation_info` example from earlier. ```py @@ -652,27 +658,37 @@ flowchart BT P1 --> R1 ``` +The way to read this is as follows: + +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. + ### Relational Conversion -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ### Relational Optimization -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ## SQLGlot AST -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ### SQLGlot Conversion -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. ### SQLGlot Optimization -TODO: FINISH THIS SECTION. +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. From 246d47c2a9efa1b3355a11752d0ef04948383542 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Sun, 27 Apr 2025 23:56:41 -0400 Subject: [PATCH 13/16] Moving around seciton --- documentation/ir.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 784d8b54c..2d931022b 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -207,12 +207,6 @@ This has the following structure as QDAG nodes: > [!IMPORTANT] > TODO: FINISH THIS SECTION. 
- -### Hybrid Conversion - -> [!IMPORTANT] -> TODO: FINISH THIS SECTION. - For an example of the Hybrid Tree structure, consider the `nation_info` example earlier from the unqualified nodes: ```py @@ -288,6 +282,12 @@ In this example, the main hybrid tree has two levels H1 and `H2` (`H2` is the va - `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`. The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H5`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. - The contents of `$1` is two tree levels `H4` and `H5`. `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. + +### Hybrid Conversion + +> [!IMPORTANT] +> TODO: FINISH THIS SECTION. + ### Hybrid Decorrelation From 513c176e2694e033eea7351b9ed53d5ae7b3ea3d Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Sun, 27 Apr 2025 23:57:55 -0400 Subject: [PATCH 14/16] Adding missing backticks --- documentation/ir.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 2d931022b..e4b4fa026 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -270,8 +270,8 @@ flowchart TD L4 --> L5 ``` -In this example, the main hybrid tree has two levels H1 and `H2` (`H2` is the value pointed to when the QDAG is converted to a hybrid tree). -- `H1`/`H2` have a parent/successor relationship where H1 is the parent and `H2` is the successor. +In this example, the main hybrid tree has two levels `H1` and `H2` (`H2` is the value pointed to when the QDAG is converted to a hybrid tree). 
+- `H1`/`H2` have a parent/successor relationship where `H1` is the parent and `H2` is the successor. - `H1` has no children and its pipeline has a single operation denoting the root level. - `H2` has two children (`$0` and `$1`) and its pipeline has 3 operations: - An access to the nations collection (how the step-down from `H1` to `H2` begins) From 961fc25c46b4b6de102291eb501d5c69d7eef949 Mon Sep 17 00:00:00 2001 From: knassre-bodo <105652923+knassre-bodo@users.noreply.github.com> Date: Tue, 6 May 2025 12:16:08 -0400 Subject: [PATCH 15/16] Apply suggestions from code review Adding Hadia's revisions Co-authored-by: Hadia Ahmed --- documentation/ir.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index e4b4fa026..2dd817ef9 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -26,11 +26,11 @@ This document describes the various IRs used by PyDough to convert raw PyDough c ## Overview The overarching pipeline for converting PyDough Python text into SQL text is as follows: -1. The Python text is intercepted and re-written in a [transformation](#unqualified-transform) that replaces undefined variable names with certain objects, ensuring that when the Python code is executed the result is an [unqualified node](#unqualified-nodes). These unqualified nodes are very minimal in terms of information stored, +1. The Python text is intercepted and re-written in a [transformation](#unqualified-transform) that replaces undefined variable names with certain objects, ensuring that when the Python code is executed the result is an [unqualified node](#unqualified-nodes). These unqualified nodes are very minimal in terms of information stored. 2. When `to_sql` or `to_df` is called on this unqualified nodes, it is first sent through a process called [qualification](#qualification) which converts unqualified nodes into qualified DAG nodes, or [QDAG nodes](#qdag) for short. 
These QDAG nodes utilize the metadata in order to correctly associate every aspect of the PyDough logic with the data being analyzed/transformed, and is also where a great deal of the verification of the PyDough code's validity happens. -3. Next, the QDAG nodes are run through a process called [hybrid conversion](#hybrid-conversion) which restructures the logic into the datastructure known as the [hybrid tree](#hybrid-tree) in order to better organize the types of ways different subtrees of the data are linked together. +3. Next, the QDAG nodes are run through a process called [hybrid conversion](#hybrid-conversion) which restructures the logic into a data structure known as the [hybrid tree](#hybrid-tree) to better organize how different types of subtrees are linked together. 4. The hybrid tree is further transformed by the [decorrelation procedure](#hybrid-decorrelation) to remove correlated references created by the hybrid conversion process. -5. The transformed hybrid tree is converted into a [relational tree](#relational-tree) highly reminiscent of the datastructure used to represent relational algebra in frameworks such as Apache Calcite. The conversion to this datastructure is called [relational conversion](#relational-conversion). +5. The transformed hybrid tree is converted into a [relational tree](#relational-tree) highly reminiscent of the data structure used to represent relational algebra in frameworks such as Apache Calcite. The conversion to this data structure is called [relational conversion](#relational-conversion). 6. Several [optimizations](#relational-optimization) are performed on the relational tree to combine/delete/split/transpose relational nodes, resulting in plans that are better for performance and/or visual quality when converted to SQL. 7. The Relational tree is [converted](#sqlglot-conversion) into the [internal AST](#sqlglot-ast) used by the open source Python library SQLGlot. 
This library is used for transpiling between different SQL dialects, so it is trivial to convert the SQLglot AST into SQL text of many different dialects. 8. The SQLGlot AST is [simplified & optimized](#sqlglot-optimization) less so to improve the performance of the SQL when executed, and moreso to improve the visual quality when it is converted into text. @@ -53,7 +53,7 @@ flowchart TD D --> E[Relational Tree] E <--> E'{{Relational - Optimiation}} + Optimization}} F <--> F'{{SQLGlot Optimization}} E -->|SQLGlot @@ -81,14 +81,14 @@ The first intermediary representation of PyDough are the unqualified nodes, whic The reason this `_parcel` name is special (and RESERVED) is because whenever getattr is used on an UnqualifiedNode, except for very few names, the logic will instead return a new node referring to an access of a term from the previous node. E.g. if `foo` is an unqualified node and I write `foo.bar`, this returns `UnqualifiedAccess(foo, "bar")` which means "access term `bar` from `foo`". The same idea of Python functionality building new nodes on top of existing nodes also goes for PyDough operations & Python magic methods. For example: -- `foo.CALCULATE(...)` returns an unqualified node for a calculate operation that points to `foo` in its parcel as the thing it is building on top of. +- `foo.CALCULATE(...)` returns an unqualified node for a calculate operation that points to `foo` in its `_parcel` as the thing it is building on top of. - `x == y` returns an unqualified node for a function call where the operator is `==` and the operands are `x` and `y`. > [!NOTE] > As a consequence of the `==` behavior, unqualified nodes should never be used as dictionary keys, cached via `@cache`, or compared with `==`, because this will always return a new unqualified node instead of a boolean indicating whether or not they are equal. Instead, it is better to convert the unqualified nodes to strings then check if the strings are equal. 
All unqualified nodes have a repr implementation that dumps their full structure (so `str` and `repr` should NOT be used as casting functions in PyDough). Some examples of how the various unqualified nodes work: -- `UnqualifiedCalculate`: created by calling `x.CALCULATE(...)`. The `_parcel` contains two items: the unqualified node `x`, and a list of `(name, expr)` terms for the arguments to the `CALCULATE` where `name` is the name given to the term and `expr` is the unqualified node for the expression inside the `CALCULATE`. When `x.CALCULATE(...)` is called, every expression that is not passed in via a keyword argument is given a dummy name (e.g. `expr_0`) before so it can be passed in to `UnqualifiedCalculate` with a name. +- `UnqualifiedCalculate`: created by calling `x.CALCULATE(...)`. The `_parcel` contains two items: the unqualified node `x`, and a list of `(name, expr)` terms for the arguments to the `CALCULATE` where `name` is the name given to the term and `expr` is the unqualified node for the expression inside the `CALCULATE`. When `x.CALCULATE(...)` is called, every expression that is not passed in via a keyword argument is given a dummy name (e.g. `expr_0`) so it can be passed in to `UnqualifiedCalculate` with a name. - `UnqualifiedAccess`: created by accessing a field of any other unqualified node, e.g. `x.y`. The `_parcel` contains two items: the unqualified node `x` and the string `"y"` denoting the field to access from `x`. This represents an access of some property of `x` with a specific name, which could be an expression, collection, or an invalid access. - Accessing a collection of PyDough is always done in the form `UnqualifiedAccess(root, "collection_name")` where `root` is an `UnqualifiedRoot` object, and `"collection_name"` is the name of one of the collections in the graph. - Terms inside of a `CALCULATE` or similar expression are similarly phrased. 
For example, if doing `nations.CALCULATE(nation_name=name, region_name=region.name)`, the term for `nation_name` is `UnqualifiedAccess(root, "name")`, and the term for `region_name` is `UnqualifiedAccess(UnqualifiedAccess(root, "region"), "name")`. @@ -148,7 +148,7 @@ In the example above, the value of the `nation_info` is the `OrderBy` node on th ### Unqualified Transform -The unqualified nodes are created by executing Python code after a transformation is applied to modify the text. This modification creates a variable at the top of the code block called `_ROOT` which is an unqualified root, then replaces all undefined variables `x` with `_ROOT.x`. For example, the `nation_info` example earlier would be rewritten into the following, which then gets executed with `graph` passed in to the environment. +The unqualified nodes are created by executing Python code after a transformation is applied to modify the text. This transformation inserts a `_ROOT` variable at the top of the code block to represent the unqualified root. It then replaces all undefined variables `x` with `_ROOT.x`. For example, the `nation_info` example earlier would be rewritten into the following, which then gets executed with `graph` passed in to the environment. ``` _ROOT = UnqualifiedRoot(graph) @@ -280,7 +280,7 @@ In this example, the main hybrid tree has two levels `H1` and `H2` (`H2` is the - `$0` is a singular access, meaning the data from `H2` and `$0` can be directly joined without needing to worry about changes in cardinality. They are joined on the condition that the `region_key` term from `H2` equals the `key` term from the bottom subtree of `$0` (which is `H3`). - The contents of `$0` is just a single tree level `H3` which does not have any children and has a pipeline containing only an access to the `regions` collection. - `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`.
The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H5`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. - - The contents of `$1` is two tree levels `H4` and `H5`. `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. + - The contents of `$1` is two levels `H4` and `H5`. `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. ### Hybrid Conversion @@ -292,7 +292,7 @@ In this example, the main hybrid tree has two levels `H1` and `H2` (`H2` is the ### Hybrid Decorrelation -To understand why de-correlation matters, first consider the slightly more complex PyDough code example below. This PyDough code finds, for each nation in Europe, the total purchase quantity made by customers in that nation from suppliers in the same nation. 5 nations with the largest number of orders made by customers in that nation in the building market segment where the total price of the order is at least double the average of the total prices of **all** orders. +To understand why de-correlation matters, first consider the slightly more complex PyDough code example below. This PyDough code finds, for each nation in Europe, the total quantity of purchases made by customers from suppliers in the same nation. 
```py selected_nations = regions.WHERE( @@ -382,9 +382,9 @@ flowchart TD Q1 --> Q2 ``` -Notice that `H6` contains a filter condition `$0.name == CORREL(nation_name)`. This means that even though `H6` is inside child `$0` of `H3` (not to be confused with child `$0` of `H6`), it still references the `nation_name` field from `H3`. This is a called a correlated reference, because it means that child `$0` of `H3` requires information from `H3` in order to be derived, but the whole point is that the data from child `$0` is calculated then joined onto he data from `H3`, so there is a catch-22. +Notice that `H6` contains a filter condition `$0.name == CORREL(nation_name)`. Although `H6` is nested within child `$0` of `H3` (not to be confused with child `$0` of `H6`), it still references the `nation_name` field from `H3`. This is called a correlated reference: child `$0` of `H3` depends on `H3`'s data (`nation_name`) to compute its result. But since the data of child `$0` is meant to be computed before it is joined with `H3`, this creates a catch-22. -To fix this, we next run the de-correlation procedure. This will recursively traverse the entire tree and look for hybrid nodes that have a correlated child (where the child type is NOT semi/anti join, since those two do allow correlated references). This procedure will reach tree `H3` and see that its child `$0` is correlated and is an `AGGREGATION` connection, and will therefore de-correlate it. It does so by copying the entire hybrid tree so far (`H1`, `H2`, and the first two operators from `H3`) and attaching them to the top of child `$0`, above `H4`. This now means that the `CORREL(nation_name)` term in `H6` can be rephrased into a back-reference to the `nation_name` field in the copied version of `H3` (which is 3 levels above `H6`).
This does change the join/aggregation keys when connecting H3 to child `$0`, since now we must connect each unique record of `H3` to the corresponding prepended section of `$0`. This is done by changing the join keys into the uniqueness keys of `H1`/`H2`/`H3`, which is aka `key` from `regions` (`BACK(1).key` from the perspective of `H3` and `BACK(4).key` from the perspective of `H6`) and `key` from `nations` (`key` from the perspective of `H3` and `BACK(3).key` from the perspective of `H6`). +To fix this, we next run the de-correlation procedure. This procedure recursively traverses the entire tree, looking for hybrid nodes with correlated children (excluding semi and anti joins, which do allow correlated references). This procedure will reach tree `H3` and see that its child `$0` is correlated and is an `AGGREGATION` connection, and will therefore de-correlate it. To do this, it copies the hybrid tree built so far (`H1`, `H2`, and the first two operators of `H3`) and attaches them above child `$0`, specifically before `H4`. This now means that the `CORREL(nation_name)` term in `H6` can be rephrased into a back-reference to the `nation_name` field in the copied version of `H3` (which is 3 levels above `H6`). This changes the join and aggregation keys used when connecting `H3` to child `$0`, as each unique record from `H3` must now align with the prepended section of `$0`. This is done by changing the join keys into the uniqueness keys of `H1`/`H2`/`H3`, namely the `key` from `regions` (accessed as `BACK(1).key` from `H3` and `BACK(4).key` from `H6`) and the `key` from `nations` (`key` from `H3` and `BACK(3).key` from `H6`). The result is the following: @@ -489,7 +489,7 @@ flowchart TD Q1 --> Q2 ``` -However, we can go one step further if we make an observation about this hybrid tree. Notice how now, we are computing the logic of `H1`, `H2`, and most of `H3` twice. This must happen if we intend to keep *every* nation, including the ones without any entries of `$0`.
However, if we don't intend to keep nations where there are no records of `$0` to join onto, we can modify the original PyDough code as follows: +However, we can go one step further if we make an observation about this hybrid tree. Notice how now, we are computing the logic of `H1`, `H2`, and most of `H3` twice. This must happen if we intend to keep *every* nation, including the ones without any entries of `$0`. However, if we don't intend to keep nations that lack corresponding `$0` records to join with, we can modify the original PyDough code as follows: ```py selected_nations = regions.WHERE( @@ -503,7 +503,7 @@ nation_domestic_purchase_info = selected_nations ) ``` -With this modification, the original hybrid tree changes so the access to child `$0` of `H3` is now an `AGGREGATION_ONLY_MATCH` access, meaning we do an inner join instead of a left join. The de-correlation procedure will notice this and, after transforming child `$0` into its de-correlated form, realize it can do an optimization to delete the original `H1`, `H2` and prefix of `H3` since all of the data required from them is accessible inside `$0`. This is done via the hybrid "pull up" node, which specifies that the data comes from the specified child, instead of having existing data that gets joined with data from the child. This will look as follows: +With this modification, the original hybrid tree is modified so that access to child `$0` of `H3` is now an `AGGREGATION_ONLY_MATCH` access, meaning we do an inner join instead of a left join. The de-correlation procedure will notice this and, after transforming child `$0` into its de-correlated form, recognize an opportunity to optimize by removing the original `H1`, `H2` and prefix of `H3` since all of the data required from them is accessible inside `$0`. This is done via the hybrid "pull up" node, which specifies that the data comes from the specified child, instead of having existing data that gets joined with data from the child. 
This will look as follows: ```mermaid @@ -588,7 +588,7 @@ flowchart TD Q1 --> Q2 ``` -The way to interpret this is that the entirety of child `$0` of H3 is evaluated, from `H1'` to `H6`, then grouped on `BACK(3).key` and `BACK(4).key` to calculate `agg_0` and `agg_1` (renaming the latter to `nation_name`), and that result is passed along from the PullUp node to the calculate in `H3` which builds on it. +This means the entire child `$0` subtree of `H3` is evaluated, from `H1'` to `H6`, then grouped on `BACK(3).key` and `BACK(4).key` to calculate `agg_0` and `agg_1` (renaming the latter to `nation_name`), and that result is passed along from the PullUp node to the calculate in `H3` which builds on it. ## Relational Tree @@ -596,7 +596,7 @@ The way to interpret this is that the entirety of child `$0` of H3 is evaluated, > [!IMPORTANT] > TODO: FINISH THIS SECTION. -For an example of the relational tree, consider the `nation_info` example from earlier. +To illustrate the relational tree, consider the `nation_info` example from earlier. ```py nation_info = nations.CALCULATE( region_name=region.name, @@ -605,7 +605,7 @@ nation_info = nations.CALCULATE( ).ORDER_BY(nation_name.ASC()) ``` -Using the hybrid tree from earlier, the following is how the relational tree created by relational conversion could look: +Using the hybrid tree from earlier, here’s how the relational tree created by the conversion would look: ```mermaid flowchart BT From 13e82a81e7384f452aafdf835f99daac1fc75360 Mon Sep 17 00:00:00 2001 From: knassre-bodo Date: Tue, 6 May 2025 12:27:06 -0400 Subject: [PATCH 16/16] Further revisions --- documentation/ir.md | 49 ++++++++++++++++++++++++++------------------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/documentation/ir.md b/documentation/ir.md index 2dd817ef9..b82e35fac 100644 --- a/documentation/ir.md +++ b/documentation/ir.md @@ -30,7 +30,7 @@ The overarching pipeline for converting PyDough Python text into SQL text is as 2. 
When `to_sql` or `to_df` is called on this unqualified nodes, it is first sent through a process called [qualification](#qualification) which converts unqualified nodes into qualified DAG nodes, or [QDAG nodes](#qdag) for short. These QDAG nodes utilize the metadata in order to correctly associate every aspect of the PyDough logic with the data being analyzed/transformed, and is also where a great deal of the verification of the PyDough code's validity happens. 3. Next, the QDAG nodes are run through a process called [hybrid conversion](#hybrid-conversion) which restructures the logic into a data structure known as the [hybrid tree](#hybrid-tree) to better organize how different types of subtrees are linked together. 4. The hybrid tree is further transformed by the [decorrelation procedure](#hybrid-decorrelation) to remove correlated references created by the hybrid conversion process. -5. The transformed hybrid tree is converted into a [relational tree](#relational-tree) highly reminiscent of the data structure used to represent relational algebra in frameworks such as Apache Calcite. The conversion to this data structure is called [relational conversion](#relational-conversion). +5. The transformed hybrid tree is converted into a [relational tree](#relational-tree) highly reminiscent of the data structure used to represent relational algebra in frameworks such as [Apache Calcite](https://calcite.apache.org/docs/algebra.html). The conversion to this data structure is called [relational conversion](#relational-conversion). 6. Several [optimizations](#relational-optimization) are performed on the relational tree to combine/delete/split/transpose relational nodes, resulting in plans that are better for performance and/or visual quality when converted to SQL. 7. The Relational tree is [converted](#sqlglot-conversion) into the [internal AST](#sqlglot-ast) used by the open source Python library SQLGlot. 
This library is used for transpiling between different SQL dialects, so it is trivial to convert the SQLglot AST into SQL text of many different dialects. 8. The SQLGlot AST is [simplified & optimized](#sqlglot-optimization) less so to improve the performance of the SQL when executed, and moreso to improve the visual quality when it is converted into text. @@ -40,27 +40,29 @@ To recap, the overall pipeline is as follows (if viewing with VSCode preview, yo ```mermaid flowchart TD A[Python - Text] -->|"Unqualified + Text] -->|"(1) Unqualified Transform"| B(Unqualified Node) - B -->|Qualification| C[QDAG + B -->|"(2) Qualification"| C[QDAG Node] B'[Metadata] -->C - C -->|"Hybrid + C -->|"(3) Hybrid Conversion"| D[Hybrid Tree] - D <--> D'{{Hybrid - Decorrelation}} - D --> E[Relational - Tree] - E <--> E'{{Relational - Optimization}} - F <--> F'{{SQLGlot - Optimization}} - E -->|SQLGlot - Conversion| F[SQLGlot + D <--> D'{{"(4) Hybrid + Decorrelation"}} + D -->|"(5) Relational + Conversion"| E["Relational + Tree"] + E <--> E'{{"(6) Relational + Optimization"}} + F <--> F'{{"(8) SQLGlot + Optimization"}} + E -->|"(7) SQLGlot + Conversion"| F[SQLGlot AST] - F --> G[SQL - Text] + F -->|"(9) SQLGlot + API"| G["SQL + Text"] ``` @@ -185,7 +187,7 @@ This has the following structure as QDAG nodes: ``` ──┬─ TPCH ├─── TableCollection[nations] - ├─┬─ Calculate[region_name=$1.name, nation_name=name, n_customers_in_debt=COUNT($2)] + ├─┬─ Calculate[region_name=$1.name, nation_name=name, n_orders_from_debt_customers=COUNT($2)] │ ├─┬─ AccessChild │ │ └─── SubCollection[region] │ └─┬─ AccessChild @@ -277,10 +279,15 @@ In this example, the main hybrid tree has two levels `H1` and `H2` (`H2` is the - An access to the nations collection (how the step-down from `H1` to `H2` begins) - A calculate that defines `region_name`, `nation_name`, and `n_customers_in_debt` - An order-by that sorts by `nation_name` in ascending order. 
-- `$0` is a singular access, meaning the data from `H2` and `$0` can be directly joined without needing to worry about changes in cardinality. They are joined on the condition that the `region_key` term from `H2` equals the `key` term from the bottom subtree of `$0` (which is `H3`). +- `$0` is a singular access, meaning the data from `H2` and `$0` can be directly joined without needing to worry about changes in cardinality. - The contents of `$0` is just a single tree level `H3` which does not have any children and has a pipeline containing only an access to the `regions` collection. -- `$1` is an aggregation access, meaning the data from `$1` must first be aggregated before it is joined with `H2`. The aggregation is done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H5`) and computes the term `agg_0` as `COUNT()`. Then, the result is joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. - - The contents of `$1` is two levels `H4` and `H5`. `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. + - The data from `$0` will be joined back onto `H2` on the condition that the `region_key` term from `H2` equals the `key` term from the bottom subtree of `$0` (which is `H3`). +- `$1` is an aggregation access of `customers.orders`, meaning the data from `$1` must first be aggregated before it is joined with `H2`. + - The contents of `$1` is two levels `H4` and `H5`. + - `H4` does not have any children and has a pipeline containing an access to the `customers` collection followed by a filter on the condition that `acctbal < 0`. 
+ - `H5` does not have any children and only contains a single operation accessing the `orders` sub-collection of the parent level. + - The aggregation will be done by grouping on the `nation_key` field of the bottom subtree of `$1` (which is `H5`) and computing the term `agg_0` as `COUNT()`. + - After aggregation, the result will be joined with `H2` on the condition that the `key` term from `H2` equals the `nation_key` field just used to aggregate. ### Hybrid Conversion @@ -292,7 +299,7 @@ In this example, the main hybrid tree has two levels `H1` and `H2` (`H2` is the ### Hybrid Decorrelation -To understand why de-correlation matters, first consider the slightly more complex PyDough code example below. This PyDough code finds, for each nation in Europe, the total quantity of purchases made by customers from suppliers in the same nation. 5 nations with the largest number of orders made by customers in that nation in the building market segment where the total price of the order is at least double the average of the total prices of **all** orders. +To understand why de-correlation matters, first consider the slightly more complex PyDough code example below. This PyDough code finds, for each nation in Europe, the total quantity of purchases made by customers from suppliers in the same nation. ```py selected_nations = regions.WHERE(