-
Notifications
You must be signed in to change notification settings - Fork 190
feat: add AggregateRel compatibility for grouping set #890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
e0104c5
4a2a735
2949a95
f8f32d7
028b6d3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -350,6 +350,8 @@ message AggregateRel { | |
| // `Grouping.expression_references`. | ||
| repeated Expression grouping_expressions = 5; | ||
|
|
||
| Compatibility compatibility = 6; | ||
|
|
||
| substrait.extensions.AdvancedExtension advanced_extension = 10; | ||
|
|
||
| message Grouping { | ||
|
|
@@ -370,6 +372,26 @@ message AggregateRel { | |
| // Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization | ||
| Expression filter = 2; | ||
| } | ||
|
|
||
| // Various modes of operations of AggregateRel to capture different behaviors across systems. | ||
| message Compatibility { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm a bit leery of having many submessages with similar names as that will complicate parsing. But having one compatibility message with many unused parts is similarly unsatisfying. The options behavior of functions is probably the most appropriate. |
||
| // Defines the behavior of AggregateRel when there is an empty grouping set in the `groupings` | ||
| // and the input is empty. An empty grouping set is an aggregation over the entire input and some | ||
| // systems implement different behaviors when the input is empty. | ||
| enum EmptyGroupingSetOnEmptyInput { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My natural inclination is to give an enum a name that captures:
but here that would give us something like |
||
| // Default is `EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS`. | ||
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_UNSPECIFIED = 0; | ||
| // If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row | ||
| // for the empty grouping set on empty input (i.e., explicit grouping over the entire input). | ||
| // For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we can use |
||
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS = 1; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. minor |
||
| // The AggregateRel yields no row for the empty grouping set on empty input (i.e., grouping over the rows). | ||
| // For example, AggregateRel[(), COUNT] yields no record when the input is empty. | ||
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_NO_ROWS = 2; | ||
| } | ||
|
|
||
| EmptyGroupingSetOnEmptyInput empty_grouping_set_on_empty_input = 1; | ||
| } | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm off two minds on declaring settings like this. On one hand, having single message with a bunch of boolean behavioral toggles makes it easy to add new toggles, because we can just add a new field. We would need to make sure that the default unset value matched the default behaviour when we do this. Generally though, I'm wary of boolean toggles because IMO they can be hard to understand, and are limited to switching between 2 different behaviors. I personally lean towards the enum style of setting toggles because we can indicate the expected behavior with the name, we can declare more than 2 types of behaviors, we can add behaviors easily if we discover more weird system behaviour and we can explicitly declare the unset values as unspecified.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When we add these kinds of compatibility toggles, we should also document them in the website. It would be good to include the systems this is useful for, as well as example queries to trigger the behaviour in the docs for context as well.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I plan to add the documentation -- the weeks have been crazy, can't find time... perhaps later this week or next week, I'll update the PR with the documentation.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vbarua I agree with the enum if there are more than two options but I don't see this particular one having the third options. The message does not have to be a collection of booleans. If a field has more than two options in the future, oh well, that field should be enum. If you do want to see enum, please let me know!
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Even with 2 options, I still have a strong preference for the enum version as it also makes it possible to check if the compatibility option has been set explicitly or not. I also find it's easier to document behaviour via the enum name. If I see values like
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vbarua In the proto or doc?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, if this is a theme, we are effectively banning the usage of boolean in the Substrait, which is fine as this also happens in normal programming language -- passing 0 or boolean as function argument vs. enum. If we agree on, we should go over the spec and clean up all boolean and replace with enums in 1.0 perhaps.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vbarua changed to enum. PTAL!
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I would say heavily discouraging 😅 Aside from
I was thinking in the doc. This is surprisingly weird behaviour. I was testing with something like SELECT COUNT(*), COUNT(id), SUM(id), STRING_AGG(s, ',')
FROM test;on db-fiddle which outputs Trino does So it looks like the behaviour on empty inputs is that the count functions return 0, and all other functions return null.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. COUNT is special. All other aggregates are supposed to yield NULL over empty input. |
||
| } | ||
|
|
||
| // ConsistentPartitionWindowRel provides the ability to perform calculations across sets of rows | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -407,6 +407,27 @@ If at least one grouping expression is present, the aggregation is allowed to no | |
| | Per Grouping Set | A list of expression grouping that the aggregation measured should be calculated for. | Optional. | | ||
| | Measures | A list of one or more aggregate expressions along with an optional filter. | Optional, required if no grouping sets. | | ||
|
|
||
| ### Aggregate Compatibility | ||
|
|
||
| The aggregate operation is one of the most complex operations in the spec. Although implementations mostly agree on behaviors, there may be gaps in corner cases. Those behavioral differences are captured in compatibility. | ||
|
|
||
| NOTE: The compatibility is meant to address gaps in the core implementation of aggregation such as grouping sets. For custom aggregations, consider using aggregate extension functions. If you want to introduce a new compatibility mode, reach out Substrait PMC to discuss. | ||
|
|
||
| #### Empty Grouping Set on Empty Input | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should document what the encoding of an empty grouping looks like in the protobuf to make it clear for users what the condition they need to detect is. |
||
|
|
||
| This compatibility mode defines how the AggregateRel behaves with empty grouping set on an empty input. Default is `EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS`. | ||
|
|
||
| | Mode | Behavior | Example Systems | | ||
| | -------------------------------------------------|-------------------------------|-----------------| | ||
| | EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS | A row for empty grouping set | PostgreSQL | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should document what the row should contain as well. |
||
| | EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_NO_ROWS | No row for empty grouping set | Microsoft SQL Sever family, Oracle | | ||
|
|
||
| **Example:** | ||
| ```sql | ||
| -- The following two SQL statements yields a single row with value 0 in the systems DO NOT require this compatibility. | ||
| SELECT COUNT(*) FROM T -- [(0)] when T is empty. | ||
| SELECT COUNT(*) FROM T GROUP BY GROUNPING SETS (()) -- [] when T is empty in systems requiring this compatibility. | ||
| ``` | ||
|
|
||
| === "AggregateRel Message" | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.