22 changes: 22 additions & 0 deletions proto/substrait/algebra.proto
@@ -350,6 +350,8 @@ message AggregateRel {
// `Grouping.expression_references`.
repeated Expression grouping_expressions = 5;

Compatibility compatibility = 6;

substrait.extensions.AdvancedExtension advanced_extension = 10;

message Grouping {
@@ -370,6 +372,26 @@ message AggregateRel {
// Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization
Expression filter = 2;
}

// Various modes of operations of AggregateRel to capture different behaviors across systems.
message Compatibility {
Member

I'm a bit leery of having many submessages with similar names, as that will complicate parsing. But having one compatibility message with many unused parts is similarly unsatisfying. The options mechanism used for functions is probably the most appropriate model.

// Defines the behavior of AggregateRel when there is an empty grouping set in the `groupings`
// and the input is empty. An empty grouping set is an aggregation over the entire input and some
// systems implement different behaviors when the input is empty.
enum EmptyGroupingSetOnEmptyInput {
Member

My natural inclination is to give an enum a name that captures:

  1. The visible behaviour it modifies
  2. The condition to trigger it

but here that would give us something like RowOutputOnEmptyGroupingSetOnEmptyInput which I'm not 100% sure is worth the verbosity.

// Default is `EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS`.
EMPTY_GROUPING_SET_ON_EMPTY_INPUT_UNSPECIFIED = 0;
// If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row
// for the empty grouping set on empty input (i.e., explicit grouping over the entire input).
// For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty.
Member

I don't think we can use AggregateRel[(), COUNT] as our example, because we don't have a text format defined for something like this.

EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS = 1;
Member

minor

EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS
to
EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROW

// The AggregateRel yields no row for the empty grouping set on empty input (i.e., grouping over the rows).
// For example, AggregateRel[(), COUNT] yields no record when the input is empty.
EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_NO_ROWS = 2;
}

EmptyGroupingSetOnEmptyInput empty_grouping_set_on_empty_input = 1;
}
Member

I'm of two minds on declaring settings like this. On one hand, having a single message with a bunch of boolean behavioral toggles makes it easy to add new toggles, because we can just add a new field. We would need to make sure that the default unset value matches the default behaviour when we do this. Generally though, I'm wary of boolean toggles because IMO they can be hard to understand, and they are limited to switching between 2 different behaviors.

I personally lean towards the enum style of setting toggles because we can indicate the expected behavior with the name, we can declare more than 2 types of behaviors, we can add behaviors easily if we discover more weird system behaviour and we can explicitly declare the unset values as unspecified.

enum EmptyInputMode {
      OUTPUT_MODE_UNSPECIFIED = 0;
      OUTPUT_MODE_YIELD_EMPTY_ROW = 1;
      OUTPUT_MODE_YIELD_NO_ROW = 2;
}

Member

@vbarua vbarua Dec 2, 2025

When we add these kinds of compatibility toggles, we should also document them on the website. For context, the docs should list the systems each toggle is useful for, along with example queries that trigger the behaviour.

Contributor Author

Yes, I plan to add the documentation -- the weeks have been crazy, can't find time... perhaps later this week or next week, I'll update the PR with the documentation.

Contributor Author

@vbarua I agree with the enum if there are more than two options, but I don't see this particular one having a third option. The message does not have to be a collection of booleans. If a field needs more than two options in the future, then that field should be an enum.

If you do want to see enum, please let me know!

Member

Even with 2 options, I still have a strong preference for the enum version, as it also makes it possible to check whether the compatibility option has been set explicitly or not. I also find it's easier to document behaviour via the enum name. yield_no_rows_on_empty_input=true makes it clear that I shouldn't output a row, but yield_no_rows_on_empty_input=false isn't as explicit. How many rows should I output? What should be in the row, if anything? I would probably need to check the documentation.

If I see values like YIELD_EMPTY_ROW and YIELD_NO_ROW on the other hand, it's very clear what I should be doing and I won't need to go look at the documentation.
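
To illustrate the explicitness argument, here is a minimal Python sketch. It is not Substrait code: `EmptyInputMode`, `BoolOptions`, `EnumOptions`, and `was_set_explicitly` are hypothetical names, and the dataclass defaults only imitate proto3 fields declared without `optional`, where an unset scalar reads as its zero value. Under that assumption a boolean toggle cannot tell "unset" from "false", while an enum with a zero-valued unspecified member can.

```python
from dataclasses import dataclass
from enum import IntEnum


class EmptyInputMode(IntEnum):
    # Hypothetical names mirroring the enum style proposed in this thread.
    UNSPECIFIED = 0
    YIELD_EMPTY_ROW = 1
    YIELD_NO_ROW = 2


@dataclass
class BoolOptions:
    # Imitates an unset proto3 bool, which reads as False.
    yield_no_rows_on_empty_input: bool = False


@dataclass
class EnumOptions:
    # Imitates an unset proto3 enum, which reads as its zero value.
    mode: EmptyInputMode = EmptyInputMode.UNSPECIFIED


def was_set_explicitly(options: EnumOptions) -> bool:
    """Return True when the producer chose a concrete mode."""
    return options.mode != EmptyInputMode.UNSPECIFIED


# With the boolean, "never set" and "explicitly false" compare equal.
print(BoolOptions() == BoolOptions(yield_no_rows_on_empty_input=False))
# With the enum, the two cases remain distinguishable.
print(was_set_explicitly(EnumOptions()))
print(was_set_explicitly(EnumOptions(EmptyInputMode.YIELD_NO_ROW)))
```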

Contributor Author

@vbarua In the proto or doc?

Contributor Author

Also, if this is a theme, we are effectively banning the usage of boolean in the Substrait, which is fine, as the same trade-off shows up in ordinary programming languages: passing 0 or a boolean as a function argument versus passing an enum. If we agree on this, we should go over the spec and replace all booleans with enums, perhaps in 1.0.

Contributor Author

@vbarua changed to enum. PTAL!

Member

@vbarua vbarua Dec 10, 2025

> we are effectively banning the usage of boolean in the Substrait,

I would say heavily discouraging 😅

Aside from nullable, I don't think there are that many. But yes, something to keep in mind for 1.0.

> In the proto or doc?

I was thinking in the doc. This is surprisingly weird behaviour. I was testing with something like

SELECT COUNT(*), COUNT(id), SUM(id), STRING_AGG(s, ',')
FROM test;

on db-fiddle which outputs

(0, 0, null, null)

Trino does

trino> WITH test(i, s) AS (VALUES (1, 'a'), (2, 'b') LIMIT 0)
    -> SELECT COUNT(*), COUNT(i), SUM(i), LISTAGG(s) WITHIN GROUP (ORDER BY s) FROM test;
 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
     0 |     0 |  NULL | NULL
(1 row)

So it looks like the behaviour on empty inputs is that the count functions return 0, and all other functions return null.

Contributor Author

COUNT is special. All other aggregates are supposed to yield NULL over empty input.
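
The behaviour discussed in this thread can be reproduced locally with a short Python sketch using the standard-library `sqlite3` module. SQLite is a stand-in chosen here for convenience and is not one of the systems tested above; `GROUP_CONCAT` is used in place of `STRING_AGG`, and the table layout is an assumption modelled on the query in the earlier comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table is intentionally left empty.
conn.execute("CREATE TABLE test (id INTEGER, s TEXT)")

# Aggregation over the entire (empty) input, with no GROUP BY.
global_rows = conn.execute(
    "SELECT COUNT(*), COUNT(id), SUM(id), GROUP_CONCAT(s, ',') FROM test"
).fetchall()

# Aggregation with a grouping key over the same empty input.
grouped_rows = conn.execute(
    "SELECT id, COUNT(*) FROM test GROUP BY id"
).fetchall()

print(global_rows)   # one row: the counts are 0, the other aggregates are NULL
print(grouped_rows)  # no rows at all
conn.close()
```

In SQLite the ungrouped query returns a single row and the grouped query returns none, which matches the "COUNT is special" observation.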

}

// ConsistentPartitionWindowRel provides the ability to perform calculations across sets of rows
21 changes: 21 additions & 0 deletions site/docs/relations/logical_relations.md
@@ -407,6 +407,27 @@ If at least one grouping expression is present, the aggregation is allowed to no
| Per Grouping Set | A list of expression grouping that the aggregation measured should be calculated for. | Optional. |
| Measures | A list of one or more aggregate expressions along with an optional filter. | Optional, required if no grouping sets. |

### Aggregate Compatibility

The aggregate operation is one of the most complex operations in the spec. Although implementations mostly agree on its behavior, they can differ in corner cases. Those behavioral differences are captured in the `compatibility` field.

NOTE: Compatibility is meant to address gaps in the core implementation of aggregation, such as grouping sets. For custom aggregations, consider using aggregate extension functions. If you want to introduce a new compatibility mode, reach out to the Substrait PMC to discuss.

#### Empty Grouping Set on Empty Input
Member

We should document what the encoding of an empty grouping looks like in the protobuf to make it clear for users what the condition they need to detect is.


This compatibility mode defines how the AggregateRel behaves when there is an empty grouping set and the input is empty. The default is `EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS`.

| Mode | Behavior | Example Systems |
| -------------------------------------------------|-------------------------------|-----------------|
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS | A row for empty grouping set | PostgreSQL |
Member

We should document what the row should contain as well.

| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_NO_ROWS | No row for empty grouping set | Microsoft SQL Server family, Oracle |

**Example:**
```sql
-- Both statements yield a single row with value 0 in systems that do NOT require this compatibility.
SELECT COUNT(*) FROM T -- [(0)] when T is empty.
SELECT COUNT(*) FROM T GROUP BY GROUPING SETS (()) -- [] when T is empty in systems requiring this compatibility.
```
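
The two modes can also be sketched as a small Python model. This is only an illustration under stated assumptions, not an implementation of AggregateRel: `Mode` and `count_rows` are hypothetical names, rows are plain dictionaries, a grouping set is a tuple of column names (the empty tuple `()` stands for the empty grouping set), and the only measure modelled is a row count.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical short names for the two documented modes.
    UNSPECIFIED = 0
    YIELDS_ROWS = 1
    YIELDS_NO_ROWS = 2


def count_rows(rows, grouping_sets, mode=Mode.UNSPECIFIED):
    """Return (grouping_set, group_key, count) tuples for a COUNT measure."""
    if mode is Mode.UNSPECIFIED:
        mode = Mode.YIELDS_ROWS  # the documented default
    out = []
    for keys in grouping_sets:
        counts = {}
        for row in rows:
            group = tuple(row[k] for k in keys)
            counts[group] = counts.get(group, 0) + 1
        # Only the empty grouping set on empty input is affected by the mode.
        if not counts and not keys and mode is Mode.YIELDS_ROWS:
            counts[()] = 0
        out.extend((keys, group, n) for group, n in counts.items())
    return out


print(count_rows([], [()]))                       # default mode
print(count_rows([], [()], Mode.YIELDS_NO_ROWS))  # compatibility mode
print(count_rows([], [("a",)]))                   # non-empty grouping set
```

In this model a non-empty grouping set never produces a row for empty input, so the mode only changes the result for the empty grouping set.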

=== "AggregateRel Message"
