-
Notifications
You must be signed in to change notification settings - Fork 157
Description
CIR-2017-219
Grouping key selection for aggregating subqueries
Introduction
openCypher provides way to express aggregation queries by an implicit grouping key, and the Neo4j 3.1 documentation tells:
RETURN n, count(*)We have two return expressions:
n, andcount(*). The first,n, is not an aggregate function, and so it will be the grouping key. The latter,count(*)is an aggregate expression.
As long as aggregate and non-aggregate expressions are clearly separated in the RETURN/WITH expression list, i.e. each comma-separated expression is either an aggregate or a non-aggregate expression, the situation seems clear. We have seen an example demonstrating the problem, and a proposed solution (#188, #218) is to force clear separation of aggregate and non-aggregate expressions.
In this CIR, I'd like to propose a solution that fits in the current syntax and it's semantics is still clear.
Options
- grouping key is the tuple built from all variables (node, relationship, their properties or variables chained from previous subquery) that appear outside of aggregate functions of a particular WITH/RETURN clause
- each item of the expression list in WITH/RETURN are forced to contain either
(i)no aggregate function or a(ii)single aggregate function at the outermost level (this is the approach in Syntax differentiation for aggregations #188, Aggregations #218). The grouping key is the tuple built from items of type(i)
Discussion
Neither option 1 nor option 2 restricts the expressiveness of the language as the resultset of either approach can be expressed using the other approach by introducing subqueries.
Option 1: group by variables that appear outside of aggregate functions
The most notable problem with this approach is that it would change behaviour of Neo4j even when aggregating and non-aggregating return items are clearly separated.
On the other hand, grouping on the variables, i.e. the basic building blocks of an expression seems to be flexible yet still clear way in case of a complex expression.
Variables can stand for:
- nodes
- relationships
- properties of nodes or relationships
- expressions chained from previous subqueries
Option 2: group by expressions that has no aggregation
The most obvious benefit of this approach is that Neo4j currently works this way if each return item is either aggregating or non-aggregating.
However, counter-intuitive results come if mixed. Posing restrictions on creating complex by mixing aggregating and non-aggregating expressions is a safety net for beginners, but cumbersome for more complex queries.
Examples
Sample data
We initialize our database of 10 nodes and no relationships. 5 of them has a single label Foo, the other 5 has Bar, both partition having a single node for weights of -2, -1, 0, 1, 2.
unwind range(1, 5) as i
create (:Foo {id: i, weight: i-3})
;
unwind range(1, 5) as i
create (:Bar {id: 10+i, weight: i-3})
;
match (n)
return labels(n), n
order by n.id
;
Which leads to the dataset:
╒═══════════╤═════════════════════════╕
│"labels(n)"│"n" │
╞═══════════╪═════════════════════════╡
│["Foo"] │{"weight":"-2","id":"1"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"-1","id":"2"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"0","id":"3"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"1","id":"4"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"2","id":"5"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"-2","id":"11"}│
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"-1","id":"12"}│
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"0","id":"13"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"1","id":"14"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"2","id":"15"} │
└───────────┴─────────────────────────┘
Query1a: retrieving a variable and an aggregate: a clear situation
match (n)
return n.weight, count(*)
Both approaces should return the same as the non-aggregating return item is a single variable:
╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2" │"2" │
├──────────┼──────────┤
│"1" │"2" │
├──────────┼──────────┤
│"-2" │"2" │
├──────────┼──────────┤
│"0" │"2" │
├──────────┼──────────┤
│"-1" │"2" │
└──────────┴──────────┘
Query1b: retrieving a variable and a computation based on aggregate
match (n)
return n.weight as w, 2*count(*) as c2
This is still clear, however, using option 2, we have to rewrite this as:
match (n)
with n.weight as w, count(*) as c
return w, 2*c as c2
Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation
match (n)
return abs(n.weight) as abs, count(*) as count
Option 2
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2" │"4" │
├─────┼───────┤
│"1" │"4" │
├─────┼───────┤
│"0" │"2" │
└─────┴───────┘
When using Option 1, this result can be expressed like
match (n)
with abs(n.weight) as abs, n
return abs, count(*) as count
Option 1
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2" │"2" │
├─────┼───────┤
│"1" │"2" │
├─────┼───────┤
│"2" │"2" │
├─────┼───────┤
│"0" │"2" │
├─────┼───────┤
│"1" │"2" │
└─────┴───────┘
When using semantics of Option 2, this result can be achieved like
match (n)
with n.weight as w, count(*) as c
return abs(w) as abs, c as count
Query 3: returning complex, single item
match (n: Foo), (m:Bar)
where n.weight = - m.weight
return n.weight+m.weight+count(*) as expr
, collect(n) as nc, collect(m) as mc
Option 2
Can't handle such a query as non-aggregating and aggregating expressions are mixed in 1st return item.
Option 1
Option 1 should return:
╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc" │"mc" │
╞══════╪══════════════════════════╪═══════════════════════════╡
│"1" │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘
This result expressed when using syntax and semantics of option 2:
match (n: Foo), (m:Bar)
where n.weight = - m.weight
with n.weight as nw, m.weight as mw
, count(*) as c
, collect(n) as nc, collect(m) as mc
return nw+mw+c as expr, nc, mc