Skip to content

Grouping key selection for aggregating subqueries #219

@jmarton

Description

@jmarton

CIR-2017-219

Grouping key selection for aggregating subqueries

Introduction

openCypher provides way to express aggregation queries by an implicit grouping key, and the Neo4j 3.1 documentation tells:

RETURN n, count(*)

We have two return expressions: n, and count(*). The first, n, is not an aggregate function, and so it will be the grouping key. The latter, count(*) is an aggregate expression.

As long as aggregate and non-aggregate expressions are clearly separated in the RETURN/WITH expression list, i.e. each comma-separated expression is either an aggregate or a non-aggregate expression, the situation seems clear. We have seen an example demonstrating the problem, and a proposed solution (#188, #218) is to force clear separation of aggregate and non-aggregate expressions.

In this CIR, I'd like to propose a solution that fits in the current syntax and it's semantics is still clear.

Options

  1. grouping key is the tuple built from all variables (node, relationship, their properties or variables chained from previous subquery) that appear outside of aggregate functions of a particular WITH/RETURN clause
  2. each item of the expression list in WITH/RETURN are forced to contain either (i) no aggregate function or a (ii) single aggregate function at the outermost level (this is the approach in Syntax differentiation for aggregations #188, Aggregations #218). The grouping key is the tuple built from items of type (i)

Discussion

Neither option 1 nor option 2 restricts the expressiveness of the language as the resultset of either approach can be expressed using the other approach by introducing subqueries.

Option 1: group by variables that appear outside of aggregate functions

The most notable problem with this approach is that it would change behaviour of Neo4j even when aggregating and non-aggregating return items are clearly separated.

On the other hand, grouping on the variables, i.e. the basic building blocks of an expression seems to be flexible yet still clear way in case of a complex expression.

Variables can stand for:

  • nodes
  • relationships
  • properties of nodes or relationships
  • expressions chained from previous subqueries

Option 2: group by expressions that has no aggregation

The most obvious benefit of this approach is that Neo4j currently works this way if each return item is either aggregating or non-aggregating.

However, counter-intuitive results come if mixed. Posing restrictions on creating complex by mixing aggregating and non-aggregating expressions is a safety net for beginners, but cumbersome for more complex queries.

Examples

Sample data

We initialize our database of 10 nodes and no relationships. 5 of them has a single label Foo, the other 5 has Bar, both partition having a single node for weights of -2, -1, 0, 1, 2.

unwind range(1, 5) as i
create (:Foo {id: i, weight: i-3})
;
unwind range(1, 5) as i
create (:Bar {id: 10+i, weight: i-3})
;

match (n)
return labels(n), n
order by n.id
;

Which leads to the dataset:

╒═══════════╤═════════════════════════╕
│"labels(n)"│"n"                      │
╞═══════════╪═════════════════════════╡
│["Foo"]    │{"weight":"-2","id":"1"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"-1","id":"2"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"0","id":"3"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"1","id":"4"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"2","id":"5"}  │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-2","id":"11"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-1","id":"12"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"0","id":"13"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"1","id":"14"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"2","id":"15"} │
└───────────┴─────────────────────────┘

Query1a: retrieving a variable and an aggregate: a clear situation

match (n)
return n.weight, count(*)

Both approaces should return the same as the non-aggregating return item is a single variable:

╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2"       │"2"       │
├──────────┼──────────┤
│"1"       │"2"       │
├──────────┼──────────┤
│"-2"      │"2"       │
├──────────┼──────────┤
│"0"       │"2"       │
├──────────┼──────────┤
│"-1"      │"2"       │
└──────────┴──────────┘

Query1b: retrieving a variable and a computation based on aggregate

match (n)
return n.weight as w, 2*count(*) as c2

This is still clear, however, using option 2, we have to rewrite this as:

match (n)
with n.weight as w, count(*) as c
return w, 2*c as c2

Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation

match (n)
return abs(n.weight) as abs, count(*) as count

Option 2

╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"4"    │
├─────┼───────┤
│"1"  │"4"    │
├─────┼───────┤
│"0"  │"2"    │
└─────┴───────┘

When using Option 1, this result can be expressed like

match (n)
with abs(n.weight) as abs, n
return abs, count(*) as count

Option 1

╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
├─────┼───────┤
│"2"  │"2"    │
├─────┼───────┤
│"0"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
└─────┴───────┘

When using semantics of Option 2, this result can be achieved like

match (n)
with n.weight as w, count(*) as c
return abs(w) as abs, c as count

Query 3: returning complex, single item

match (n: Foo), (m:Bar)
where n.weight = - m.weight
return n.weight+m.weight+count(*) as expr
     , collect(n) as nc, collect(m) as mc

Option 2

Can't handle such a query as non-aggregating and aggregating expressions are mixed in 1st return item.

Option 1

Option 1 should return:

╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc"                      │"mc"                       │
╞══════╪══════════════════════════╪═══════════════════════════╡
│"1"   │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘

This result expressed when using syntax and semantics of option 2:

match (n: Foo), (m:Bar)
where n.weight = - m.weight
with n.weight as nw, m.weight as mw
   , count(*) as c
   , collect(n) as nc, collect(m) as mc
return nw+mw+c as expr, nc, mc

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions