Grouping key selection for aggregating subqueries

# CIR-2017-219

Grouping key selection for aggregating subqueries

## Introduction

openCypher provides way to express aggregation queries by an implicit grouping key, and the [Neo4j 3.1 documentation][neo4j-3.1-aggregation] tells:

> `RETURN n, count(*)`
>
> We have two return expressions: `n`, and `count(*)`. The first, `n`, is not an aggregate function, and so it will be the grouping key. The latter, `count(*)` is an aggregate expression.

As long as aggregate and non-aggregate expressions are clearly separated in the RETURN/WITH expression list, i.e. each comma-separated expression is either an aggregate or a non-aggregate expression, the situation *seems* clear. We have seen an example [demonstrating the problem][ocim1-aggregation-example], and a proposed solution (#188, #218) is to force clear separation of aggregate and non-aggregate expressions.

In this CIR, I'd like to propose a solution that fits in the current syntax and it's semantics is still clear.

## Options

 1. grouping key is the tuple built from all variables (node, relationship, their properties or variables chained from previous subquery) that appear outside of aggregate functions of a particular WITH/RETURN clause
 2. each item of the expression list in WITH/RETURN are forced to contain either `(i)` no aggregate function or a `(ii)` single aggregate function at the outermost level (this is the approach in #188, #218). The grouping key is the tuple built from items of type `(i)`
 
## Discussion

Neither option 1 nor option 2 restricts the expressiveness of the language as the resultset of either approach can be expressed using the other approach by introducing subqueries.

### Option 1: group by variables that appear outside of aggregate functions

The most notable problem with this approach is that it would change behaviour of Neo4j even when aggregating and non-aggregating return items are clearly separated.

On the other hand, grouping on the variables, i.e. the basic building blocks of an expression seems to be flexible yet still clear way in case of a complex expression.

Variables can stand for:

 - nodes
 - relationships
 - properties of nodes or relationships
 - expressions chained from previous subqueries


### Option 2: group by expressions that has no aggregation

The most obvious benefit of this approach is that Neo4j currently works this way if each return item is either aggregating or non-aggregating.

However, counter-intuitive results come if mixed. Posing restrictions on creating complex by mixing aggregating and non-aggregating expressions is a safety net for beginners, but cumbersome for more complex queries.



## Examples

### Sample data

We initialize our database of 10 nodes and no relationships. 5 of them has a single label *Foo*, the other 5 has *Bar*, both partition having a single node for weights of -2, -1, 0, 1, 2.

```
unwind range(1, 5) as i
create (:Foo {id: i, weight: i-3})
;
unwind range(1, 5) as i
create (:Bar {id: 10+i, weight: i-3})
;

match (n)
return labels(n), n
order by n.id
;
```

Which leads to the dataset:

```
╒═══════════╤═════════════════════════╕
│"labels(n)"│"n"                      │
╞═══════════╪═════════════════════════╡
│["Foo"]    │{"weight":"-2","id":"1"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"-1","id":"2"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"0","id":"3"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"1","id":"4"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"2","id":"5"}  │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-2","id":"11"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-1","id":"12"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"0","id":"13"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"1","id":"14"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"2","id":"15"} │
└───────────┴─────────────────────────┘
```

### Query1a: retrieving a variable and an aggregate: a clear situation

```
match (n)
return n.weight, count(*)
```

Both approaces should return the same as the non-aggregating return item is a single variable:

```
╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2"       │"2"       │
├──────────┼──────────┤
│"1"       │"2"       │
├──────────┼──────────┤
│"-2"      │"2"       │
├──────────┼──────────┤
│"0"       │"2"       │
├──────────┼──────────┤
│"-1"      │"2"       │
└──────────┴──────────┘
```

### Query1b: retrieving a variable and a computation based on aggregate

```
match (n)
return n.weight as w, 2*count(*) as c2
```

This is still clear, however, using option 2, we have to rewrite this as:

```
match (n)
with n.weight as w, count(*) as c
return w, 2*c as c2
```

### Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation

```
match (n)
return abs(n.weight) as abs, count(*) as count
```

#### Option 2

```
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"4"    │
├─────┼───────┤
│"1"  │"4"    │
├─────┼───────┤
│"0"  │"2"    │
└─────┴───────┘
```

When using Option 1, this result can be expressed like

```
match (n)
with abs(n.weight) as abs, n
return abs, count(*) as count
```

#### Option 1

```
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
├─────┼───────┤
│"2"  │"2"    │
├─────┼───────┤
│"0"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
└─────┴───────┘
```

When using semantics of Option 2, this result can be achieved like

```
match (n)
with n.weight as w, count(*) as c
return abs(w) as abs, c as count
```

### Query 3: returning complex, single item

```
match (n: Foo), (m:Bar)
where n.weight = - m.weight
return n.weight+m.weight+count(*) as expr
     , collect(n) as nc, collect(m) as mc
```

#### Option 2

Can't handle such a query as non-aggregating and aggregating expressions are mixed in 1st return item.

#### Option 1

Option 1 should return:

```
╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc"                      │"mc"                       │
╞══════╪══════════════════════════╪═══════════════════════════╡
│"1"   │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘
```

This result expressed when using syntax and semantics of option 2:

```
match (n: Foo), (m:Bar)
where n.weight = - m.weight
with n.weight as nw, m.weight as mw
   , count(*) as c
   , collect(n) as nc, collect(m) as mc
return nw+mw+c as expr, nc, mc
```

[neo4j-3.1-aggregation]: https://neo4j.com/docs/developer-manual/3.1/cypher/functions/aggregating/
[ocim1-aggregation-example]: https://groups.google.com/forum/#!topic/opencypher/fpk9DMZ3aeg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grouping key selection for aggregating subqueries #219

CIR-2017-219

Introduction

Options

Discussion

Option 1: group by variables that appear outside of aggregate functions

Option 2: group by expressions that has no aggregation

Examples

Sample data

Query1a: retrieving a variable and an aggregate: a clear situation

Query1b: retrieving a variable and a computation based on aggregate

Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation

Option 2

Option 1

Query 3: returning complex, single item

Option 2

Option 1

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Grouping key selection for aggregating subqueries #219

Description

CIR-2017-219

Introduction

Options

Discussion

Option 1: group by variables that appear outside of aggregate functions

Option 2: group by expressions that has no aggregation

Examples

Sample data

Query1a: retrieving a variable and an aggregate: a clear situation

Query1b: retrieving a variable and a computation based on aggregate

Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation

Option 2

Option 1

Query 3: returning complex, single item

Option 2

Option 1

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions