[FEA] Reduce overhead when computing compound aggregations in hash-based groupby #20734

@ttnghia

Description

A compound aggregation is an aggregation that depends on other aggregations. For example, MEAN depends on SUM and COUNT_VALID. To compute a compound aggregation, we must first compute the aggregations it depends on. However, computing those intermediate results typically involves unnecessary work, which can add up to significant overhead when the number of aggregations is large.

For example:

  • To compute MIN/MAX of strings, we first compute ARG_MIN/ARG_MAX, which produces a gather map used to gather the input. However, these ARG_MIN/ARG_MAX aggregations also launch kernels to compute a null mask and null count for the gather map, and neither is ever used.
  • Similarly, to compute M2, we first compute SUM and SUM_OF_SQUARES. These aggregations also launch kernels to compute a null mask and null count for the intermediate sums, and again neither is used.
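To make the M2 case concrete, here is a minimal host-side sketch of how M2 can be derived from the two intermediate sums plus the valid count, using the identity M2 = Σx² − (Σx)²/n. The `group_sums` struct and both function names are hypothetical and are not libcudf APIs; the point is only that the intermediates are consumed as plain numbers, so a null mask attached to them would go unused.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-group intermediates (not libcudf types).
struct group_sums {
  double sum;             // sum of valid values
  double sum_of_squares;  // sum of squared valid values
  std::size_t count_valid;
};

// Accumulate the intermediates for one group of (already valid) values.
group_sums accumulate(const std::vector<double>& values) {
  group_sums g{0.0, 0.0, 0};
  for (double v : values) {
    g.sum += v;
    g.sum_of_squares += v * v;
    ++g.count_valid;
  }
  return g;
}

// M2 = sum((x - mean)^2) = sum_of_squares - mean * sum.
// Only the raw sums and the count are read here.
double m2_from_sums(const group_sums& g) {
  if (g.count_valid == 0) { return 0.0; }
  double const mean = g.sum / static_cast<double>(g.count_valid);
  return g.sum_of_squares - mean * g.sum;
}
```

For the values 1, 2, 3, 4 the intermediates are SUM = 10 and SUM_OF_SQUARES = 30, so `m2_from_sums` yields 30 − 2.5 × 10 = 5.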

We can do better by skipping the null mask and null count computation when it is not needed. It is easy to tell whether an aggregation was requested by the user or is only needed as an intermediate result for a compound aggregation, so we can compute its null mask/null count only when the user requested it.
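The idea can be sketched as a small planning step that runs before any kernels are launched. This is an illustrative sketch only: the `dependencies` table, the `plan_null_masks` function, and the use of plain strings for aggregation kinds are all assumptions made for the example, not the existing hash-based groupby code, and only one level of dependencies is handled.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical dependency table, limited to the examples in this issue.
const std::map<std::string, std::vector<std::string>>& dependencies() {
  static const std::map<std::string, std::vector<std::string>> deps{
    {"MEAN", {"SUM", "COUNT_VALID"}},
    {"M2", {"SUM", "SUM_OF_SQUARES"}},
  };
  return deps;
}

// Maps every aggregation that must be computed to a flag saying whether
// its null mask / null count is needed (true only if the user asked for it).
std::map<std::string, bool> plan_null_masks(const std::vector<std::string>& requested) {
  std::map<std::string, bool> plan;
  for (auto const& agg : requested) {
    plan[agg] = true;  // user-requested: the output needs a null mask
    auto const it = dependencies().find(agg);
    if (it == dependencies().end()) { continue; }
    for (auto const& dep : it->second) {
      // emplace does not overwrite, so a dependency that the user also
      // requested keeps its `true` flag.
      plan.emplace(dep, false);
    }
  }
  return plan;
}
```

With `{"MEAN", "SUM"}` as the request, SUM keeps its null mask because the user asked for it, while COUNT_VALID is marked as intermediate-only and could skip that work.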
