Description
Is your feature request related to a problem? Please describe.
Apache Spark has min_by and max_by groupby aggregations that return the value from a value column at the row where an ordering column reaches its min/max within each group. To implement these aggregations, we currently combine the input into a structs column (with the ordering column placed before the value column) and find the min/max struct element. However, finding the min/max of struct elements is not supported in hash-based groupby aggregation, so these queries fall back to the sort-based aggregation code path, which can be much slower, especially for queries with many aggregations.
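For concreteness, here is a small PySpark sketch (toy data, made-up column names) of both the native aggregation and the struct-based rewrite described above; both return, per group, the value of `v` on the row where `o` is smallest:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("b", 0, 40)],
    ["k", "o", "v"],  # key, ordering, value
)

# Native Spark min_by: the value of `v` on the row with the smallest `o` per group.
native = df.groupBy("k").agg(F.expr("min_by(v, o)").alias("min_by_v"))

# The struct-based rewrite: min over struct(o, v) compares by `o` first,
# then the value is extracted from the winning struct.
rewritten = df.groupBy("k").agg(
    F.min(F.struct("o", "v")).getField("v").alias("min_by_v")
)
```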
Describe the solution you'd like
When we tried to implement native min_by and max_by aggregations in cudf (#16163), it was suggested that a more efficient approach (see the sketch after this list) is:
- Finding arg_min/arg_max of the `ordering` column, then
- Gathering the `value` column using the arg_min/arg_max indices
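A minimal pandas-style sketch of this two-step approach (toy data; cudf's Python API mirrors `idxmin`/`idxmax`, and in libcudf the steps roughly correspond to an ARGMIN/ARGMAX groupby aggregation followed by cudf::gather):

```python
import pandas as pd

df = pd.DataFrame({
    "k": ["a", "a", "b", "b"],   # group keys
    "o": [1, 2, 3, 0],           # ordering column
    "v": [10, 20, 30, 40],       # value column
})

# Step 1: per-group arg_min of the ordering column, i.e. the row index
# where `o` is smallest within each group.
idx = df.groupby("k")["o"].idxmin()

# Step 2: gather the value column at those row indices.
min_by_v = df.loc[idx, ["k", "v"]]
```

The gather step is what requires access to the original input rows after the aggregation has run, which is the sticking point described next.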
However, after investing significant effort in the Spark-Rapids plugin, we still could not implement this approach: the plugin's architecture cannot connect the input of the aggregation to the output of the aggregation for the gathering step. As such, we still need native min_by/max_by aggregations in cudf.
Describe alternatives you've considered
An implementation was already proposed in #16163, so we just need to create a new PR from it, address the review comments, and do some more cleanup before review.