[FEA] Support min_by and max_by aggregations in reduction and groupby #20946

@ttnghia

Description

Is your feature request related to a problem? Please describe.
Apache Spark has min_by and max_by groupby aggregations that return, for each group, the element of a value column at the row where an ordering column attains its min/max. To implement these aggregations, we currently combine the input into a structs column (with the ordering column placed before the value column) and find the min/max struct element. However, finding the min/max of struct elements is not supported in hash-based groupby aggregation, so the operation falls back to the sort-based aggregation code path, which can be much slower, especially for queries with many aggregations. A sketch of this workaround follows.
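
Here is a minimal sketch of that workaround against the libcudf C++ API, assuming the standard groupby and column-factory entry points; the function name `min_by_via_struct` is illustrative (not part of any existing API), and error handling plus stream/memory-resource arguments are omitted:

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_factories.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <rmm/device_buffer.hpp>

#include <memory>
#include <utility>
#include <vector>

// Illustrative only: pack (ordering, value) into a structs column, then take
// the per-group MIN of the struct. Struct comparison goes child by child, so
// placing the ordering column first makes it drive the comparison.
std::unique_ptr<cudf::column> min_by_via_struct(cudf::table_view const& keys,
                                                cudf::column_view const& ordering,
                                                cudf::column_view const& values)
{
  // Ordering column first so the struct min is decided by it.
  std::vector<std::unique_ptr<cudf::column>> children;
  children.emplace_back(std::make_unique<cudf::column>(ordering));
  children.emplace_back(std::make_unique<cudf::column>(values));
  auto structs = cudf::make_structs_column(
    ordering.size(), std::move(children), 0, rmm::device_buffer{});

  // Request the per-group MIN of the packed struct column. MIN on structs is
  // not supported by the hash-based groupby, so this runs on the slower
  // sort-based path -- the performance problem this issue is about.
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back();
  requests.back().values = structs->view();
  requests.back().aggregations.push_back(
    cudf::make_min_aggregation<cudf::groupby_aggregation>());

  cudf::groupby::groupby gb(keys);
  auto [group_keys, results] = gb.aggregate(requests);

  // Per group, the result struct holds (min ordering, corresponding value);
  // the caller then extracts the value child.
  return std::move(results.front().results.front());
}
```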

Describe the solution you'd like
When we tried to implement native min_by and max_by aggregations in cudf (#16163), it was suggested that a more efficient approach would be:

  • Find the arg_min/arg_max of the ordering column, then
  • Gather the value column using the arg_min/arg_max indices (see the sketch after this list).
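
A minimal sketch of that suggested approach against the libcudf C++ API; the function name `min_by_via_gather` is illustrative, and error handling plus stream/memory-resource arguments are again omitted:

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/copying.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <vector>

// Illustrative only: min_by as a per-group ARGMIN followed by a gather.
std::unique_ptr<cudf::table> min_by_via_gather(cudf::table_view const& keys,
                                               cudf::column_view const& ordering,
                                               cudf::column_view const& values)
{
  // 1. ARGMIN yields, per group, the row index at which the ordering column
  //    attains its minimum. ARGMIN is supported by the hash-based groupby.
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back();
  requests.back().values = ordering;
  requests.back().aggregations.push_back(
    cudf::make_argmin_aggregation<cudf::groupby_aggregation>());

  cudf::groupby::groupby gb(keys);
  auto [group_keys, results] = gb.aggregate(requests);

  // 2. Gather the value column at those per-group row indices.
  auto const& indices = results.front().results.front();
  return cudf::gather(cudf::table_view({values}), indices->view());
}
```

Note that step 2 consumes the original value column alongside the aggregation output; it is exactly this coupling between aggregation input and output that our plugin architecture cannot express, as described below.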

However, after investing significant effort in the Spark-RAPIDS plugin, we still could not implement this approach: the plugin's architecture cannot connect the aggregation's input to its output for the gathering step. As such, we still need native min_by/max_by aggregation support in cudf.

Describe alternatives you've considered
An implementation was already proposed in #16163, so we just need to create a new PR from it, address the review comments, and do some more cleanup before review.

Labels

Spark (Functionality that helps Spark RAPIDS), feature request (New feature or request)
