Description
Is your feature request related to a problem? Please describe.
Apache Spark has min_by and max_by groupby aggregations that return the value from a value column at the row where an ordering column reaches its min/max within each group. To implement these aggregations, we currently combine the input into a structs column (with the ordering column placed before the value column) and find the min/max struct element. However, finding the min/max of struct elements is not supported in hash-based groupby aggregation, so these queries fall back to the sort-based aggregation code path, which can be much slower, especially for queries with many aggregations.
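For concreteness, here is a small PySpark sketch (toy data, made-up column names) of both the native aggregation and the struct-based rewrite described above; both return, per group, the value of `v` on the row where `o` is smallest:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("b", 0, 40)],
    ["k", "o", "v"],  # key, ordering, value
)

# Native Spark min_by: the value of `v` on the row with the smallest `o` per group.
native = df.groupBy("k").agg(F.expr("min_by(v, o)").alias("min_by_v"))

# The struct-based rewrite: min over struct(o, v) compares by `o` first,
# then the value is extracted from the winning struct.
rewritten = df.groupBy("k").agg(
    F.min(F.struct("o", "v")).getField("v").alias("min_by_v")
)
```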
Describe the solution you'd like
When we tried to implement native min_by and max_by aggregations in cudf (#16163), it was suggested that a more efficient approach (see the sketch after this list) is:
- Finding arg_min/arg_max of the `ordering` column, then
- Gathering the `value` column using the arg_min/arg_max indices
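A minimal pandas-style sketch of this two-step approach (toy data; cudf's Python API mirrors `idxmin`/`idxmax`, and in libcudf the steps roughly correspond to an ARGMIN/ARGMAX groupby aggregation followed by cudf::gather):

```python
import pandas as pd

df = pd.DataFrame({
    "k": ["a", "a", "b", "b"],   # group keys
    "o": [1, 2, 3, 0],           # ordering column
    "v": [10, 20, 30, 40],       # value column
})

# Step 1: per-group arg_min of the ordering column, i.e. the row index
# where `o` is smallest within each group.
idx = df.groupby("k")["o"].idxmin()

# Step 2: gather the value column at those row indices.
min_by_v = df.loc[idx, ["k", "v"]]
```

The gather step is what requires access to the original input rows after the aggregation has run, which is the sticking point described next.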
However, after investing significant effort in the Spark-Rapids plugin, we still could not implement this approach: the plugin's architecture cannot connect the input of the aggregation to the output of the aggregation for the gathering step. As such, we still need native min_by/max_by aggregations in cudf.
Describe alternatives you've considered
An implementation was already proposed in #16163, so we just need to create a new PR from it, address the review comments, and do some more cleanup before review.