[R] summarize after arrange fails #45373

@r2evans

Describe the bug, including details regarding any error messages, version, and platform.

If a lazy pipeline contains an arrange(.) followed by an aggregation with summarize(.), collect() still looks for the sort column, even though the summarized result no longer contains it:

library(arrow)
library(dplyr)
arrow_table(mtcars) |>
  summarize(across(mpg, list(Min = min, Max = max))) |>
  collect()
# # A tibble: 1 × 2
#   mpg_Min mpg_Max
#     <dbl>   <dbl>
# 1    10.4    33.9

arrow_table(mtcars) |>
  arrange(mpg) |>
  summarize(across(mpg, list(Min = min, Max = max))) |>
  collect()
# Error in compute.arrow_dplyr_query(x) : 
#   Invalid: Invalid sort key column: No match for FieldRef.Name(mpg) in mpg_Min: double
# mpg_Max: double
# ----
# mpg_Min:
#   [
#     [
#       10.4
#     ]
#   ]
# mpg_Max:
#   [
#     [
#       33.9
#     ]
#   ]

This example is somewhat contrived, in that this summarization does not need ordered data. The underlying question remains: why is the data not sorted at that point in the pipeline and then summarized? I'm not certain whether this is a problem with lazy sorting or whether the query is too aggressive in preserving the sort field(s).

By contrast, a select() that removes the sort column after arrange() works as expected:

arrow_table(mtcars) |>
  arrange(mpg) |>
  select(-mpg) |>
  collect()
# # A tibble: 32 × 10
#      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1     8  472    205  2.93  5.25  18.0     0     0     3     4
#  2     8  460    215  3     5.42  17.8     0     0     3     4
#  3     8  350    245  3.73  3.84  15.4     0     0     3     4
#  4     8  360    245  3.21  3.57  15.8     0     0     3     4
#  5     8  440    230  3.23  5.34  17.4     0     0     3     4
#  6     8  301    335  3.54  3.57  14.6     0     1     5     8
#  7     8  276.   180  3.07  3.78  18       0     0     3     3
#  8     8  304    150  3.15  3.44  17.3     0     0     3     2
#  9     8  318    150  2.76  3.52  16.9     0     0     3     2
# 10     8  351    264  4.22  3.17  14.5     0     1     5     4
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
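
A possible workaround, not a fix for the underlying behavior: materialize the sorted data with collect() before summarizing, so the sort is applied while mpg is still present. This is only a minimal sketch. It assumes the sorted data fits in memory, and the names `sorted` and `res` are illustrative.

```r
library(arrow)
library(dplyr)

# Run the sort inside arrow and pull the result into R as a tibble.
# At this point mpg still exists, so the sort key can be resolved.
sorted <- arrow_table(mtcars) |>
  arrange(mpg) |>
  collect()

# Summarize the in-memory tibble with plain dplyr; no pending sort
# key is carried along past this point.
res <- sorted |>
  summarize(across(mpg, list(Min = min, Max = max)))

print(res)
```

The trade-off is that the aggregation then runs in R rather than in arrow, which may matter for data larger than memory.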

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_18.1.0.1 dplyr_1.1.4   

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 utf8_1.2.4       R6_2.5.1         bit_4.5.0.1      tidyselect_1.2.1 magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3  bit64_4.5.2     
[11] generics_0.1.3   lifecycle_1.0.4  cli_3.6.3        fansi_1.0.6      vctrs_0.6.5      withr_3.0.2      compiler_4.4.2   purrr_1.0.2      pillar_1.9.0     rlang_1.1.4     

Component(s)

R
