Describe the bug, including details regarding any error messages, version, and platform.
If a lazy pipeline contains an arrange(...) that is followed by an aggregation with summarize(), collect() still looks for the sorting column in the summarized result:
library(arrow)
library(dplyr)
arrow_table(mtcars) |>
summarize(across(mpg, list(Min = min, Max = max))) |>
collect()
# # A tibble: 1 × 2
# mpg_Min mpg_Max
# <dbl> <dbl>
# 1 10.4 33.9
arrow_table(mtcars) |>
arrange(mpg) |>
summarize(across(mpg, list(Min = min, Max = max))) |>
collect()
# Error in compute.arrow_dplyr_query(x) :
# Invalid: Invalid sort key column: No match for FieldRef.Name(mpg) in mpg_Min: double
# mpg_Max: double
# ----
# mpg_Min:
# [
# [
# 10.4
# ]
# ]
# mpg_Max:
# [
# [
# 33.9
# ]
# ]
This example is somewhat contrived, in that this summarization does not need ordered data. The underlying question remains: why is the data not sorted at that point and then summarized? I'm not certain whether the problem is in the lazy sorting itself or whether the sort field(s) are preserved too aggressively.
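For comparison, here is a minimal sketch of the behavior I would expect, plus a stopgap. The first pipeline is plain in-memory dplyr, where the arrange() is applied and then summarized without complaint. The second is only an assumed workaround, not a fix: it calls collect() right after arrange(), so the sort runs inside Arrow and the summarize runs in dplyr on a tibble. The names `expected` and `workaround` are mine, for illustration only.

```r
library(arrow)
library(dplyr)

# Reference: in-memory dplyr sorts first, then summarizes; the sort
# column does not need to survive the aggregation.
expected <- mtcars |>
  arrange(mpg) |>
  summarize(across(mpg, list(Min = min, Max = max)))

# Assumed stopgap: materialize the sorted data before aggregating, so
# no pending sort key is carried into the summarized result.
# Cost: the whole sorted table is pulled into R memory.
workaround <- arrow_table(mtcars) |>
  arrange(mpg) |>
  collect() |>
  summarize(across(mpg, list(Min = min, Max = max)))
```

This gives up the lazy evaluation for the aggregation step, so it is only reasonable when the sorted table fits in memory.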
The summarize() error is in contrast to a select() that removes the sorting column, which works as expected:
arrow_table(mtcars) |>
arrange(mpg) |>
select(-mpg) |>
collect()
# # A tibble: 32 × 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 8 472 205 2.93 5.25 18.0 0 0 3 4
# 2 8 460 215 3 5.42 17.8 0 0 3 4
# 3 8 350 245 3.73 3.84 15.4 0 0 3 4
# 4 8 360 245 3.21 3.57 15.8 0 0 3 4
# 5 8 440 230 3.23 5.34 17.4 0 0 3 4
# 6 8 301 335 3.54 3.57 14.6 0 1 5 8
# 7 8 276. 180 3.07 3.78 18 0 0 3 3
# 8 8 304 150 3.15 3.44 17.3 0 0 3 2
# 9 8 318 150 2.76 3.52 16.9 0 0 3 2
# 10 8 351 264 4.22 3.17 14.5 0 1 5 4
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_18.1.0.1 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] assertthat_0.2.1 utf8_1.2.4 R6_2.5.1 bit_4.5.0.1 tidyselect_1.2.1 magrittr_2.0.3 glue_1.8.0 tibble_3.2.1 pkgconfig_2.0.3 bit64_4.5.2
[11] generics_0.1.3 lifecycle_1.0.4 cli_3.6.3 fansi_1.0.6 vctrs_0.6.5 withr_3.0.2 compiler_4.4.2 purrr_1.0.2 pillar_1.9.0 rlang_1.1.4
Component(s)
R