-
Notifications
You must be signed in to change notification settings - Fork 27
Open
Description
I’ve encountered a fallback when filtering on a column using %in% with many values, but only when the data is lazily read from a Parquet file.
Reprex:
suppressMessages({
library(duckplyr)
library(dplyr)
library(arrow)
library(fs)
})
# Create a simple Parquet file
tmp <- dir_create(file_temp())
file <- path(tmp, "test.parquet")
in_memory_duckdb_df <-
tibble(id = 1:1000) |>
compute_parquet(file)
scanned_parquet_df <- read_parquet_duckdb(file, prudence = "stingy")
many_values <- 1:500
in_memory_duckdb_df |> filter(id %in% many_values) # no issues
#> # A duckplyr data frame: 1 variable
#> id
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> # ℹ more rows
scanned_parquet_df |> filter(id %in% many_values)
#> Error in `filter()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#> stingy duckplyr frame.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#> continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
#> Caused by error:
#> ! Materialization is disabled, use collect() or as_tibble() to materialize.
scanned_parquet_df |> filter(id %in% many_values[1:10]) |> explain()
#> ┌───────────────────────────┐
#> │ READ_PARQUET │
#> │ ──────────────────── │
#> │ Function: │
#> │ READ_PARQUET │
#> │ │
#> │ Projections: id │
#> │ │
#> │ Filters: │
#> │ COALESCE(((((((((("r_base:│
#> │ :=="(id, 1) OR "r_base::==│
#> │ "(id, 2)) OR "r_base::==" │
#> │ (id, 3)) OR "r_base::==" │
#> │ (id, 4)) OR "r_base::==" │
#> │ (id, 5)) OR "r_base::==" │
#> │ (id, 6)) OR "r_base::==" │
#> │ (id, 7)) OR "r_base::==" │
#> │ (id, 8)) OR "r_base::==" │
#> │ (id, 9)) OR "r_base::==" │
#> │ (id, 10)), false) │
#> │ │
#> │ ~200 Rows │
#> └───────────────────────────┘Created on 2025-08-05 with reprex v2.1.1
As you can see in the reprex output, when filtering on a smaller number of values, the translation results in a collection of OR == statements which might be the root cause. I'm wondering why simply using the "IN" operator wouldn't do the job better?
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels