Description
Hello,
In short, the multiple = "any" argument to inner_join() is ignored when translating the initial dplyr code to data.table, resulting in incorrect output without any warnings or errors.
Here's a minimal reprex:
library(dtplyr)
library(dplyr)
library(data.table)
a = tibble(barcode_id = c(3, 1))
b = tibble(
  gene_id = c("gene1", "gene2", "gene1"),
  barcode_id = c(1, 1, 3),
  cluster_id = c(2, 2, 5)
)
expected_result = a |>
  inner_join(b, by = 'barcode_id', multiple = 'any')
dtplyr_result = a |>
  data.table() |>
  lazy_dt() |>
  inner_join(b, by = 'barcode_id', multiple = 'any') |>
  as_tibble()
Here's expected_result:
  barcode_id gene_id cluster_id
       <dbl> <chr>        <dbl>
1          3 gene1            5
2          1 gene1            2
In contrast, the actual output dtplyr_result incorrectly keeps every matching row in b, so the row of a with barcode_id 1 appears twice:
  barcode_id gene_id cluster_id
       <dbl> <chr>        <dbl>
1          1 gene1            2
2          1 gene2            2
3          3 gene1            5
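For anyone hitting the same problem, here is a possible workaround sketch, not a fix for the translation itself. It assumes that keeping the first row of b per key is an acceptable stand-in for multiple = "any" (which only promises some single match per row of a). The name b_one_per_key is made up for illustration, and the row order of the result may differ from dplyr's.

```r
library(dplyr)
library(dtplyr)
library(data.table)

a <- tibble(barcode_id = c(3, 1))
b <- tibble(
  gene_id = c("gene1", "gene2", "gene1"),
  barcode_id = c(1, 1, 3),
  cluster_id = c(2, 2, 5)
)

# Assumption: one arbitrary match per key is all that is needed.
# Deduplicate b on the join key with plain dplyr before the join,
# so the join cannot return more than one row per row of a.
b_one_per_key <- b |>
  distinct(barcode_id, .keep_all = TRUE)

workaround_result <- a |>
  data.table() |>
  lazy_dt() |>
  inner_join(b_one_per_key, by = "barcode_id") |>
  as_tibble()
```

Because the deduplication happens before the data reaches dtplyr, the result no longer depends on whether multiple is translated.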
It's totally understandable that not every parameter of every dplyr verb is implemented yet, but I found it quite concerning that there is apparently no check that warns or errors when an unimplemented parameter (multiple, in this case) is passed. The silent failure makes dtplyr output difficult to trust more generally, especially when the starting dplyr code is complex.
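To illustrate the kind of check I have in mind, here is a rough user-side sketch. strict_inner_join is a hypothetical helper, not part of dtplyr or dplyr, and it assumes that lazy tables inherit from the class "dtplyr_step".

```r
library(dplyr)
library(dtplyr)

# Hypothetical wrapper (not a dtplyr function): refuse to run when
# `multiple` is passed for a lazy data.table input, instead of
# silently ignoring it.
strict_inner_join <- function(x, y, by = NULL, ...) {
  extra_args <- names(list(...))
  if (inherits(x, "dtplyr_step") && "multiple" %in% extra_args) {
    stop(
      "`multiple` may not be honored for lazy_dt() inputs; ",
      "deduplicate `y` on the join key instead.",
      call. = FALSE
    )
  }
  inner_join(x, y, by = by, ...)
}
```

With this wrapper, the dtplyr pipeline in the reprex would stop with an error instead of returning three rows, while the plain dplyr call would behave as before.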
Thanks for developing this package, which clearly serves an important purpose: getting performant dplyr code with very little additional effort. I verified that I'm using dtplyr 1.3.1 here.
Best,
-Nick