dt[, (cols) := list(...), by = group] should not silently recycle list #4022

@renkun-ken

Currently, dt[, (cols) := list(...), by = group] seems to silently recycle list(...) when replacing the values of cols. If length(list) < length(cols), the list is recycled; if length(list) > length(cols), the redundant elements of the list are silently dropped. Both cases are demonstrated below.

When by = group is absent, the lengths are checked:

library(data.table)
dt <- data.table(id = 1:10)
xn <- 1:3
xcols <- paste0("x", xn)
dt[, (xcols) := list(10, 20)]
#> Error in `[.data.table`(dt, , `:=`((xcols), list(10, 20))): Supplied 3 columns to be assigned 2 items. Please see NEWS for v1.12.2.

However, if by = group is used, a list shorter than cols is silently recycled:

library(data.table)
dt <- data.table(id = 1:10)
dt[, group := sample(1:2, .N, replace = TRUE)]
xn <- 1:3
xcols <- paste0("x", xn)
dt[, (xcols) := list(10, 20), by = group]
dt
#>     id group x1 x2 x3
#>  1:  1     2 10 20 10
#>  2:  2     2 10 20 10
#>  3:  3     2 10 20 10
#>  4:  4     2 10 20 10
#>  5:  5     2 10 20 10
#>  6:  6     2 10 20 10
#>  7:  7     1 10 20 10
#>  8:  8     2 10 20 10
#>  9:  9     2 10 20 10
#> 10: 10     2 10 20 10

If length(list) > length(cols), the extra elements are silently dropped:

library(data.table)
dt <- data.table(id = 1:10)
dt[, group := sample(1:2, .N, replace = TRUE)]
xn <- 1:3
xcols <- paste0("x", xn)
dt[, (xcols) := list(40, 30, 20, 10), by = group]
dt
#>     id group x1 x2 x3
#>  1:  1     1 40 30 20
#>  2:  2     1 40 30 20
#>  3:  3     2 40 30 20
#>  4:  4     2 40 30 20
#>  5:  5     2 40 30 20
#>  6:  6     1 40 30 20
#>  7:  7     1 40 30 20
#>  8:  8     2 40 30 20
#>  9:  9     2 40 30 20
#> 10: 10     1 40 30 20

Personally, I find this recycling behavior almost always unwanted. When it occurs, it usually means something is wrong in my code.

Consider the following example where list(...) is produced by lapply(.SD, ...). If the function is inlined and a bit complicated, one often forgets to write .SDcols.

library(data.table)
set.seed(123)
dt <- data.table(id = 1:10)
dt[, group := sample(1:2, .N, replace = TRUE)]
xn <- 1:3
xcols <- paste0("x", xn)
for (i in xn) {
  dt[, xcols[[i]] := runif(.N)]
}
dt[, (xcols) := lapply(.SD, function(x) {
  x / sd(x)
}), by = group]
dt
#>     id group        x1        x2        x3
#>  1:  1     1 0.2672612 2.7645427 3.2041655
#>  2:  2     1 0.5345225 1.3098014 2.4955128
#>  3:  3     1 0.8017837 1.9576795 2.3071378
#>  4:  4     2 2.3421602 1.5214175 4.9351189
#>  5:  5     1 1.3363062 0.2973764 2.3618854
#>  6:  6     2 3.5132403 2.3907258 3.5168344
#>  7:  7     2 4.0987803 0.6538253 2.7005050
#>  8:  8     2 4.6843204 0.1117471 2.9490603
#>  9:  9     1 2.4053512 0.9474491 1.0415680
#> 10: 10     1 2.6726124 2.7578116 0.5299108

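Incorrect results are silently produced: without .SDcols, .SD contains id, x1, x2, and x3, so lapply() returns four elements. The first three (computed from id, x1, and x2) are assigned to x1, x2, and x3, and the fourth is dropped. The following are the correct results with .SDcols added.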

library(data.table)
set.seed(123)
dt <- data.table(id = 1:10)
dt[, group := sample(1:2, .N, replace = TRUE)]
xn <- 1:3
xcols <- paste0("x", xn)
for (i in xn) {
  dt[, xcols[[i]] := runif(.N)]
}
dt[, (xcols) := lapply(.SD, function(x) {
  x / sd(x)
}), by = group, .SDcols = xcols]
dt
#>     id group        x1        x2        x3
#>  1:  1     1 2.7645427 3.2041655 2.5018371
#>  2:  2     1 1.3098014 2.4955128 2.3440794
#>  3:  3     1 1.9576795 2.3071378 1.7943807
#>  4:  4     2 1.5214175 4.9351189 2.9399476
#>  5:  5     1 0.2973764 2.3618854 0.0639438
#>  6:  6     2 2.3907258 3.5168344 1.7658739
#>  7:  7     2 0.6538253 2.7005050 2.8031711
#>  8:  8     2 0.1117471 2.9490603 0.7998165
#>  9:  9     1 0.9474491 1.0415680 0.8266013
#> 10: 10     1 2.7578116 0.5299108 0.6017398

I suggest that list(...) recycling should be consistent with the behavior data.table has already adopted with row recycling: only accepting zero, one, or .N elements.
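
Until the check exists in the grouped case, a user-side guard could look roughly like the following. This is only a sketch of a workaround: check_lengths is a hypothetical helper, not part of data.table, and its error message merely imitates the one shown for the ungrouped case.

```r
library(data.table)

# Hypothetical helper (not a data.table function): fail fast when the number
# of supplied values differs from the number of target columns.
check_lengths <- function(values, cols) {
  if (length(values) != length(cols)) {
    stop(sprintf("Supplied %d columns to be assigned %d items.",
                 length(cols), length(values)))
  }
  values
}

dt <- data.table(id = 1:10, group = rep(1:2, each = 5))
xcols <- paste0("x", 1:3)

# Matching lengths: the grouped assignment proceeds as usual.
dt[, (xcols) := check_lengths(list(10, 20, 30), xcols), by = group]

# Mismatched lengths: the helper raises an error before anything is assigned.
res <- tryCatch(
  dt[, (xcols) := check_lengths(list(10, 20), xcols), by = group],
  error = function(e) conditionMessage(e)
)
print(res)
```

The same wrapper can be placed around lapply(.SD, ...) to catch a forgotten .SDcols.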
