Commit 74a87d3

Merge branch 'master' into DataCodeIntegration
2 parents e692fc5 + 475599a commit 74a87d3

22 files changed, with 1809 additions and 36 deletions

.ci/.lintr.R

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ linters = c(dt_linters, all_linters(
     message = "Use messagef to avoid fragmented translations.",
     warning = "Use warningf to avoid fragmented translations.",
     stop = "Use stopf to avoid fragmented translations.",
+    rev = "Use frev internally, or setfrev if by-reference is safe.",
     NULL
   )),
   # undesirable_function_linter(modify_defaults(

NAMESPACE

Lines changed: 2 additions & 0 deletions
@@ -59,6 +59,7 @@ export(nafill)
 export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
+export(mergelist, setmergelist)
 export(cbindlist, setcbindlist)
 export(substitute2)
 #export(DT) # mtcars |> DT(i,j,by) #4872 #5472
@@ -208,6 +209,7 @@ S3method(format_list_item, data.frame)

 export(fdroplevels, setdroplevels)
 S3method(droplevels, data.table)
+export(frev)

 # sort_by added in R 4.4.0, #6662, https://stat.ethz.ch/pipermail/r-announce/2024/000701.html
 if (getRversion() >= "4.4.0") S3method(sort_by, data.table)

NEWS.md

Lines changed: 11 additions & 1 deletion
@@ -4,6 +4,10 @@

 ## data.table [v1.17.99](https://github.com/Rdatatable/data.table/milestone/35) (in development)

+### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
+
+1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
+
 ### NEW FEATURES

 1. New `sort_by()` method for data.tables, [#6662](https://github.com/Rdatatable/data.table/issues/6662). It uses `forder()` to improve upon the data.frame method and also match `DT[order(...)]` behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
@@ -46,6 +50,12 @@

 10. `data.table()` and `as.data.table()` with `keep.rownames=TRUE` now extract row names from named vectors, matching `data.frame()` behavior. Names from the first named vector in the input are used to create the row names column (default name `"rn"` or custom name via `keep.rownames="column_name"`), [#1916](https://github.com/Rdatatable/data.table/issues/1916). Thanks to @richierocks for the feature request and @Mukulyadav2004 for the implementation.

+11. New `frev(x)` as a faster analogue to `base::rev()` for atomic vectors/lists, [#5885](https://github.com/Rdatatable/data.table/issues/5885). Twice as fast as `base::rev()` on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.
+
+12. New `cbindlist()` and `setcbindlist()` for concatenating a `list` of data.tables column-wise, evocative of the analogous `do.call(rbind, l)` <-> `rbindlist(l)`, [#2576](https://github.com/Rdatatable/data.table/issues/2576). `setcbindlist()` does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
+13. New `mergelist()` and `setmergelist()` similarly work _a la_ `Reduce()` to recursively merge a `list` of data.tables, [#599](https://github.com/Rdatatable/data.table/issues/599). Different join modes (_left_, _inner_, _full_, _right_, _semi_, _anti_, and _cross_) are supported through the `how` argument; duplicate handling goes through the `mult` argument. `setmergelist()` carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
 ### BUG FIXES

 1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR.
@@ -56,7 +66,7 @@

 4. In rare cases, `data.table` failed to expand ALTREP columns when assigning a full column by reference. This could result in the target column getting modified unintentionally if the next call to the data.table was a modification by reference of the source column. E.g. in `DT[, b := as.character(a)]` the string conversion gets deferred and subsequent modification of column `a` would also modify column `b`, [#5400](https://github.com/Rdatatable/data.table/issues/5400). Thanks to @aquasync for the report and Václav Tlapák for the PR.

-5. `data.table()` function is now more aligned with `data.frame()` with respect to the names of the output when one of its inputs is a single-column matrix object, [#4124](https://github.com/Rdatatable/data.table/issues/4124). Thanks @PavoDive for the report, @jangorecki for the PR, and @MichaelChirico for a follow-up for back-compatibility.
+5. `data.table()` function is now more aligned with `data.frame()` with respect to the names of the output when one of its inputs is a single-column matrix object, [#4124](https://github.com/Rdatatable/data.table/issues/4124), [#3193](https://github.com/Rdatatable/data.table/issues/3193), and [#5367](https://github.com/Rdatatable/data.table/issues/5367). Thanks @PavoDive for the report, @jangorecki for the PR, and @MichaelChirico for a follow-up for back-compatibility.

 6. Including an `ITime` object as a named input to `data.frame()` respects the provided name, i.e. `data.frame(a = as.ITime(...))` will have column `a`, [#4673](https://github.com/Rdatatable/data.table/issues/4673). Thanks @shrektan for the report and @MichaelChirico for the fix.

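
NEWS items 12 and 13 describe `cbindlist()` by analogy with `do.call(rbind, l)` <-> `rbindlist(l)`, and `mergelist()` as working a la `Reduce()`. As a rough illustration of those two analogies only, the sketch below uses base R (`data.frame`, `merge()`, `cbind()`) rather than the new functions; the sample tables and the variable names are made up for the example.

```r
# Base R only: three small tables sharing an 'id' key (invented sample data).
l <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = c(1L, 3L), b = c(10, 30)),
  data.frame(id = 2:3, c = c(TRUE, FALSE))
)

# Pairwise left joins, folded left to right over the list with Reduce().
left_join_all <- Reduce(
  function(lhs, rhs) merge(lhs, rhs, by = "id", all.x = TRUE),
  l
)

# Column-wise concatenation of equal-height tables, the do.call() way.
wide <- do.call(cbind, list(data.frame(p = 1:2), data.frame(q = c("a", "b"))))
```

Per the NEWS text, `mergelist()` additionally exposes the join mode through `how` and duplicate handling through `mult`, which this base R fold does not model.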

R/as.data.table.R

Lines changed: 9 additions & 9 deletions
@@ -36,11 +36,11 @@ as.data.table.table = function(x, keep.rownames=FALSE, key=NULL, ...) {
   # prevent #4179 & just cut out here
   if (any(dim(x) == 0L)) return(null.data.table())
   # Fix for bug #43 - order of columns are different when doing as.data.table(with(DT, table(x, y)))
-  val = rev(dimnames(provideDimnames(x)))
+  val = frev(dimnames(provideDimnames(x)))
   if (is.null(names(val)) || !any(nzchar(names(val))))
-    setattr(val, 'names', paste0("V", rev(seq_along(val))))
+    setattr(val, 'names', paste0("V", frev(seq_along(val))))
   ans = data.table(do.call(CJ, c(val, sorted=FALSE)), N = as.vector(x), key=key)
-  setcolorder(ans, c(rev(head(names(ans), -1L)), "N"))
+  setcolorder(ans, c(frev(head(names(ans), -1L)), "N"))
   ans
 }

@@ -50,7 +50,7 @@ as.data.table.matrix = function(x, keep.rownames=FALSE, key=NULL, ...) {
     ans = data.table(rn=rownames(x), x, keep.rownames=FALSE)
     # auto-inferred name 'x' is not back-compatible & inconsistent, #7145
     if (ncol(x) == 1L && is.null(colnames(x)))
-      setnames(ans, 'x', 'V1')
+      setnames(ans, 'x', 'V1', skip_absent=TRUE)
     if (is.character(keep.rownames))
       setnames(ans, 'rn', keep.rownames[1L])
     return(ans)
@@ -104,18 +104,18 @@ as.data.table.array = function(x, keep.rownames=FALSE, key=NULL, sorted=TRUE, va
     dnx[nulldnx] = lapply(dx[nulldnx], seq_len) #3636
     dnx
   } else dnx
-  val = rev(val)
+  setfrev(val)
   if (is.null(names(val)) || !any(nzchar(names(val))))
-    setattr(val, 'names', paste0("V", rev(seq_along(val))))
+    setattr(val, 'names', paste0("V", frev(seq_along(val))))
   if (value.name %chin% names(val))
-    stopf("Argument 'value.name' should not overlap with column names in result: %s", brackify(rev(names(val))))
+    stopf("Argument 'value.name' should not overlap with column names in result: %s", brackify(frev(names(val))))
   N = NULL
   ans = do.call(CJ, c(val, sorted=FALSE))
   set(ans, j="N", value=as.vector(x))
   if (isTRUE(na.rm))
     ans = ans[!is.na(N)]
   setnames(ans, "N", value.name)
-  dims = rev(head(names(ans), -1L))
+  dims = frev(head(names(ans), -1L))
   setcolorder(ans, c(dims, value.name))
   if (isTRUE(sorted) && is.null(key)) key = dims
   setkeyv(ans, key)
@@ -162,7 +162,7 @@ as.data.table.list = function(x,
       xi = x[[i]] = as.POSIXct(xi)
     } else if (is.matrix(xi) || is.data.frame(xi)) {
       if (!is.data.table(xi)) {
-        if (is.matrix(xi) && NCOL(xi)<=1L && is.null(colnames(xi))) { # 1 column matrix naming #4124
+        if (is.matrix(xi) && NCOL(xi)==1L && is.null(colnames(xi)) && isFALSE(getOption('datatable.old.matrix.autoname'))) { # 1 column matrix naming #4124
           xi = x[[i]] = c(xi)
         } else {
           xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
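
The last hunk gates the new single-column-matrix naming behind `isFALSE(getOption('datatable.old.matrix.autoname'))`. A small base R sketch of how such a gate behaves for the three possible states of an option follows; the option name `gr.example.old.matrix.autoname` and the helper `new_naming_active()` are hypothetical stand-ins, not part of data.table.

```r
# Hypothetical option name and helper, for illustration only.
new_naming_active <- function() {
  isFALSE(getOption("gr.example.old.matrix.autoname"))
}

options(gr.example.old.matrix.autoname = NULL)   # option unset: getOption() returns NULL
unset_result <- new_naming_active()

options(gr.example.old.matrix.autoname = TRUE)   # old behavior requested
true_result <- new_naming_active()

options(gr.example.old.matrix.autoname = FALSE)  # explicit opt-in to the new naming
false_result <- new_naming_active()

options(gr.example.old.matrix.autoname = NULL)   # clean up
```

Because `isFALSE(NULL)` is `FALSE`, only an explicit `FALSE` switches the gate on; an unset option and `TRUE` both keep the old path.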

R/bmerge.R

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
     }
     if (x_merge_type=="integer64" || i_merge_type=="integer64") {
       nm = c(iname, xname)
-      if (x_merge_type=="integer64") { w=i; wc=icol; wclass=i_merge_type; } else { w=x; wc=xcol; wclass=x_merge_type; nm=rev(nm) } # w is which to coerce
+      if (x_merge_type=="integer64") { w=i; wc=icol; wclass=i_merge_type; } else { w=x; wc=xcol; wclass=x_merge_type; setfrev(nm) } # w is which to coerce
       if (wclass=="integer" || (wclass=="double" && fitsInInt64(w[[wc]]))) {
         from_detail = if (wclass == "double") gettext(" (which has integer64 representation, e.g. no fractions)") else ""
         coerce_col(w, wc, wclass, "integer64", nm[1L], nm[2L], from_detail, verbose=verbose)

R/cedta.R

Lines changed: 0 additions & 3 deletions
@@ -1,6 +1,4 @@

-cedta.override = NULL # If no need arises, will deprecate.
-
 cedta.pkgEvalsUserCode = c("gWidgetsWWW","statET","FastRWeb","slidify","rmarkdown","knitr","ezknitr","IRkernel", "rtvs")
 # These packages run user code in their own environment and thus do not
 # themselves Depend or Import data.table. knitr's eval is passed envir=globalenv() so doesn't
@@ -72,7 +70,6 @@ cedta = function(n=2L) {
       (all(c("FUN", "X") %chin% ls(parent.frame(n))) ||
        .any_sd_queries_in_stack(sc))) ||
     (nsname %chin% cedta.pkgEvalsUserCode && .any_eval_calls_in_stack()) ||
-    nsname %chin% cedta.override ||
     isTRUE(ns$.datatable.aware) || # As of Sep 2018: RCAS, caretEnsemble, dtplyr, rstanarm, rbokeh, CEMiTool, rqdatatable, RImmPort, BPRMeth, rlist
     tryCatch("data.table" %chin% get(".Depends",paste("package",nsname,sep=":"),inherits=FALSE),error=function(e)FALSE) # both ns$.Depends and get(.Depends,ns) are not sufficient
   if (!ans && getOption("datatable.verbose")) {

R/data.table.R

Lines changed: 2 additions & 1 deletion
@@ -221,7 +221,7 @@ replace_dot_alias = function(e) {
     }
     return(x)
   }
-  if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first", "last", "all", "error")) stopf("mult argument can only be 'first', 'last', 'all' or 'error'")
   missingroll = missing(roll)
   if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
   if (is.character(roll)) {
@@ -520,6 +520,7 @@ replace_dot_alias = function(e) {
       }
       i = .shallow(i, retain.key = TRUE)
       ans = bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, ops, verbose=verbose)
+      if (mult == "error") mult = "all" ## error should have been raised inside bmerge() call above already, if it wasn't continue as mult="all"
       xo = ans$xo ## to make it available for further use.
       # temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
       # 'setorder', as there's another 'setorder' in generating 'irows' below...

R/mergelist.R

Lines changed: 102 additions & 4 deletions
@@ -9,7 +9,7 @@ cbindlist_impl_ = function(l, copy) {
 }

 cbindlist = function(l) cbindlist_impl_(l, copy=TRUE)
-setcbindlist = function(l) cbindlist_impl_(l, copy=FALSE)
+setcbindlist = function(l) invisible(cbindlist_impl_(l, copy=FALSE))

 # when 'on' is missing then use keys, used only for inner and full join
 onkeys = function(x, y) {
@@ -157,9 +157,9 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
       stopf("'on' is missing and necessary key is not present")
     }
     if (any(bad.on <- !on %chin% names(lhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
     if (any(bad.on <- !on %chin% names(rhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
   } else if (is.null(on)) {
     on = character() ## cross join only
   }
@@ -203,7 +203,7 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
       copy_x = TRUE
     ## ensure no duplicated column names in merge results
     if (any(dup.i <- names(out.i) %chin% names(out.x)))
-      stopf("merge result has duplicated column names [%s], use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
+      stopf("merge result has duplicated column names %s, use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
   }

   ## stack i and x
@@ -257,6 +257,104 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
   setDT(out)
 }

+mergelist_impl_ = function(l, on, cols, how, mult, join.many, copy) {
+  verbose = getOption("datatable.verbose")
+  if (verbose)
+    p = proc.time()[[3L]]
+
+  if (!is.list(l) || is.data.frame(l))
+    stopf("'%s' must be a list", "l")
+  if (!all(vapply_1b(l, is.data.table)))
+    stopf("Every element of 'l' list must be data.table objects")
+  if (!all(idx <- lengths(l) > 0L))
+    stopf("Tables in 'l' must all have columns, but these entries have 0: %s", brackify(which(!idx)))
+  if (any(idx <- vapply_1i(l, function(x) anyDuplicated(names(x))) > 0L))
+    stopf("Column names in individual 'l' entries must be unique, but these have some duplicates: %s", brackify(which(idx)))
+
+  n = length(l)
+  if (n < 2L) {
+    out = if (n) l[[1L]] else as.data.table(l)
+    if (copy) out = copy(out)
+    if (verbose)
+      catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)
+    return(out)
+  }
+
+  if (!is.list(join.many))
+    join.many = rep(list(join.many), n - 1L)
+  if (length(join.many) != n - 1L || !all(vapply_1b(join.many, isTRUEorFALSE)))
+    stopf("'join.many' must be TRUE or FALSE, or a list of such whose length must be length(l)-1L")
+
+  if (missing(mult))
+    mult = NULL
+  if (!is.list(mult))
+    mult = rep(list(mult), n - 1L)
+  if (length(mult) != n - 1L || !all(vapply_1b(mult, function(x) is.null(x) || (is.character(x) && length(x) == 1L && !anyNA(x) && x %chin% c("error", "all", "first", "last")))))
+    stopf("'mult' must be one of [error, all, first, last] or NULL, or a list of such whose length must be length(l)-1L")
+
+  if (!is.list(how))
+    how = rep(list(how), n-1L)
+  if (length(how)!=n-1L || !all(vapply_1b(how, function(x) is.character(x) && length(x)==1L && !anyNA(x) && x %chin% c("left", "inner", "full", "right", "semi", "anti", "cross"))))
+    stopf("'how' must be one of [left, inner, full, right, semi, anti, cross], or a list of such whose length must be length(l)-1L")
+
+  if (is.null(cols)) {
+    cols = vector("list", n)
+  } else {
+    if (!is.list(cols))
+      stopf("'%s' must be a list", "cols")
+    if (length(cols) != n)
+      stopf("'cols' must be same length as 'l' (%d != %d)", length(cols), n)
+    skip = vapply_1b(cols, is.null)
+    if (!all(vapply_1b(cols[!skip], function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x))))
+      stopf("'cols' must be a list of non-zero length, non-NA, non-duplicated, character vectors, or eventually NULLs (all columns)")
+    if (any(mapply(function(x, icols) !all(icols %chin% names(x)), l[!skip], cols[!skip])))
+      stopf("'cols' specify columns not present in corresponding table")
+  }
+
+  if (missing(on) || is.null(on)) {
+    on = vector("list", n - 1L)
+  } else {
+    if (!is.list(on))
+      on = rep(list(on), n - 1L)
+    if (length(on) != n-1L || !all(vapply_1b(on, function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x)))) ## length checked in dtmerge
+      stopf("'on' must be non-NA, non-duplicated, character vector, or a list of such which length must be length(l)-1L")
+  }
+
+  l.mem = lapply(l, vapply, address, "")
+  out = l[[1L]]
+  out.cols = cols[[1L]]
+  for (join.i in seq_len(n - 1L)) {
+    rhs.i = join.i + 1L
+    out = mergepair(
+      lhs = out, rhs = l[[rhs.i]],
+      on = on[[join.i]],
+      how = how[[join.i]], mult = mult[[join.i]],
+      lhs.cols = out.cols, rhs.cols = cols[[rhs.i]],
+      copy = FALSE, ## avoid any copies inside, will copy once below
+      join.many = join.many[[join.i]],
+      verbose = verbose
+    )
+    out.cols = copy(names(out))
+  }
+  out.mem = vapply_1c(out, address)
+  if (copy)
+    .Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
+  if (verbose)
+    catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]] - p)
+  out
+}
+
+mergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  mergelist_impl_(l, on, cols, how, mult, join.many, copy=TRUE)
+}
+setmergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  invisible(mergelist_impl_(l, on, cols, how, mult, join.many, copy=FALSE))
+}
+
 # Previously, we had a custom C implementation here, which is ~2x faster,
 # but this is fast enough we don't bother maintaining a new routine.
 # Hopefully in the future rep() can recognize the ALTREP and use that, too.
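
In `mergelist_impl_()` above, `how`, `mult`, `join.many`, and `on` each accept either a single setting or a list with one entry per join, that is `length(l)-1L` entries. The following base R sketch shows that recycle-then-validate pattern in isolation. The helper name `per_join` is hypothetical, and it uses base `vapply()` and `%in%` where the package code uses its internal `vapply_1b()` and `%chin%`.

```r
# Hypothetical helper (not part of data.table): expand a scalar setting to
# one entry per join, then check every entry against the allowed values.
per_join <- function(arg, n, allowed) {
  if (!is.list(arg))
    arg <- rep(list(arg), n - 1L)          # one copy per pairwise join
  valid_one <- function(x)
    is.character(x) && length(x) == 1L && !anyNA(x) && x %in% allowed
  if (length(arg) != n - 1L || !all(vapply(arg, valid_one, logical(1L))))
    stop("must be one of [", paste(allowed, collapse = ", "),
         "], or a list of such whose length is n-1")
  arg
}

hows <- c("left", "inner", "full", "right", "semi", "anti", "cross")

recycled <- per_join("left", 3L, hows)                  # scalar reused for both joins
explicit <- per_join(list("left", "inner"), 3L, hows)   # one setting per join
```

With three tables there are two joins, so a scalar is recycled twice, while a list must already have length two.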

R/onLoad.R

Lines changed: 7 additions & 4 deletions
@@ -73,7 +73,8 @@
   # In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fread will be called in a loop many times, plus they
   # are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit.
   # Improvement to base::getOption() now submitted (100x; 5s down to 0.05s): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394
-  opts = c("datatable.verbose"="FALSE", # datatable.<argument name>
+  opts = c(
+    "datatable.verbose"="FALSE", # datatable.<argument name>
     "datatable.optimize"="Inf", # datatable.<argument name>
     "datatable.print.nrows"="100L", # datatable.<argument name>
     "datatable.print.topn"="5L", # datatable.<argument name>
@@ -85,12 +86,14 @@
     "datatable.show.indices"="FALSE", # for print.data.table
     "datatable.allow.cartesian"="FALSE", # datatable.<argument name>
     "datatable.join.many"="TRUE", # mergelist, [.data.table #4383 #914
-    "datatable.dfdispatchwarn"="TRUE", # not a function argument
-    "datatable.warnredundantby"="TRUE", # not a function argument
+    "datatable.dfdispatchwarn"="TRUE", # not a function argument
+    "datatable.warnredundantby"="TRUE", # not a function argument
     "datatable.alloccol"="1024L", # argument 'n' of alloc.col. Over-allocate 1024 spare column slots
     "datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster
     "datatable.use.index"="TRUE", # global switch to address #1422
-    "datatable.prettyprint.char" = NULL # FR #1091
+    "datatable.prettyprint.char" = NULL, # FR #1091
+    "datatable.old.matrix.autoname"="TRUE", # #7145: how data.table(x=1, matrix(1)) is auto-named set to change
+    NULL
   )
   for (i in setdiff(names(opts),names(options()))) {
     eval(parse(text=paste0("options(",i,"=",opts[i],")")))
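
The hunk above ends the `opts` vector with a bare `NULL` and then sets only those options that are not already present. A minimal base R sketch of both points follows. The `gr.example.*` option names are hypothetical, chosen so they do not clash with real options.

```r
# Hypothetical option names, for illustration only.
options(gr.example.verbose = TRUE)   # pretend the user set this before load

opts <- c(
  "gr.example.verbose" = "FALSE",
  "gr.example.nrows"   = "100L",
  NULL                               # c() drops NULL, so every real entry can end with a comma
)

# Defaults are stored as strings and parsed as R code, so "100L" becomes an
# integer. Only options that are not already set receive their default.
for (i in setdiff(names(opts), names(options()))) {
  eval(parse(text = paste0("options(", i, "=", opts[i], ")")))
}

verbose_after <- getOption("gr.example.verbose")
nrows_after <- getOption("gr.example.nrows")
```

Since `c()` discards `NULL` elements, the trailing `NULL` adds no entry to `opts`; the same holds for an element written as `name = NULL`.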
