Commit c646d54

Merge branch 'master' into frev
2 parents bbbff7e + 46816e8 commit c646d54

38 files changed, +1489 -222 lines changed

NAMESPACE

Lines changed: 2 additions & 0 deletions
@@ -59,8 +59,10 @@ export(nafill)
 export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
+export(cbindlist, setcbindlist)
 export(substitute2)
 #export(DT) # mtcars |> DT(i,j,by) #4872 #5472
+export(fctr)

 S3method("[", data.table)
 S3method("[<-", data.table)

NEWS.md

Lines changed: 37 additions & 1 deletion
@@ -18,7 +18,31 @@

 6. `between()` gains the argument `ignore_tzone=FALSE`. Normally, a difference in time zone between `lower` and `upper` will produce an error, and a difference in time zone between `x` and either of the others will produce a message. Setting `ignore_tzone=TRUE` bypasses the checks, allowing both comparisons to proceed without error or message about time zones.

-7. New `frev(x)` as a faster analogue to `base::rev()` for atomic vectors/lists, [#5885](https://github.com/Rdatatable/data.table/issues/5885). Twice as fast as `base::rev()` on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.
+7. New helper function `fctr` as an extended version of `factor()`, [#4837](https://github.com/Rdatatable/data.table/issues/4837). Most notably, it supports (1) retaining input level ordering by default, i.e. `levels=unique(x)` as opposed to `levels = sort(unique(x))`; (2) `rev=` to reverse the levels; and (3) `sort=` to allow more feature parity with `factor()`. The choice of default is motivated by convenience in the common case when order of elements needs be preserved, for example when using `dcast` or adding a legend to a plot. This also matches the default sort ordering of groups in `by=`.
+
+```r
+d = data.table(id1=rep(1:2, each=3L), id2=letters[c(4:3,5L,3:5)], v1=1:6)
+dcast(d, id1 ~ factor(id2))
+# id1 c d e
+# 1: 1 2 1 3
+# 2: 2 4 5 6
+dcast(d, id1 ~ fctr(id2))
+# id1 d c e
+# 1: 1 1 2 3
+# 2: 2 5 4 6
+dcast(d, id1 ~ fctr(id2, sort=TRUE)) # same as factor()
+# id1 c d e
+# 1: 1 2 1 3
+# 2: 2 4 5 6
+dcast(d, id1 ~ fctr(id2, rev=TRUE))
+# id1 e c d
+# 1: 1 3 2 1
+# 2: 2 6 4 5
+```
+
+8. `groupingsets()` gets a new argument `enclos` for use together with the `jj` argument in functions wrapping `groupingsets()`, including the existing wrappers `rollup()` and `cube()`. When forwarding a `j`-expression as `groupingsets(jj = substitute(j))`, make sure to pass `enclos = parent.frame()` as well, so that the `j`-expression will be evaluated in the right context. This makes it possible for `j` to refer to variables outside the `data.table`.
+
+9. New `frev(x)` as a faster analogue to `base::rev()` for atomic vectors/lists, [#5885](https://github.com/Rdatatable/data.table/issues/5885). Twice as fast as `base::rev()` on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.

 ### BUG FIXES

@@ -46,6 +70,18 @@

 12. Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, [#7070](https://github.com/Rdatatable/data.table/pull/7070).

+13. In rare cases, `data.table` failed to expand ALTREP columns when assigning a full column by reference. This could result in the target column getting modified unintentionally if the next call to the data.table was a modification by reference of the source column. E.g. in `DT[, b := as.character(a)]` the string conversion gets deferred and subsequent modification of column `a` would also modify column `b`, [#5400](https://github.com/Rdatatable/data.table/issues/5400). Thanks to @aquasync for the report and Václav Tlapák for the PR.
+
+14. `data.table()` function is now more aligned with `data.frame()` with respect to the names of the output when one of its inputs is a single-column matrix object, [#4124](https://github.com/Rdatatable/data.table/issues/4124). Thanks @PavoDive for the report and @jangorecki for the PR.
+
+15. Including an `ITime` object as a named input to `data.frame()` respects the provided name, i.e. `data.frame(a = as.ITime(...))` will have column `a`, [#4673](https://github.com/Rdatatable/data.table/issues/4673). Thanks @shrektan for the report and @MichaelChirico for the fix.
+
+16. `fread()` now handles the `na.strings` argument for quoted text columns, making it possible to specify `na.strings = '""'` and read empty quoted strings as `NA`s, [#6974](https://github.com/Rdatatable/data.table/issues/6974). Thanks to @AngelFelizR for the report and @aitap for the PR.
+
+17. A data.table with a column of class `vctrs_list_of` (from package {vctrs}) prints as expected, [#5948](https://github.com/Rdatatable/data.table/issues/5948). Before, they could be printed messily, e.g. printing every entry in a nested data.frame. Thanks @jesse-smith for the report, @DavisVaughan and @r2evans for contributing, and @MichaelChirico for the PR.
+
+18. Fixed incorrect sorting of merges where the first column of a key is a factor with non-`sort()`-ed levels (e.g. `factor(1:2, 2:1)` and it is joined to a character column, [#5361](https://github.com/Rdatatable/data.table/issues/5361). Thanks to @gbrunick for the report and Benjamin Schwendinger for the fix.
+
 ### NOTES

 1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.
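
Note on NEWS item 7: the level-ordering rules it describes can be approximated in base R. The sketch below is not the `fctr` implementation from this commit; `fctr_sketch` is a hypothetical stand-in, and applying `sort` before `rev` when both are set is an assumption. The input vector matches the `id2` column of the NEWS example.

```r
# Hypothetical base-R stand-in for the behaviour described in NEWS item 7.
# Not data.table's fctr(); the order of applying sort and rev is an assumption.
fctr_sketch = function(x, rev = FALSE, sort = FALSE) {
  lev = unique(x)               # default: levels in order of first appearance
  if (sort) lev = sort(lev)     # sort=TRUE mimics factor()'s default
  if (rev) lev = rev(lev)       # rev=TRUE reverses the chosen order
  factor(x, levels = lev)
}

x = letters[c(4:3, 5L, 3:5)]            # "d" "c" "e" "c" "d" "e"
levels(factor(x))                       # "c" "d" "e"
levels(fctr_sketch(x))                  # "d" "c" "e"
levels(fctr_sketch(x, sort = TRUE))     # "c" "d" "e"
levels(fctr_sketch(x, rev = TRUE))      # "e" "c" "d"
```

The four level orders line up with the column orders of the four `dcast()` calls in the NEWS entry.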

R/IDateTime.R

Lines changed: 3 additions & 2 deletions
@@ -209,7 +209,7 @@ as.character.ITime = format.ITime = function(x, ...) {
   res
 }

-as.data.frame.ITime = function(x, ...) {
+as.data.frame.ITime = function(x, ..., optional=FALSE) {
   # This method is just for ggplot2, #1713
   # Avoids the error "cannot coerce class '"ITime"' into a data.frame", but for some reason
   # ggplot2 doesn't seem to call the print method to get axis labels, so still prints integers.
@@ -219,7 +219,8 @@ as.data.frame.ITime = function(x, ...) {
   # ans = list(as.POSIXct(x,tzone="")) # ggplot2 gives "Error: Discrete value supplied to continuous scale"
   setattr(ans, "class", "data.frame")
   setattr(ans, "row.names", .set_row_names(length(x)))
-  setattr(ans, "names", "V1")
+  # require 'optional' support for passing back to e.g. data.frame() without overriding names there
+  if (!optional) setattr(ans, "names", "V1")
   ans
 }
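
Note on the `optional` change: `base::data.frame()` converts each argument with `as.data.frame(..., optional = TRUE)` and, for a single-column result, adopts whatever name the method returns in preference to the caller's argument name. The sketch below reproduces that with two toy S3 classes; `toy_old`, `toy_new` and `toy_frame` are hypothetical names for illustration and have nothing to do with `ITime` itself.

```r
# Toy classes (hypothetical) showing why the method must honour `optional`.
toy_frame = function(x, set_name) {
  ans = list(unclass(x))
  if (set_name) names(ans) = "V1"
  structure(ans, row.names = .set_row_names(length(x)), class = "data.frame")
}

# Old behaviour: always names the column "V1".
as.data.frame.toy_old = function(x, ...) toy_frame(x, set_name = TRUE)
# New behaviour: leaves the column unnamed when optional=TRUE.
as.data.frame.toy_new = function(x, ..., optional = FALSE) toy_frame(x, set_name = !optional)

old = structure(1:3, class = "toy_old")
new = structure(1:3, class = "toy_new")

names(data.frame(a = old))   # "V1": the caller's name is overridden
names(data.frame(a = new))   # "a":  the caller's name survives
names(as.data.frame(new))    # "V1": a direct call keeps the default name
```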

R/as.data.table.R

Lines changed: 5 additions & 1 deletion
@@ -142,7 +142,11 @@ as.data.table.list = function(x,
       xi = x[[i]] = as.POSIXct(xi)
     } else if (is.matrix(xi) || is.data.frame(xi)) {
       if (!is.data.table(xi)) {
-        xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
+        if (is.matrix(xi) && NCOL(xi)<=1L && is.null(colnames(xi))) { # 1 column matrix naming #4124
+          xi = x[[i]] = c(xi)
+        } else {
+          xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
+        }
       }
       # else avoid dispatching to as.data.table.data.table (which exists and copies)
     } else if (is.table(xi)) {
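
Note on the single-column matrix branch: `c(xi)` drops the `dim` attribute, so an unnamed one-column matrix is treated like a plain vector and can take its name from the enclosing call. A minimal base-R sketch of the same condition follows; `flatten_unnamed_column` is a hypothetical helper name, and `data.frame()` is shown only as the reference behaviour that NEWS item 14 says `data.table()` now follows.

```r
# Hypothetical helper mirroring the condition added in the hunk above.
flatten_unnamed_column = function(xi) {
  if (is.matrix(xi) && NCOL(xi) <= 1L && is.null(colnames(xi))) c(xi) else xi
}

m = matrix(1:3, ncol = 1L)
v = flatten_unnamed_column(m)
is.matrix(v)                 # FALSE: dim attribute dropped by c()
identical(v, 1:3)            # TRUE

# A matrix that carries its own column name is left alone.
named = matrix(1:3, ncol = 1L, dimnames = list(NULL, "b"))
is.matrix(flatten_unnamed_column(named))   # TRUE

# Reference behaviour in base R for an unnamed one-column matrix.
names(data.frame(a = m))     # "a"
```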

R/bmerge.R

Lines changed: 8 additions & 3 deletions
@@ -27,6 +27,13 @@ coerce_col = function(dt, col, from_type, to_type, from_name, to_name, from_deta

 bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbose)
 {
+  if (roll != 0.0 && length(icols)) {
+    last_x_idx = tail(xcols, 1L)
+    last_i_idx = tail(icols, 1L)
+    if (is.factor(x[[last_x_idx]]) || is.factor(i[[last_i_idx]]))
+      stopf("Attempting roll join on factor column when joining x.%s to i.%s. Only integer, double or character columns may be roll joined.", names(x)[last_x_idx], names(i)[last_i_idx])
+  }
+
   callersi = i
   i = shallow(i)
   # Just before the call to bmerge() in [.data.table there is a shallow() copy of i to prevent coercions here
@@ -64,9 +71,8 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
     iname = paste0("i.", names(i)[icol])
     if (!x_merge_type %chin% supported) stopf("%s is type %s which is not supported by data.table join", xname, x_merge_type)
     if (!i_merge_type %chin% supported) stopf("%s is type %s which is not supported by data.table join", iname, i_merge_type)
+    # we check factors first because they might have different levels
     if (x_merge_type=="factor" || i_merge_type=="factor") {
-      if (roll!=0.0 && a==length(icols))
-        stopf("Attempting roll join on factor column when joining %s to %s. Only integer, double or character columns may be roll joined.", xname, iname)
       if (x_merge_type=="factor" && i_merge_type=="factor") {
         if (verbose) catf("Matching %s factor levels to %s factor levels.\n", iname, xname)
         set(i, j=icol, value=chmatch(levels(i[[icol]]), levels(x[[xcol]]), nomatch=0L)[i[[icol]]]) # nomatch=0L otherwise a level that is missing would match to NA values
@@ -86,7 +92,6 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
       }
       stopf("Incompatible join types: %s (%s) and %s (%s). Factor columns must join to factor or character columns.", xname, x_merge_type, iname, i_merge_type)
     }
-    # we check factors first to cater for the case when trying to do rolling joins on factors
     if (x_merge_type == i_merge_type) {
       if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, x_merge_type, xname)
       next
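
Note on the new comment "we check factors first because they might have different levels": the factor-to-factor branch re-expresses the integer codes of `i` in terms of the levels of `x`. The sketch below shows the same remapping with `base::match()` swapped in for data.table's `chmatch()`; `x_col` and `i_col` are made-up example columns.

```r
# Illustrative columns (hypothetical); the two factors have different level sets.
x_col = factor(c("a", "b", "c"))     # levels: a b c
i_col = factor(c("c", "a", "z"))     # levels: a c z

# For each level of i, find its position among the levels of x (0 = absent).
level_map = match(levels(i_col), levels(x_col), nomatch = 0L)
level_map                            # 1 3 0

# Index the map by i's integer codes to get codes valid against x's levels.
remapped = level_map[as.integer(i_col)]
remapped                             # 3 1 0
```

The value `"z"` has no counterpart in `x_col`, so it maps to `0L` rather than `NA`, which is the purpose the inline comment in the hunk gives for `nomatch=0L`.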

R/data.table.R

Lines changed: 64 additions & 22 deletions
@@ -221,7 +221,7 @@ replace_dot_alias = function(e) {
       }
       return(x)
     }
-    if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+    if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
     missingroll = missing(roll)
     if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
     if (is.character(roll)) {
@@ -542,12 +542,25 @@ replace_dot_alias = function(e) {
           # Really, `anyDuplicated` in base is AWESOME!
           # allow.cartesian shouldn't error if a) not-join, b) 'i' has no duplicates
           if (verbose) {last.started.at=proc.time();catf("Constructing irows for '!byjoin || nqbyjoin' ... ");flush.console()}
-          irows = if (allLen1) f__ else vecseq(f__,len__,
-            if (allow.cartesian ||
-                notjoin || # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
-                !anyDuplicated(f__, incomparables = c(0L, NA_integer_))) {
-              NULL # #742. If 'i' has no duplicates, ignore
-            } else as.double(nrow(x)+nrow(i))) # rows in i might not match to x so old max(nrow(x),nrow(i)) wasn't enough. But this limit now only applies when there are duplicates present so the reason now for nrow(x)+nrow(i) is just to nail it down and be bigger than max(nrow(x),nrow(i)).
+          if (allLen1) {
+            irows = f__
+          } else {
+            join.many = isTRUE(getOption("datatable.join.many")) # #914, default TRUE for backward compatibility
+            anyDups = !notjoin &&
+              (
+                # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
+                (join.many && !allow.cartesian) ||
+                # special case of scalar i match to const duplicated x, not handled by anyDuplicate: data.table(x=c(1L,1L))[data.table(x=1L), on="x"]
+                (!join.many && (length(f__) != 1L || len__ != nrow(x)))
+              ) &&
+              anyDuplicated(f__, incomparables = c(0L, NA_integer_)) > 0L
+            limit = if (anyDups) { # #742. If 'i' has no duplicates, ignore
+              if (!join.many) stopf("Joining resulted in many-to-many join. Perform quality check on your data, use mult!='all', or set 'datatable.join.many' option to TRUE to allow rows explosion.")
+              if (allow.cartesian) internal_error("checking allow.cartesian and join.many, unexpected else branch reached") # nocov
+              as.double(nrow(x)+nrow(i)) # rows in i might not match to x so old max(nrow(x),nrow(i)) wasn't enough. But this limit now only applies when there are duplicates present so the reason now for nrow(x)+nrow(i) is just to nail it down and be bigger than max(nrow(x),nrow(i)).
+            }
+            irows = vecseq(f__, len__, limit)
+          }
           if (verbose) {cat(timetaken(last.started.at),"\n"); flush.console()}
           # Fix for #1092 and #1074
           # TODO: implement better version of "any"/"all"/"which" to avoid
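
Note on the duplicate check: both the old and the new code call `anyDuplicated(f__, incomparables = c(0L, NA_integer_))`, so repeated `0L` or `NA` entries never count as duplicates. The sketch below shows that behaviour on made-up index vectors; treating `0L`/`NA` as "no match" markers is an assumption about what `f__` holds, not something this diff states.

```r
# Made-up index vectors standing in for f__ (assumption: 0L/NA mean "no match").
no_real_dups = c(1L, 0L, 0L, NA, NA, 2L)
real_dups    = c(1L, 2L, 1L)

# Repeated 0L and NA are ignored, so nothing is reported as duplicated.
anyDuplicated(no_real_dups, incomparables = c(0L, NA_integer_))   # 0

# A repeated ordinary value is reported by the index of its second occurrence.
anyDuplicated(real_dups, incomparables = c(0L, NA_integer_))      # 3

# The new code compares the result with 0L to obtain a plain logical.
anyDuplicated(real_dups, incomparables = c(0L, NA_integer_)) > 0L # TRUE
```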
@@ -1335,21 +1348,8 @@ replace_dot_alias = function(e) {
       ans[icolsAns] = .Call(CsubsetDT, i, ii, icols)
       ans[xcolsAns] = .Call(CsubsetDT, x, irows, xcols)
       setattr(ans, "names", ansvars)
-      if (haskey(x)) {
-        keylen = which.first(!key(x) %chin% ansvars)-1L
-        if (is.na(keylen)) keylen = length(key(x))
-        len = length(rightcols)
-        # fix for #1268, #1704, #1766 and #1823
-        chk = if (len && !missing(on)) !identical(head(key(x), len), names(on)) else FALSE
-        if ( (keylen>len || chk) && !.Call(CisOrderedSubset, irows, nrow(x))) {
-          keylen = if (!chk) len else 0L # fix for #1268
-        }
-        ## check key on i as well!
-        ichk = is.data.table(i) && haskey(i) &&
-               identical(head(key(i), length(leftcols)), names_i[leftcols]) # i has the correct key, #3061
-        if (keylen && (ichk || is.logical(i) || (.Call(CisOrderedSubset, irows, nrow(x)) && ((roll == FALSE) || length(irows) == 1L)))) # see #1010. don't set key when i has no key, but irows is ordered and roll != FALSE
-          setattr(ans,"sorted",head(key(x),keylen))
-      }
+      # NB: could be NULL
+      setattr(ans, "sorted", .join_result_key(x, i, ans, if (!missing(on)) names(on), ansvars, leftcols, rightcols, names_i, irows, roll))
       setattr(ans, "class", class(x)) # retain class that inherits from data.table, #64
       setattr(ans, "row.names", .set_row_names(length(ans[[1L]])))
       setalloccol(ans)
@@ -2021,6 +2021,48 @@ replace_dot_alias = function(e) {
   setalloccol(ans) # TODO: overallocate in dogroups in the first place and remove this line
 }

+# can the specified merge of x and i be marked as sorted? return the columns for which this is true, otherwise NULL
+.join_result_key <- function(x, i, ans, on_lhs, ansvars, leftcols, rightcols, names_i, irows, roll) {
+  x_key <- key(x)
+  if (is.null(x_key))
+    return(NULL)
+
+  key_length = which.first(!x_key %chin% ansvars) - 1L
+  if (is.na(key_length))
+    key_length = length(x_key)
+
+  rhs_length = length(rightcols)
+  # fix for #1268, #1704, #1766 and #1823
+  chk = rhs_length && !is.null(on_lhs) && !identical(head(x_key, rhs_length), on_lhs)
+  if ( (key_length > rhs_length || chk) && !.Call(CisOrderedSubset, irows, nrow(x))) {
+    key_length = if (chk) 0L else rhs_length # fix for #1268
+  }
+
+  if (!key_length)
+    return(NULL)
+
+  # i has the correct key, #3061
+  if (identical(head(key(i), length(leftcols)), names_i[leftcols]))
+    return(head(x_key, key_length))
+
+  if (!.Call(CisOrderedSubset, irows, nrow(x)))
+    return(NULL)
+
+  # see #1010. don't set key when i has no key, but irows is ordered and !roll
+  if (roll && length(irows) != 1L)
+    return(NULL)
+
+  new_key <- head(x_key, key_length)
+
+  #5361 merging on keyed factor with character, check if resulting character is really sorted
+  if (identical(vapply_1c(.shallow(i, leftcols), typeof), vapply_1c(.shallow(x, rightcols), typeof)))
+    return(new_key)
+
+  if (!is.sorted(ans, by=new_key))
+    return(NULL)
+  new_key
+}
+
 # What's the name of the top-level call in 'j'?
 # NB: earlier, we used 'as.character()' but that fails for closures/builtins (#6026).
 root_name = function(jsub) if (is.call(jsub)) paste(deparse(jsub[[1L]]), collapse = " ") else ""
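
Note on the #5361 check at the end of `.join_result_key`: a factor whose levels are not in `sort()` order can be sorted by its integer codes while the same values, read as character, are not sorted. The base-R sketch below shows only that mismatch; it does not reproduce the join, and the factor is a made-up example with the same levels as `factor(1:2, 2:1)` from the NEWS entry.

```r
# A factor with levels in non-sorted order ("2" before "1"), values in code order.
f = factor(c("2", "1"), levels = c("2", "1"))

as.integer(f)                  # 1 2
is.unsorted(as.integer(f))     # FALSE: sorted when viewed as factor codes
is.unsorted(as.character(f))   # TRUE:  not sorted when viewed as character
```

Going by the inline comment, this is the situation the final `is.sorted(ans, by=new_key)` call guards against when the join column types of `i` and `x` differ.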

R/fread.R

Lines changed: 13 additions & 4 deletions
@@ -70,9 +70,13 @@ yaml=FALSE, tmpdir=tempdir(), tz="UTC")
       }
     }
     if (!is.null(cmd)) {
-      (if (.Platform$OS.type == "unix") system else shell)(paste0('(', cmd, ') > ', tmpFile<-tempfile(tmpdir=tmpdir)))
-      file = tmpFile
+      tmpFile = tempfile(tmpdir=tmpdir)
       on.exit(unlink(tmpFile), add=TRUE)
+      status = (if (.Platform$OS.type == "unix") system else shell)(paste0('(', cmd, ') > ', tmpFile))
+      if (status != 0) {
+        stopf("External command failed with exit code %d. This can happen when the disk is full in the temporary directory ('%s'). See ?fread for the tmpdir argument.", status, tmpdir)
+      }
+      file = tmpFile
     }
     if (!is.null(file)) {
       if (!is.character(file) || length(file)!=1L)
@@ -116,9 +120,14 @@ yaml=FALSE, tmpdir=tempdir(), tz="UTC")
       if (!requireNamespace("R.utils", quietly = TRUE))
         stopf("To read %s files directly, fread() requires 'R.utils' package which cannot be found. Please install 'R.utils' using 'install.packages('R.utils')'.", if (w<=2L || gzsig) "gz" else "bz2") # nocov
       FUN = if (w<=2L || gzsig) gzfile else bzfile
-      R.utils::decompressFile(file, decompFile<-tempfile(tmpdir=tmpdir), ext=NULL, FUN=FUN, remove=FALSE) # ext is not used by decompressFile when destname is supplied, but isn't optional
-      file = decompFile # don't use 'tmpFile' symbol again, as tmpFile might be the http://domain.org/file.csv.gz download
+      decompFile = tempfile(tmpdir=tmpdir)
       on.exit(unlink(decompFile), add=TRUE)
+      tryCatch({
+        R.utils::decompressFile(file, decompFile, ext=NULL, FUN=FUN, remove=FALSE) # ext is not used by decompressFile when destname is supplied, but isn't optional
+      }, error = function(e) {
+        stopf("R.utils::decompressFile failed to decompress file '%s':\n %s\n. This can happen when the disk is full in the temporary directory ('%s'). See ?fread for the tmpdir argument.", file, conditionMessage(e), tmpdir)
+      })
+      file = decompFile # don't use 'tmpFile' symbol again, as tmpFile might be the http://domain.org/file.csv.gz download
     }
     file = enc2native(file) # CfreadR cannot handle UTF-8 if that is not the native encoding, see #3078.
