
Commit dddcedb

Merge branch 'master' of https://github.com/Rdatatable/data.table into issue_7219
2 parents e2de9c0 + a837dc9 commit dddcedb

18 files changed: +2343 -798 lines changed

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ S3method(cube, data.table)
 S3method(rollup, data.table)
 export(frollmean)
 export(frollsum)
+export(frollmax)
 export(frollapply)
 export(nafill)
 export(setnafill)

NEWS.md

Lines changed: 91 additions & 1 deletion
@@ -4,6 +4,22 @@

 ## data.table [v1.17.99](https://github.com/Rdatatable/data.table/milestone/35) (in development)

+### BREAKING CHANGE
+
+1. `dcast()` now errors when `fun.aggregate` returns length != 1 (consistent with documentation), regardless of `fill`, [#6629](https://github.com/Rdatatable/data.table/issues/6629). Previously, when `fill` was not `NULL`, `dcast` warned and returned an undefined result. This change has been planned since 1.16.0 (25 Aug 2024).
+
+2. `melt()` returns an integer column for `variable` when `measure.vars` is a list of length=1, consistent with the documented behavior, [#5209](https://github.com/Rdatatable/data.table/issues/5209). Thanks to @tdhock for reporting. Any users who were relying on this behavior can change `measure.vars=list("col_name")` (output `variable` was column name, now is column index/integer) to `measure.vars="col_name"` (`variable` still is column name). This change has been planned since 1.16.0 (25 Aug 2024).
+
+3. Rolling functions `frollmean` and `frollsum` distinguish `Inf`/`-Inf` from `NA` to match the same rules as base R when `algo="fast"` (previously they were considered the same). If your input into those functions has `Inf` or `-Inf` then you will be affected by this change. As a result, the argument that controls the handling of `NA`s has been renamed from `hasNA` to `has.nf` (_has non-finite_). `hasNA` continues to work with a warning, for now.
+    ```r
+    ## before
+    frollsum(c(1,2,3,Inf,5,6), 2)
+    #[1] NA  3  5 NA NA 11
+
+    ## now
+    frollsum(c(1,2,3,Inf,5,6), 2)
+    #[1] NA   3   5 Inf Inf  11
+
 ### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES

 1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
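
Breaking change 3 above says the rolling functions now treat `Inf`/`-Inf` by the same rules as base R. The sketch below illustrates what that means with a deliberately naive rolling sum written in base R only; `roll_sum_naive` is a hypothetical helper name used for illustration and is not how data.table computes the result internally.

```r
# Naive right-aligned rolling sum in base R (illustration only).
# roll_sum_naive is a hypothetical name, not part of data.table.
roll_sum_naive = function(x, n, fill=NA_real_) {
  out = rep(fill, length(x))
  if (length(x) < n) return(out)        # no complete window at all
  for (i in seq.int(n, length(x))) {
    out[i] = sum(x[(i - n + 1L):i])     # base R sum() propagates Inf, it does not turn it into NA
  }
  out
}

res = roll_sum_naive(c(1, 2, 3, Inf, 5, 6), 2)
print(res)
# [1]  NA   3   5 Inf Inf  11
```

The windows that contain `Inf` yield `Inf`, which agrees with the "now" output quoted in the NEWS entry.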
@@ -63,12 +79,84 @@

 12. New `cbindlist()` and `setcbindlist()` for concatenating a `list` of data.tables column-wise, evocative of the analogous `do.call(rbind, l)` <-> `rbindlist(l)`, [#2576](https://github.com/Rdatatable/data.table/issues/2576). `setcbindlist()` does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.

+    ```r
+    l = list(
+      data.table(id = 1:3, a = letters[1:3]),
+      data.table(b = 4:6, c = 7:9)
+    )
+    cbindlist(l)
+    #    id a b c
+    # 1:  1 a 4 7
+    # 2:  2 b 5 8
+    # 3:  3 c 6 9
+    ```
+
 13. New `mergelist()` and `setmergelist()` similarly work _a la_ `Reduce()` to recursively merge a `list` of data.tables, [#599](https://github.com/Rdatatable/data.table/issues/599). Different join modes (_left_, _inner_, _full_, _right_, _semi_, _anti_, and _cross_) are supported through the `how` argument; duplicate handling goes through the `mult` argument. `setmergelist()` carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.

+    ```r
+    l = list(
+      data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c")),
+      data.table(id = c(1L, 2L, 4L), y = c("d", "e", "f")),
+      data.table(id = c(1L, 3L, 4L), z = c("g", "h", "i"))
+    )
+
+    # Recursive inner join
+    mergelist(l, on = "id", how = "inner")
+    #    id x y z
+    # 1:  1 a d g
+
+    # Recursive left join (the default 'how')
+    mergelist(l, on = "id", how = "left")
+    #    id x    y    z
+    # 1:  1 a    d    g
+    # 2:  2 b    e <NA>
+    # 3:  3 c <NA>    h
+    ```
+
 14. `fcoalesce()` and `setcoalesce()` gain `nan` argument to control whether `NaN` values should be treated as missing (`nan=NA`, the default) or non-missing (`nan=NaN`), [#4567](https://github.com/Rdatatable/data.table/issues/4567). This provides full compatibility with `nafill()` behavior. Thanks to @ethanbsmith for the feature request and @Mukulyadav2004 for the implementation.

 15. New function `isoyear()` has been implemented as a complement to `isoweek()`, returning the ISO 8601 year corresponding to a given date, [#7154](https://github.com/Rdatatable/data.table/issues/7154). Thanks to @ben-schwen and @MichaelChirico for the suggestion and @venom1204 for the implementation.

+16. Multiple improvements have been added to rolling functions. The request came from @gpierard, who needed a left-aligned, adaptive rolling max, [#5438](https://github.com/Rdatatable/data.table/issues/5438). There was no `frollmax` function yet. Adaptive rolling functions did not have support for `align="left"`. `frollapply` did not support `adaptive=TRUE`. Available alternatives were base R `mapply` or a self-join using `max` and grouping `by=.EACHI`. Following up on that request, the following features have been added:
+    - new function `frollmax`, applies `max` over a rolling window.
+    - support for `align="left"` for adaptive rolling functions.
+    - support for `adaptive=TRUE` in `frollapply`.
+    - `partial` argument to trim window width to available observations rather than returning `NA` whenever window is not complete.
+    - `give.names` argument that can be used to automatically give the names based on the names of `x` and `n`.
+    - `frollmean` and `frollsum` no longer treat `Inf` and `-Inf` as `NA`s, as they used to for `algo="fast"` (breaking change).
+    - `hasNA` argument has been renamed to `has.nf` to convey that it is not only related to `NA/NaN` but other non-finite values (`Inf/-Inf`) as well.
+
+    Thanks to @jangorecki for the implementation and @MichaelChirico and others for work on splitting into smaller PRs and reviews.
+    For a comprehensive description of all available features, see the `?froll` manual.
+
+    Adaptive `frollmax` has been observed to be around 80 times faster than the second-fastest solution (data.table self-join using `max` and grouping `by=.EACHI`). Note that an important factor in performance is the width of the rolling window. The code for the benchmark below was taken from [this SO answer](https://stackoverflow.com/a/73408459/2490497).
+    ```r
+    set.seed(108)
+    setDTthreads(16)
+    x = data.table(
+      value = cumsum(rnorm(1e6, 0.1)),
+      end_window = 1:1e6 + sample(50:500, 1e6, TRUE),
+      row = 1:1e6
+    )[, "end_window" := pmin(end_window, .N)
+     ][, "len_window" := end_window-row+1L]
+    baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)]
+    sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1
+    frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)]
+    frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")]
+    microbenchmark::microbenchmark(
+      baser(x), sj(x), frmax(x), frapply(x),
+      times=10, check="identical"
+    )
+    #Unit: milliseconds
+    #       expr        min         lq       mean     median         uq        max neval
+    #   baser(x) 3094.88357 3097.84966 3186.74832 3163.58050 3251.66753 3370.33785    10
+    #      sj(x) 2221.55456 2255.12083 2306.61382 2303.47883 2346.70293 2412.62975    10
+    #   frmax(x)   17.45124   24.16809   28.10062   28.58153   32.79802   34.83941    10
+    # frapply(x)  272.07830  316.47060  366.94771  396.23566  416.06699  421.38701    10
+    ```
+
+    As of now, adaptive rolling max has no _on-line_ implementation (`algo="fast"`); it uses a naive approach (`algo="exact"`). Therefore, further speedup is still possible if `algo="fast"` gets implemented.
+
 ### BUG FIXES

 1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR.
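
The `partial` argument listed under item 16 trims the window width to the observations available instead of returning `NA` for incomplete windows. A minimal sketch of that idea follows. `trimn()` is copied from the `R/froll.R` diff in this commit; `asum_right` is a hypothetical, naive adaptive rolling sum written in base R for illustration and is not data.table's implementation.

```r
# trimn() as it appears in the R/froll.R diff of this commit
trimn = function(n, len, align) {
  n = min(n, len)
  if (align=="right")
    c(seq_len(n), rep.int(n, len-n))
  else
    c(rep.int(n, len-n), rev(seq_len(n)))
}

# Hypothetical naive right-aligned adaptive sum: window i covers x[(i-n[i]+1):i]
asum_right = function(x, n) {
  vapply(seq_along(x), function(i) sum(x[(i - n[i] + 1L):i]), numeric(1))
}

x = as.numeric(1:4)
w_right = trimn(3, length(x), "right")  # widths 1 2 3 3: the first windows are shortened
w_left  = trimn(3, length(x), "left")   # widths 3 3 2 1: the last windows are shortened
res = asum_right(x, w_right)            # 1 3 6 9, where a fixed window of 3 would leave the first two positions incomplete
```

The trimmed widths for the right-aligned case match the `3 -> c(1,2,3,3)` comment in `partial2adaptive()`.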
@@ -107,7 +195,9 @@

 18. `fwrite` now allows `dec` to be the same as `sep` for edge cases where only one will be written, e.g. 0-row or 1-column tables. [#7227](https://github.com/Rdatatable/data.table/issues/7227). Thanks @MichaelChirico for the report and @venom1204 for the fix.

-19. `rowwiseDT()` now provides a helpful error message when a complex object that is not a list (e.g., a function) is provided as a cell value, instructing the user to wrap it in `list()`. [#7219](https://github.com/Rdatatable/data.table/issues/7219). Thanks @kylebutts for the report and @venom1204 for the fix.
+19. Ellipsis elements like `..1` are correctly excluded when searching for variables in "up-a-level" syntax inside `[`, [#5460](https://github.com/Rdatatable/data.table/issues/5460). Thanks @ggrothendieck for the report and @MichaelChirico for the fix.
+
+20. `rowwiseDT()` now provides a helpful error message when a complex object that is not a list (e.g., a function) is provided as a cell value, instructing the user to wrap it in `list()`. [#7219](https://github.com/Rdatatable/data.table/issues/7219). Thanks @kylebutts for the report and @venom1204 for the fix.

 ### NOTES


R/data.table.R

Lines changed: 4 additions & 3 deletions
@@ -264,7 +264,8 @@ replace_dot_alias = function(e) {
   if (!missing(j)) {
     jsub = replace_dot_alias(jsub)
     root = root_name(jsub)
-    av = all.vars(jsub)
+    # exclude '..1' etc. for #5460
+    av = grepv("^[.][.](?:[.]|[0-9]+)$", all.vars(jsub), invert=TRUE)
     all..names = FALSE
     if ((.is_withFALSE_range(jsub, x, root, av)) ||
         (root %chin% c("-","!") && jsub[[2L]] %iscall% '(' && jsub[[2L]][[2L]] %iscall% ':') || ## x[, !(V8:V10)]
@@ -1297,8 +1298,8 @@ replace_dot_alias = function(e) {
     SDenv = new.env(parent=parent.frame())

     syms = all.vars(jsub)
-    syms = syms[ startsWith(syms, "..") ]
-    syms = syms[ substr(syms, 3L, 3L) != "." ] # exclude ellipsis
+    syms = syms[startsWith(syms, "..")]
+    syms = grepv("^[.][.](?:[.]|[0-9]+)$", syms, invert=TRUE) # exclude ellipsis and '..n' ellipsis elements
     for (sym in syms) {
       if (sym %chin% names_x) {
         # if "..x" exists as column name, use column, for backwards compatibility; e.g. package socialmixr in rev dep checks #2779
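
To make the effect of the new pattern concrete, here is a small sketch that applies the same regular expression to a hand-written vector of symbol names. It uses base R's `grep(..., value=TRUE)`, which should behave like the `grepv()` call in the diff; the vector `syms` is made up for illustration.

```r
# Pattern copied from the diff: matches the ellipsis '...' and '..N' elements such as '..1'
pattern = "^[.][.](?:[.]|[0-9]+)$"

# Illustrative symbol names (not taken from any real call)
syms = c("...", "..1", "..10", "..x", "..1a", "y")

# invert=TRUE keeps only the names that do NOT match the pattern
kept = grep(pattern, syms, invert=TRUE, value=TRUE)
# kept holds "..x", "..1a" and "y"; the ellipsis and its numbered elements are dropped
```

Names like `..x` still qualify for the "up-a-level" lookup, while `..1` and `..10` are excluded, which is the behaviour change the previous `substr(syms, 3L, 3L) != "."` check did not cover.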

R/froll.R

Lines changed: 164 additions & 12 deletions
@@ -1,21 +1,173 @@
-froll = function(fun, x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-  stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
-  algo = match.arg(algo)
+# helpers for partial2adaptive
+trimn = function(n, len, align) {
+  n = min(n, len) ## so frollsum(1:2, 3, partial=TRUE) works
+  if (align=="right")
+    c(seq_len(n), rep.int(n, len-n))
+  else
+    c(rep.int(n, len-n), rev(seq_len(n)))
+}
+trimnadaptive = function(n, align) {
+  if (align=="right")
+    pmin(n, seq_along(n))
+  else
+    pmin(n, rev(seq_along(n)))
+}
+
+# partial2adaptive helper function
+## tune provided 'n' via partial=TRUE to adaptive=TRUE by prepared adaptive 'n' as shown in ?froll examples
+# partial2adaptive(1:4, 2, "right", adaptive=FALSE)
+# partial2adaptive(1:4, 2:3, "right", adaptive=FALSE)
+# partial2adaptive(list(1:4, 2:5), 2:3, "right", adaptive=FALSE)
+# frollsum(1:4, 2, partial=FALSE, adaptive=FALSE)
+# frollsum(1:4, 2, partial=TRUE, adaptive=FALSE)
+# frollsum(1:4, 2:3, partial=FALSE, adaptive=FALSE)
+# frollsum(1:4, 2:3, partial=TRUE, adaptive=FALSE)
+# frollsum(list(1:4, 2:5), 2:3, partial=FALSE, adaptive=FALSE)
+# frollsum(list(1:4, 2:5), 2:3, partial=TRUE, adaptive=FALSE)
+partial2adaptive = function(x, n, align, adaptive) {
+  if (!length(n))
+    stopf("n must be non 0 length")
+  if (align=="center")
+    stopf("'partial' cannot be used together with align='center'")
+  if (is.list(x) && length(unique(lengths(x))) != 1L)
+    stopf("'partial' does not support variable length of columns in 'x'")
+  len = if (is.list(x)) length(x[[1L]]) else length(x)
+  verbose = getOption("datatable.verbose")
+  if (!adaptive) {
+    if (is.list(n))
+      stopf("n must be an integer, list is accepted for adaptive TRUE")
+    if (!is.numeric(n))
+      stopf("n must be an integer vector or a list of integer vectors")
+    if (verbose)
+      catf("partial2adaptive: froll partial=TRUE trimming 'n' and redirecting to adaptive=TRUE\n")
+    if (length(n) > 1L) {
+      ## c(2,3) -> list(c(1,2,2,2),c(1,2,3,3)) ## for x=1:4
+      lapply(n, len, align, FUN=trimn)
+    } else {
+      ## 3 -> c(1,2,3,3) ## for x=1:4
+      trimn(n, len, align)
+    }
+  } else {
+    if (!(is.numeric(n) || (is.list(n) && all(vapply_1b(n, is.numeric)))))
+      stopf("n must be an integer vector or a list of integer vectors")
+    if (length(unique(lengths(n))) != 1L)
+      stopf("adaptive window provided in 'n' must not to have different lengths")
+    if (is.numeric(n) && length(n) != len)
+      stopf("length of 'n' argument must be equal to number of observations provided in 'x'")
+    if (is.list(n) && length(n[[1L]]) != len)
+      stopf("length of vectors in 'x' must match to length of adaptive window in 'n'")
+    if (verbose)
+      catf("partial2adaptive: froll adaptive=TRUE and partial=TRUE trimming 'n'\n")
+    if (is.numeric(n)) {
+      ## c(3,3,3,2) -> c(1,2,3,2) ## for x=1:4
+      trimnadaptive(n, align)
+    } else {
+      ## list(c(3,3,3,2),c(4,2,3,3)) -> list(c(1,2,3,2),c(1,2,3,3)) ## for x=1:4
+      lapply(n, align, FUN = trimnadaptive)
+    }
+  }
+}
+
+# internal helper for handling give.names=TRUE
+make.roll.names = function(x.len, n.len, n, x.nm, n.nm, fun, adaptive) {
+  if (is.null(n.nm)) {
+    if (!adaptive) {
+      if (!is.numeric(n))
+        stopf("internal error: misuse of make.roll.names, n must be numeric for !adaptive") ## nocov
+      n.nm = paste0("roll", fun, as.character(as.integer(n)))
+    } else {
+      n.nm = paste0("aroll", fun, seq_len(n.len))
+    }
+  } else if (!length(n.nm) && !adaptive)
+    stopf("internal error: misuse of make.roll.names, non-null length 0 n is not possible for !adaptive") ## nocov
+  if (is.null(x.nm)) {
+    x.nm = paste0("V", seq_len(x.len))
+  }
+  ans = if (length(x.nm)) { ## is.list(x) && !is.data.frame(x)
+    if (length(n.nm)) { ## !adaptive || is.list(n)
+      paste(rep(x.nm, each=length(n.nm)), n.nm, sep="_")
+    } else { ## adaptive && is.numeric(n)
+      x.nm
+    }
+  } else { ## (by.column && is.atomic(x)) || (!by.column && is.data.frame(x))
+    if (length(n.nm)) { ## !adaptive || is.list(n)
+      n.nm
+    } else { ## adaptive && is.numeric(n)
+      NULL # nocov ## call to make.roll.names is excluded by is.list(ans) condition before calling it, it will be relevant for !by.column in next PR
+    }
+  }
+  if (!is.null(ans) && length(ans) != x.len*n.len)
+    stopf("internal error: make.roll.names generated names of wrong length") ## nocov
+  ans
+}
+
+froll = function(fun, x, n, fill=NA, algo, align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, FUN, rho, give.names=FALSE) {
   align = match.arg(align)
-  ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, hasNA, adaptive)
+  if (isTRUE(give.names)) {
+    orig = list(n=n, adaptive=adaptive)
+    xnam = if (is.list(x)) names(x) else character()
+    nnam = if (isTRUE(adaptive)) {
+      if (is.list(n)) names(n) else character()
+    } else names(n)
+    nx = if (is.list(x)) length(x) else 1L
+    nn = if (isTRUE(adaptive)) {
+      if (is.list(n)) length(n) else 1L
+    } else length(n)
+  }
+  if (isTRUE(partial)) {
+    n = partial2adaptive(x, n, align, adaptive)
+    adaptive = TRUE
+  }
+  leftadaptive = isTRUE(adaptive) && align=="left"
+  if (leftadaptive) {
+    verbose = getOption("datatable.verbose")
+    rev2 = function(x) if (is.list(x)) lapply(x, rev) else rev(x)
+    if (verbose)
+      catf("froll: adaptive=TRUE && align='left' pre-processing for align='right'\n")
+    x = rev2(x)
+    n = rev2(n)
+    align = "right"
+  } ## support for left adaptive added in #5441
+  if (missing(FUN))
+    ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, has.nf, adaptive)
+  else
+    ans = .Call(CfrollapplyR, FUN, x, n, fill, align, adaptive, rho)
+  if (leftadaptive) {
+    if (verbose)
+      catf("froll: adaptive=TRUE && align='left' post-processing from align='right'\n")
+    ans = rev2(ans)
+  }
+  if (isTRUE(give.names) && is.list(ans)) {
+    nms = make.roll.names(x.len=nx, n.len=nn, n=orig$n, x.nm=xnam, n.nm=nnam, fun=fun, adaptive=orig$adaptive)
+    setattr(ans, "names", nms)
+  }
   ans
 }

-frollmean = function(x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-  froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
+frollfun = function(fun, x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
+  stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
+  if (!missing(hasNA)) {
+    if (!is.na(has.nf))
+      stopf("hasNA is deprecated, use has.nf instead")
+    warningf("hasNA is deprecated, use has.nf instead")
+    has.nf = hasNA
+  } # remove check on next major release
+  algo = match.arg(algo)
+  froll(fun=fun, x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, give.names=give.names)
+}
+
+frollmean = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
+  frollfun(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
 }
-frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-  froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
+frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
+  frollfun(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
 }
-frollapply = function(x, n, FUN, ..., fill=NA, align=c("right", "left", "center")) {
+frollmax = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
+  frollfun(fun="max", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
+}
+
+frollapply = function(x, n, FUN, ..., fill=NA, align=c("right","left","center"), adaptive=FALSE, partial=FALSE, give.names=FALSE) {
   FUN = match.fun(FUN)
-  align = match.arg(align)
   rho = new.env()
-  ans = .Call(CfrollapplyR, FUN, x, n, fill, align, rho)
-  ans
+  froll(FUN=FUN, rho=rho, x=x, n=n, fill=fill, align=align, adaptive=adaptive, partial=partial, give.names=give.names)
 }
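
In the new `froll()`, the `leftadaptive` branch handles `adaptive=TRUE` with `align="left"` by reversing `x` and `n`, computing a right-aligned result, and reversing the answer. The sketch below checks that equivalence on a tiny input using naive base R helpers. The names `amax_right`, `amax_left_direct` and `amax_left_via_rev` are hypothetical and exist only for this illustration; they do not call into data.table.

```r
# Naive right-aligned adaptive rolling max: window i covers x[(i-n[i]+1):i]
amax_right = function(x, n, fill=NA_real_) {
  vapply(seq_along(x), function(i) {
    if (n[i] > i) fill else max(x[(i - n[i] + 1L):i])
  }, numeric(1))
}

# Naive left-aligned adaptive rolling max: window i covers x[i:(i+n[i]-1)]
amax_left_direct = function(x, n, fill=NA_real_) {
  len = length(x)
  vapply(seq_along(x), function(i) {
    if (i + n[i] - 1L > len) fill else max(x[i:(i + n[i] - 1L)])
  }, numeric(1))
}

# Same reversal idea as the leftadaptive branch: reverse inputs, roll right-aligned, reverse back
amax_left_via_rev = function(x, n, fill=NA_real_) {
  rev(amax_right(rev(x), rev(n), fill))
}

x = c(5, 1, 4, 2, 3)
n = c(2L, 3L, 1L, 2L, 2L)
direct  = amax_left_direct(x, n)   # 5 4 4 3 NA
via_rev = amax_left_via_rev(x, n)  # same values, obtained through the reversal
```

Reversal works here because a left-aligned window starting at position `i` becomes a right-aligned window ending at position `length(x) - i + 1` once both `x` and `n` are reversed.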
