Closed
60 commits
cca5b79
frollmax exact, buggy fast, no fast adaptive
jangorecki Aug 18, 2022
14555b2
frollmax fast fixing bugs
jangorecki Aug 19, 2022
437b928
frollmax man to fix CRAN check
jangorecki Aug 19, 2022
2fd0faf
frollmax fast adaptive non NA, dev
jangorecki Aug 19, 2022
6c3201a
froll docs, adaptive left
jangorecki Aug 21, 2022
4a6f063
no frollmax fast adaptive
jangorecki Aug 21, 2022
feef63d
frollmax adaptive exact NAs handling
jangorecki Aug 21, 2022
63ea485
PR summary in news
jangorecki Aug 21, 2022
63f2e7d
align happens in one place, less duplicated code
jangorecki Aug 21, 2022
5341409
push up even more to frollR to reduce code duplication
jangorecki Aug 21, 2022
c4675be
frollapply push up align arg and early stopping up
jangorecki Aug 21, 2022
ccd5c43
typo fix in NEWS.md
jangorecki Aug 22, 2022
2494b97
keep R agnostic C code in froll.c, yet deduplicated
jangorecki Aug 22, 2022
77a01e7
new functionality unit tests
jangorecki Aug 22, 2022
311d12f
doc further improving
jangorecki Aug 25, 2022
2a54cfd
tests and NEWS
jangorecki Aug 25, 2022
5a52167
partial window support for rolling functions
jangorecki Aug 25, 2022
cab28b5
unit tests for partial corner cases
jangorecki Aug 25, 2022
161f9f5
frollapply adaptive
jangorecki Aug 25, 2022
fc69835
add frollapply to timings
jangorecki Aug 25, 2022
6c25f5f
fix missing break
jangorecki Aug 26, 2022
dd72ba0
frollmax non-adaptive, fast, exact, NAs
jangorecki Aug 26, 2022
fa36f41
fix wrong fun name in docs
jangorecki Aug 26, 2022
1e8117f
docs
jangorecki Aug 26, 2022
2f326de
more automated tests, check for SET_GROWABLE_BIT support
jangorecki Aug 27, 2022
911968f
eliminate TODOs
jangorecki Aug 27, 2022
844edc5
simplify frollmax adaptive
jangorecki Aug 27, 2022
f8a909a
docs
jangorecki Aug 27, 2022
d26aa5e
extend readme
jangorecki Aug 27, 2022
3c11ac0
readme tidy links
jangorecki Aug 27, 2022
d7bf748
bold names
jangorecki Aug 27, 2022
0385c31
mean exact no re-run NA-aware when Inf/-Inf present
jangorecki Aug 27, 2022
11dde58
handle Inf in froll mean and sum
jangorecki Aug 27, 2022
e99c381
has.nf, bring back warnings
jangorecki Aug 29, 2022
2b106ec
no need extra temp var here
jangorecki Aug 29, 2022
18b26bf
refresh timings in NEWS
jangorecki Aug 30, 2022
8a69058
deduplicate N arg handling
jangorecki Aug 30, 2022
fa796e3
minor rename
jangorecki Aug 30, 2022
a47c7a4
simplify even more k arg
jangorecki Aug 30, 2022
cb25135
remove debug line
jangorecki Sep 3, 2022
7184024
improve doc too mention adaptive rolling function to address #3241
jangorecki Sep 3, 2022
c5caf2c
fix code covr, thanks to Jim
jangorecki Sep 9, 2022
e3eaefe
fix nocov blocks
jangorecki Sep 10, 2022
1f66c43
fix codecov
jangorecki Sep 12, 2022
4218fa5
more clearly address #5306
jangorecki Sep 12, 2022
6277b71
codecov
jangorecki Sep 12, 2022
95d8cca
deduplicate partial and left adaptive logic
jangorecki Sep 12, 2022
c54ccce
give.names arg for rolling functions
jangorecki Sep 12, 2022
88a5c2d
give.names mention in news.md
jangorecki Sep 12, 2022
1661f03
fix conflict to partial preprocessing
jangorecki Sep 12, 2022
a642a99
deduplicate some code by using helper instead of macros
jangorecki Sep 14, 2022
c5417cd
simplify helper fun ansSetMsg
jangorecki Sep 14, 2022
9c4fbc8
fix codecov exposed after removing macros
jangorecki Sep 14, 2022
b3f81cd
support for adaptive and partial
jangorecki Sep 14, 2022
a14b486
added batch tests for adaptive and partial
jangorecki Sep 15, 2022
d096517
frollapply rewritten
jangorecki Oct 6, 2022
c475b5c
parallel is R core already so no need to specify it in suggests
jangorecki Oct 24, 2022
b097997
frollapply adaptive supports NAs in window, as other roll functions, …
jangorecki Oct 24, 2022
4d34f29
NEWS entry
jangorecki Oct 31, 2022
21da019
minor fix to setgrowable
jangorecki Nov 17, 2022
1 change: 1 addition & 0 deletions NAMESPACE
@@ -51,6 +51,7 @@ S3method(cube, data.table)
S3method(rollup, data.table)
export(frollmean)
export(frollsum)
export(frollmax)
export(frollapply)
export(nafill)
export(setnafill)
96 changes: 96 additions & 0 deletions NEWS.md
@@ -4,8 +4,66 @@

# data.table [v1.14.3](https://github.com/Rdatatable/data.table/milestone/20) (in development)

## POTENTIALLY BREAKING CHANGES

1. Rolling functions `frollmean` and `frollsum` used to treat `Inf` and `-Inf` as `NA` when using the default `algo="fast"`. This has changed: infinite values are no longer treated as `NA`. If your input to those functions contains `Inf` or `-Inf`, you will be affected by this change. [#5441](https://github.com/Rdatatable/data.table/pull/5441).
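
As a rough illustration of the new behaviour, here is a minimal base R reference for a right-aligned rolling mean. `roll_mean_ref` is a hypothetical helper written for this note, not part of data.table; it simply applies `mean` to each complete window, so a window containing `Inf` yields `Inf` rather than `NA`.

```r
# Hypothetical reference implementation (not data.table code):
# right-aligned rolling mean over complete windows, NA where the window is incomplete.
roll_mean_ref = function(x, n) {
  vapply(seq_along(x), function(i) {
    if (i < n) NA_real_ else mean(x[(i - n + 1L):i])
  }, numeric(1))
}

x = c(1, Inf, 3, 4, 5)
roll_mean_ref(x, 2L)
# [1]  NA Inf Inf 3.5 4.5
```

Per the entry above, the previous `algo="fast"` code path would have treated the two windows touching `Inf` as if they contained `NA`.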

## NEW FEATURES

0. (needs to be moved after rebase anyway) Function `frollapply` has been completely rewritten. Be sure to read the `frollapply` manual before using the function.

- All basic types are now supported on input/output, not only double. User code may break if it depends on forced coercion of input/output to double type.
```r
frollapply(c(F,T,F,F,F,T), 2, any)
#[1] NA TRUE TRUE FALSE FALSE TRUE
## used to be: NA,1,1,0,0,1
```

- new argument `by.column` allows passing a multi-column subset of a data.table into a rolling function, closes [#4887](https://github.com/Rdatatable/data.table/issues/4887).
```r
x = as.data.table(iris)
flow = function(x) {
v1 = x[[1L]]
v2 = x[[2L]]
(v1[2L] - v1[1L] * (1+v2[2L])) / v1[1L]
}
x[, "flow" := frollapply(.(Sepal.Length, Sepal.Width), 2, flow, by.column=FALSE),
by = Species][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species flow
# <num> <num> <num> <num> <fctr> <num>
# 1: 5.1 3.5 1.4 0.2 setosa NA
# 2: 4.9 3.0 1.4 0.2 setosa -3.039216
# 3: 4.7 3.2 1.3 0.2 setosa -3.240816
# 4: 4.6 3.1 1.5 0.2 setosa -3.121277
# 5: 5.0 3.6 1.4 0.2 setosa -3.513043
# ---
#146: 6.7 3.0 5.2 2.3 virginica -3.000000
#147: 6.3 2.5 5.0 1.9 virginica -2.559701
#148: 6.5 3.0 5.2 2.0 virginica -2.968254
#149: 6.2 3.4 5.4 2.3 virginica -3.446154
#150: 5.9 3.0 5.1 1.8 virginica -3.048387
```

- uses multiple CPU threads; evaluating a UDF is inherently slow, so this can be a big help.
```r
x = rnorm(1e5)
n = 500
setDTthreads(1)
system.time(
th1 <- frollapply(x, n, median, simplify=unlist)
)
# user system elapsed
# 4.106 0.008 4.115
setDTthreads(4)
system.time(
th4 <- frollapply(x, n, median, simplify=unlist)
)
# user system elapsed
# 5.778 0.140 1.498
all.equal(th1, th4)
#[1] TRUE
```

1. `nafill()` now applies `fill=` to the front/back of the vector when `type="locf|nocb"`, [#3594](https://github.com/Rdatatable/data.table/issues/3594). Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using `nafill(...,type='locf|nocb')` afterwards, please review `roll=`/`rollends=` which should achieve the same result in one step more efficiently. `nafill()` is for when filling-while-joining (i.e. `roll=`/`rollends=`/`nomatch=`) cannot be applied.

2. `mean(na.rm=TRUE)` by group is now GForce optimized, [#4849](https://github.com/Rdatatable/data.table/issues/4849). Thanks to the [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.
@@ -296,6 +354,44 @@

41. New function `%notin%` provides a convenient alternative to `!(x %in% y)`, [#4152](https://github.com/Rdatatable/data.table/issues/4152). Thanks to Jan Gorecki for suggesting and Michael Czekanski for the PR. `%notin%` uses half the memory because it computes the result directly as opposed to `!` which allocates a new vector to hold the negated result. If `x` is long enough to occupy more than half the remaining free memory, this can make the difference between the operation working, or failing with an out-of-memory error.

42. Multiple improvements have been added to rolling functions. The request came from @gpierard, who needed a left-aligned, adaptive rolling max, [#5438](https://github.com/Rdatatable/data.table/issues/5438). There was no `frollmax` function yet. Adaptive rolling functions did not support `align="left"`. `frollapply` did not support `adaptive=TRUE`. Available alternatives were base R `mapply` or a self-join using `max` and grouping `by=.EACHI`. As a follow-up to this request, the following features have been added:
- new function `frollmax`, applies `max` over a rolling window.
- support for `align="left"` for adaptive rolling functions.
- support for `adaptive=TRUE` in `frollapply`.
- `partial` argument to trim the window width to the available observations rather than returning `NA` whenever the window is not complete.
- `give.names` argument that can be used to automatically name the results based on the names of `x` and `n`.
- `frollmean` and `frollsum` no longer treat `Inf` and `-Inf` as `NA`s, as they used to for `algo="fast"` (breaking change).
- `hasNA` argument has been renamed to `has.nf` to convey that it relates not only to `NA/NaN` but to other non-finite values (`Inf/-Inf`) as well.

For a comprehensive description of all available features see the `?froll` manual.
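
To make the `partial` semantics more concrete, the following base R sketch trims a fixed window width at the start of the series and then computes a right-aligned adaptive max. Both helpers are hypothetical and written for illustration only; `trim_width_right` mirrors the right-aligned branch of the `trimn` helper in `R/froll.R`, and nothing here calls data.table.

```r
# Hypothetical helpers for illustration only (base R, not data.table code).
# Turn a fixed width n into per-row widths, trimmed where fewer than n
# observations are available at the start of the series.
trim_width_right = function(n, len) {
  n = min(n, len)
  c(seq_len(n), rep(n, len - n))
}

# Right-aligned adaptive rolling max: window i covers the n[i] values ending at i.
adaptive_max_right = function(x, n) {
  vapply(seq_along(x), function(i) max(x[(i - n[i] + 1L):i]), numeric(1))
}

x = c(1, 3, 2, 5, 4)
w = trim_width_right(3L, length(x))
w
# [1] 1 2 3 3 3
adaptive_max_right(x, w)
# [1] 1 3 3 5 5
```

Without trimming, the first two positions would have incomplete windows and would be filled with `NA` instead.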

Adaptive `frollmax` has been observed to be up to 50 times faster than the second fastest solution (data.table self-join using `max` and grouping `by=.EACHI`). Note that an important factor in performance is the width of the rolling window. Code for the benchmark below has been taken from [this SO answer](https://stackoverflow.com/a/73408459/2490497).
```r
set.seed(108)
setDTthreads(8)
x = data.table(
value = cumsum(rnorm(1e6, 0.1)),
end_window = 1:1e6 + sample(50:500, 1e6, TRUE),
row = 1:1e6
)[, "end_window" := pmin(end_window, .N)
][, "len_window" := end_window-row+1L]

baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)]
sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1
frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)]
frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")]
microbenchmark::microbenchmark(
baser(x), sj(x), frmax(x), frapply(x),
times=10, check="identical"
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# baser(x) 5181.36076 5417.57505 5537.2929 5494.73652 5706.2721 5818.6627 10
# sj(x) 4608.28940 4627.57186 4792.4031 4785.35306 4856.4475 5054.3301 10
# frmax(x) 70.41253 75.28659 91.3774 91.40227 102.0248 116.8622 10
# frapply(x) 713.23108 742.34657 865.2524 848.31641 965.3599 1114.0531 10
```
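
The benchmark relies on `adaptive=TRUE` together with `align="left"`. In `R/froll.R` this combination is handled by reversing `x` and `n`, running the right-aligned adaptive routine, and reversing the result. The base R sketch below checks that idea on a tiny input; both helper functions are hypothetical stand-ins, not data.table code, and the example assumes every window fits inside `x`.

```r
# Hypothetical base R sketch, not data.table code.
# Right-aligned adaptive max: window i covers the n[i] values ending at i.
adaptive_max_right = function(x, n) {
  vapply(seq_along(x), function(i) max(x[(i - n[i] + 1L):i]), numeric(1))
}
# Left-aligned adaptive max computed directly:
# window i covers the n[i] values starting at i.
adaptive_max_left = function(x, n) {
  vapply(seq_along(x), function(i) max(x[i:(i + n[i] - 1L)]), numeric(1))
}

x = c(1, 3, 2, 5, 4)
n = c(2L, 3L, 2L, 2L, 1L)   # assumed widths; every window fits inside x
direct   = adaptive_max_left(x, n)
reversed = rev(adaptive_max_right(rev(x), rev(n)))
identical(direct, reversed)
# [1] TRUE
direct
# [1] 3 5 5 5 4
```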

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
146 changes: 134 additions & 12 deletions R/froll.R
@@ -1,21 +1,143 @@
-froll = function(fun, x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
## these two helpers do not quote argument names in errors because frollapply has them in uppercase
partial2adaptive = function(x, n, align, adaptive) {
if (align=="center")
stopf("'partial' cannot be used together with align='center'")
if (is.list(x)) {
if (!is.data.frame(x) && length(unique(vapply(x, length, 0L)))!=1L) ## froll
stopf("'partial' does not support variable length of columns in x")
else if (all(vapply(x, is.data.frame, FALSE)) && length(unique(vapply(x, nrow, 0L)))!=1L) ## frollapply by.column=F, single DT already wrapped into list
stopf("'partial' does not support variable nrow of data.tables in x")
}
if (!adaptive) {
if (is.list(n))
stopf("n must be integer, list is accepted for adaptive TRUE")
else if (!is.numeric(n))
stopf("n must be integer vector")
} else if (!(is.numeric(n) || (is.list(n) && all(vapply(n, is.numeric, FALSE))))) {
stopf("n must be integer vector or list of integer vectors")
}
len = if (is.list(x)) {
if (is.data.frame(x[[1L]])) ## frollapply by.column
nrow(x[[1L]])
else
length(x[[1L]]) ## froll, this will work for both x list and x dt on input
} else length(x)
verbose = getOption("datatable.verbose")
if (!adaptive) {
n = as.list(n) ## test 6006.032
if (verbose)
cat("partial2adaptive: froll partial=TRUE trimming n and redirecting to adaptive=TRUE\n")
trimn = function(n, len, align) {
n = min(n, len)
if (align=="right")
c(seq_len(n), rep(n, len-n))
else
c(rep(n, len-n), rev(seq_len(n)))
}
sapply(n, len, align, FUN=trimn, simplify=FALSE)
} else {
if (!is.list(n)) n = list(n)
if (length(unique(vapply(n, length, 0L)))!=1L)
stopf("adaptive windows provided in n must not have different lengths")
if (length(n[[1L]]) != len)
stopf("length of vectors in x must match to length of adaptive window in n")
if (verbose)
cat("partial2adaptive: froll adaptive=TRUE and partial=TRUE trimming n\n")
triman = function(n, align) {
if (align=="right")
pmin(n, seq_along(n))
else
pmin(n, rev(seq_along(n)))
}
sapply(n, align, FUN=triman, simplify=FALSE)
}
}
make.roll.names = function(x.len, n.len, n, x.nm, n.nm, fun, adaptive) {
if (is.null(n.nm)) {
if (!adaptive) {
if (!is.numeric(n))
stopf("internal error: misuse of make.names, n must be numeric for !adaptive") ## nocov
n.nm = paste0("roll", fun, as.character(as.integer(n)))
} else {
n.nm = paste0("aroll", fun, seq_len(n.len))
}
} else if (!length(n.nm) && !adaptive)
stopf("internal error: misuse of make.names, non-null length 0 n is not possible for !adaptive") ## nocov
if (is.null(x.nm)) {
x.nm = paste0("V", seq_len(x.len))
}
ans = if (length(x.nm)) { ## is.list(x) && !is.data.frame(x)
if (length(n.nm)) { ## !adaptive || is.list(n)
paste(rep(x.nm, each=length(n.nm)), n.nm, sep="_")
} else { ## adaptive && is.numeric(n)
x.nm
}
} else { ## (by.column && is.atomic(x)) || (!by.column && is.data.frame(x))
if (length(n.nm)) { ## !adaptive || is.list(n)
n.nm
} else { ## adaptive && is.numeric(n)
NULL
}
}
if (!is.null(ans) && length(ans) != x.len*n.len)
stopf("internal error: make.names generated names of wrong length") ## nocov
ans
}

froll = function(fun, x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA=NA) {
stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
if (!missing(hasNA)) {
if (!is.na(has.nf))
stopf("hasNA is deprecated, use has.nf instead")
warning("hasNA is deprecated, use has.nf instead")
has.nf = hasNA
} # remove check on next major release
algo = match.arg(algo)
align = match.arg(align)
-ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, hasNA, adaptive)
if (isTRUE(give.names)) {
orig = list(n=n, adaptive=adaptive)
xnam = if (is.list(x)) names(x) else character()
nnam = if (isTRUE(adaptive)) {
if (is.list(n)) names(n) else character()
} else names(n)
nx = if (is.list(x)) length(x) else 1L
nn = if (isTRUE(adaptive)) {
if (is.list(n)) length(n) else 1L
} else length(n)
}
if (isTRUE(partial)) {
n = partial2adaptive(x, n, align, adaptive)
adaptive = TRUE
} ## support for partial added in #5441
leftadaptive = isTRUE(adaptive) && align=="left"
if (leftadaptive) {
verbose = getOption("datatable.verbose")
rev2 = function(x) if (is.list(x)) sapply(x, rev, simplify=FALSE) else rev(x)
if (verbose)
cat("froll: adaptive=TRUE && align='left' pre-processing for align='right'\n")
x = rev2(x)
n = rev2(n)
align = "right"
} ## support for left adaptive added in #5441
ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, has.nf, adaptive)
if (leftadaptive) {
if (verbose)
cat("froll: adaptive=TRUE && align='left' post-processing from align='right'\n")
ans = rev2(ans)
}
if (isTRUE(give.names) && is.list(ans)) {
nms = make.roll.names(x.len=nx, n.len=nn, n=orig$n, x.nm=xnam, n.nm=nnam, fun=fun, adaptive=orig$adaptive)
setattr(ans, "names", nms)
}
ans
}

-frollmean = function(x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
frollmean = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
}
-frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
}
-frollapply = function(x, n, FUN, ..., fill=NA, align=c("right", "left", "center")) {
-FUN = match.fun(FUN)
-align = match.arg(align)
-rho = new.env()
-ans = .Call(CfrollapplyR, FUN, x, n, fill, align, rho)
-ans
frollmax = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
froll(fun="max", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
}
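
The naming scheme applied when `give.names=TRUE` can be previewed with plain base R. The sketch below reuses the name-building expressions from `make.roll.names` for one case only: a non-adaptive call where `x` is a named list and `n` is unnamed. The column names and window widths are assumptions chosen for illustration.

```r
# Illustration only: reuses the name-building expressions from
# make.roll.names for a non-adaptive call on a named list x and unnamed n.
x.nm = c("a", "b")     # assumed column names of x
n    = c(2, 3)         # assumed window widths
fun  = "mean"
n.nm = paste0("roll", fun, as.character(as.integer(n)))
nms  = paste(rep(x.nm, each = length(n.nm)), n.nm, sep = "_")
nms
# [1] "a_rollmean2" "a_rollmean3" "b_rollmean2" "b_rollmean3"
```

The names vary fastest over `n`, so there is one name per combination of column and window width.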