Rdatatable · jangorecki · Aug 18, 2022 · Aug 19, 2022 · Aug 19, 2022 · Aug 19, 2022
@@ -51,6 +51,7 @@ S3method(cube, data.table)
 S3method(rollup, data.table)
 export(frollmean)
 export(frollsum)
+export(frollmax)
 export(frollapply)
 export(nafill)
 export(setnafill)

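For orientation on `export(frollmax)`: the simplest case of a rolling max (fixed window width, `align="right"`, incomplete windows filled with `NA`) can be sketched in base R. `roll_max_ref` is a hypothetical helper written only for this note. It is not part of data.table and does not reflect how the C implementation works.

```r
# Hypothetical reference helper, not part of data.table:
# right-aligned rolling max with a fixed window width n.
# Positions without a complete window receive `fill`.
roll_max_ref = function(x, n, fill = NA_real_) {
  stopifnot(is.numeric(x), length(n) == 1L, n >= 1L)
  vapply(seq_along(x), function(i) {
    if (i < n) fill else max(x[(i - n + 1L):i])
  }, numeric(1))
}

x = c(1, 3, 2, 5, 4)
res = roll_max_ref(x, 3)
print(res)
#[1] NA NA  3  5  5
```

A loop like this is quadratic in the worst case, so it is only useful for checking results on small inputs.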
@@ -4,8 +4,66 @@
 
 # data.table [v1.14.3](https://github.com/Rdatatable/data.table/milestone/20)  (in development)
 
+## POTENTIALLY BREAKING CHANGES
+
+1. Rolling functions `frollmean` and `frollsum` used to treat `Inf` and `-Inf` as `NA` when using the default `algo="fast"`. This has changed: infinite values are no longer treated as `NA`. If the input to those functions contains `Inf` or `-Inf`, you will be affected by this change, [#5441](https://github.com/Rdatatable/data.table/pull/5441).
+
 ## NEW FEATURES
 
+0. (needs to be moved after rebase anyway) Function `frollapply` has been completely rewritten. Be sure to read the `frollapply` manual before using the function.
+
+- All basic types are now supported on input/output, not only double. User code could break if it depends on the forced coercion of input/output to double.
+```r
+frollapply(c(F,T,F,F,F,T), 2, any)
+#[1]    NA  TRUE  TRUE FALSE FALSE  TRUE
+## used to be: NA,1,1,0,0,1
+```
+
+- new argument `by.column` allows passing a multi-column subset of a data.table into a rolling function, closes [#4887](https://github.com/Rdatatable/data.table/issues/4887).
+```r
+x = as.data.table(iris)
+flow = function(x) {
+  v1 = x[[1L]]
+  v2 = x[[2L]]
+  (v1[2L] - v1[1L] * (1+v2[2L])) / v1[1L]
+}
+x[, "flow" := frollapply(.(Sepal.Length, Sepal.Width), 2, flow, by.column=FALSE),
+  by = Species][]
+#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species      flow
+#            <num>       <num>        <num>       <num>    <fctr>     <num>
+#  1:          5.1         3.5          1.4         0.2    setosa        NA
+#  2:          4.9         3.0          1.4         0.2    setosa -3.039216
+#  3:          4.7         3.2          1.3         0.2    setosa -3.240816
+#  4:          4.6         3.1          1.5         0.2    setosa -3.121277
+#  5:          5.0         3.6          1.4         0.2    setosa -3.513043
+# ---
+#146:          6.7         3.0          5.2         2.3 virginica -3.000000
+#147:          6.3         2.5          5.0         1.9 virginica -2.559701
+#148:          6.5         3.0          5.2         2.0 virginica -2.968254
+#149:          6.2         3.4          5.4         2.3 virginica -3.446154
+#150:          5.9         3.0          5.1         1.8 virginica -3.048387
+```
+
+- uses multiple CPU threads; evaluating a UDF is inherently slow, so this can be a big help.
+```r
+x = rnorm(1e5)
+n = 500
+setDTthreads(1)
+system.time(
+  th1 <- frollapply(x, n, median, simplify=unlist)
+)
+#   user  system elapsed
+#  4.106   0.008   4.115
+setDTthreads(4)
+system.time(
+  th4 <- frollapply(x, n, median, simplify=unlist)
+)
+#   user  system elapsed
+#  5.778   0.140   1.498
+all.equal(th1, th4)
+#[1] TRUE
+```
+
 1. `nafill()` now applies `fill=` to the front/back of the vector when `type="locf|nocb"`, [#3594](https://github.com/Rdatatable/data.table/issues/3594). Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using `nafill(...,type='locf|nocb')` afterwards, please review `roll=`/`rollends=` which should achieve the same result in one step more efficiently. `nafill()` is for when filling-while-joining (i.e. `roll=`/`rollends=`/`nomatch=`) cannot be applied.
 
 2. `mean(na.rm=TRUE)` by group is now GForce optimized, [#4849](https://github.com/Rdatatable/data.table/issues/4849). Thanks to the [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.
@@ -296,6 +354,44 @@
 
 41. New function `%notin%` provides a convenient alternative to `!(x %in% y)`, [#4152](https://github.com/Rdatatable/data.table/issues/4152). Thanks to Jan Gorecki for suggesting and Michael Czekanski for the PR. `%notin%` uses half the memory because it computes the result directly as opposed to `!` which allocates a new vector to hold the negated result. If `x` is long enough to occupy more than half the remaining free memory, this can make the difference between the operation working, or failing with an out-of-memory error.
 
+42. Multiple improvements have been added to rolling functions. The request came from @gpierard, who needed a left-aligned, adaptive rolling max, [#5438](https://github.com/Rdatatable/data.table/issues/5438). There was no `frollmax` function yet, adaptive rolling functions did not support `align="left"`, and `frollapply` did not support `adaptive=TRUE`. The available alternatives were base R `mapply` or a self-join using `max` and grouping `by=.EACHI`. As a follow-up to this request, the following features have been added:
+- new function `frollmax`, applies `max` over a rolling window.
+- support for `align="left"` in adaptive rolling functions.
+- support for `adaptive=TRUE` in `frollapply`.
+- `partial` argument to trim the window width to the available observations rather than returning `NA` whenever the window is incomplete.
+- `give.names` argument to automatically name the results based on the names of `x` and `n`.
+- `frollmean` and `frollsum` no longer treat `Inf` and `-Inf` as `NA`, as they used to for `algo="fast"` (breaking change).
+- `hasNA` argument has been renamed to `has.nf` to convey that it relates not only to `NA/NaN` but to other non-finite values (`Inf/-Inf`) as well.
+
+For a comprehensive description of all available features, see the `?froll` manual.
+
+Adaptive `frollmax` has been observed to be up to 50 times faster than the second fastest solution (a data.table self-join using `max` and grouping `by=.EACHI`). Note that an important factor in performance is the width of the rolling window. The benchmark code below is taken from [this SO answer](https://stackoverflow.com/a/73408459/2490497).
+```r
+set.seed(108)
+setDTthreads(8)
+x = data.table(
+  value = cumsum(rnorm(1e6, 0.1)),
+  end_window = 1:1e6 + sample(50:500, 1e6, TRUE),
+  row = 1:1e6
+)[, "end_window" := pmin(end_window, .N)
+  ][, "len_window" := end_window-row+1L]
+
+baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)]
+sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1
+frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)]
+frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")]
+microbenchmark::microbenchmark(
+  baser(x), sj(x), frmax(x), frapply(x),
+  times=10, check="identical"
+)
+#Unit: milliseconds
+#       expr        min         lq      mean     median        uq       max neval
+#   baser(x) 5181.36076 5417.57505 5537.2929 5494.73652 5706.2721 5818.6627    10
+#      sj(x) 4608.28940 4627.57186 4792.4031 4785.35306 4856.4475 5054.3301    10
+#   frmax(x)   70.41253   75.28659   91.3774   91.40227  102.0248  116.8622    10
+# frapply(x)  713.23108  742.34657  865.2524  848.31641  965.3599 1114.0531    10
+```
+
 ## BUG FIXES
 
 1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.

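The `froll` wrapper in the R diff that follows supports `adaptive=TRUE` with `align="left"` by reversing `x` and `n`, running the right-aligned computation, and reversing the result. A minimal base R sketch of why that works is given here. All three helpers are hypothetical names introduced for illustration, and the direct left-aligned version exists only to check the reversal against.

```r
# Hypothetical base-R helpers, for illustration only (not data.table code).

# Right-aligned adaptive rolling max: window i covers x[(i - n[i] + 1):i].
aroll_max_right = function(x, n, fill = NA_real_) {
  stopifnot(length(x) == length(n))
  vapply(seq_along(x), function(i) {
    if (n[i] > i) fill else max(x[(i - n[i] + 1L):i])
  }, numeric(1))
}

# Left-aligned adaptive rolling max computed directly:
# window i covers x[i:(i + n[i] - 1)].
aroll_max_left = function(x, n, fill = NA_real_) {
  stopifnot(length(x) == length(n))
  len = length(x)
  vapply(seq_along(x), function(i) {
    if (i + n[i] - 1L > len) fill else max(x[i:(i + n[i] - 1L)])
  }, numeric(1))
}

# The same result obtained by reversing inputs and output.
aroll_max_left_rev = function(x, n, fill = NA_real_)
  rev(aroll_max_right(rev(x), rev(n), fill))

x = c(4, 1, 3, 5, 2)
n = c(2L, 3L, 1L, 2L, 2L)
direct = aroll_max_left(x, n)
viarev = aroll_max_left_rev(x, n)
print(direct)
#[1]  4  5  3  5 NA
print(identical(direct, viarev))
#[1] TRUE
```

The reversal keeps a single right-aligned code path at the cost of extra copies of `x`, `n` and the result.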
@@ -1,21 +1,143 @@
-froll = function(fun, x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
+## these two helpers do not quote argument names in errors because frollapply has them in uppercase
+partial2adaptive = function(x, n, align, adaptive) {
+  if (align=="center")
+    stopf("'partial' cannot be used together with align='center'")
+  if (is.list(x)) {
+    if (!is.data.frame(x) && length(unique(vapply(x, length, 0L)))!=1L) ## froll
+      stopf("'partial' does not support variable length of columns in x")
+    else if (all(vapply(x, is.data.frame, FALSE)) && length(unique(vapply(x, nrow, 0L)))!=1L) ## frollapply by.column=F, single DT already wrapped into list
+      stopf("'partial' does not support variable nrow of data.tables in x")
+  }
+  if (!adaptive) {
+    if (is.list(n))
+      stopf("n must be integer, list is accepted for adaptive TRUE")
+    else if (!is.numeric(n))
+      stopf("n must be integer vector")
+  } else if (!(is.numeric(n) || (is.list(n) && all(vapply(n, is.numeric, FALSE))))) {
+    stopf("n must be integer vector or list of integer vectors")
+  }
+  len = if (is.list(x)) {
+    if (is.data.frame(x[[1L]])) ## frollapply by.column
+      nrow(x[[1L]])
+    else
+      length(x[[1L]]) ## froll, this will work for both x list and x dt on input
+  } else length(x)
+  verbose = getOption("datatable.verbose")
+  if (!adaptive) {
+    n = as.list(n) ## test 6006.032
+    if (verbose)
+      cat("partial2adaptive: froll partial=TRUE trimming n and redirecting to adaptive=TRUE\n")
+    trimn = function(n, len, align) {
+      n = min(n, len)
+      if (align=="right")
+        c(seq_len(n), rep(n, len-n))
+      else
+        c(rep(n, len-n), rev(seq_len(n)))
+    }
+    sapply(n, len, align, FUN=trimn, simplify=FALSE)
+  } else {
+    if (!is.list(n)) n = list(n)
+    if (length(unique(vapply(n, length, 0L)))!=1L)
+      stopf("adaptive windows provided in n must not have different lengths")
+    if (length(n[[1L]]) != len)
+      stopf("length of vectors in x must match the length of adaptive window in n")
+    if (verbose)
+      cat("partial2adaptive: froll adaptive=TRUE and partial=TRUE trimming n\n")
+    triman = function(n, align) {
+      if (align=="right")
+        pmin(n, seq_along(n))
+      else
+        pmin(n, rev(seq_along(n)))
+    }
+    sapply(n, align, FUN=triman, simplify=FALSE)
+  }
+}
+make.roll.names = function(x.len, n.len, n, x.nm, n.nm, fun, adaptive) {
+  if (is.null(n.nm)) {
+    if (!adaptive) {
+      if (!is.numeric(n))
+        stopf("internal error: misuse of make.roll.names, n must be numeric for !adaptive") ## nocov
+      n.nm = paste0("roll", fun, as.character(as.integer(n)))
+    } else {
+      n.nm = paste0("aroll", fun, seq_len(n.len))
+    }
+  } else if (!length(n.nm) && !adaptive)
+    stopf("internal error: misuse of make.roll.names, non-null length 0 n is not possible for !adaptive") ## nocov
+  if (is.null(x.nm)) {
+    x.nm = paste0("V", seq_len(x.len))
+  }
+  ans = if (length(x.nm)) { ## is.list(x) && !is.data.frame(x)
+    if (length(n.nm)) { ## !adaptive || is.list(n)
+      paste(rep(x.nm, each=length(n.nm)), n.nm, sep="_")
+    } else { ## adaptive && is.numeric(n)
+      x.nm
+    }
+  } else { ## (by.column && is.atomic(x)) || (!by.column && is.data.frame(x))
+    if (length(n.nm)) { ## !adaptive || is.list(n)
+      n.nm
+    } else { ## adaptive && is.numeric(n)
+      NULL
+    }
+  }
+  if (!is.null(ans) && length(ans) != x.len*n.len)
+    stopf("internal error: make.roll.names generated names of wrong length") ## nocov
+  ans
+}
+
+froll = function(fun, x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA=NA) {
   stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
+  if (!missing(hasNA)) {
+    if (!is.na(has.nf))
+      stopf("hasNA is deprecated, use has.nf instead")
+    warning("hasNA is deprecated, use has.nf instead")
+    has.nf = hasNA
+  } # remove check on next major release
   algo = match.arg(algo)
   align = match.arg(align)
-  ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, hasNA, adaptive)
+  if (isTRUE(give.names)) {
+    orig = list(n=n, adaptive=adaptive)
+    xnam = if (is.list(x)) names(x) else character()
+    nnam = if (isTRUE(adaptive)) {
+      if (is.list(n)) names(n) else character()
+    } else names(n)
+    nx = if (is.list(x)) length(x) else 1L
+    nn = if (isTRUE(adaptive)) {
+      if (is.list(n)) length(n) else 1L
+    } else length(n)
+  }
+  if (isTRUE(partial)) {
+    n = partial2adaptive(x, n, align, adaptive)
+    adaptive = TRUE
+  } ## support for partial added in #5441
+  leftadaptive = isTRUE(adaptive) && align=="left"
+  if (leftadaptive) {
+    verbose = getOption("datatable.verbose")
+    rev2 = function(x) if (is.list(x)) sapply(x, rev, simplify=FALSE) else rev(x)
+    if (verbose)
+      cat("froll: adaptive=TRUE && align='left' pre-processing for align='right'\n")
+    x = rev2(x)
+    n = rev2(n)
+    align = "right"
+  } ## support for left adaptive added in #5441
+  ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, has.nf, adaptive)
+  if (leftadaptive) {
+    if (verbose)
+      cat("froll: adaptive=TRUE && align='left' post-processing from align='right'\n")
+    ans = rev2(ans)
+  }
+  if (isTRUE(give.names) && is.list(ans)) {
+    nms = make.roll.names(x.len=nx, n.len=nn, n=orig$n, x.nm=xnam, n.nm=nnam, fun=fun, adaptive=orig$adaptive)
+    setattr(ans, "names", nms)
+  }
   ans
 }
 
-frollmean = function(x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-  froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
+frollmean = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
+  froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
 }
-frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
-  froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
+frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
+  froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
 }
-frollapply = function(x, n, FUN, ..., fill=NA, align=c("right", "left", "center")) {
-  FUN = match.fun(FUN)
-  align = match.arg(align)
-  rho = new.env()
-  ans = .Call(CfrollapplyR, FUN, x, n, fill, align, rho)
-  ans
+frollmax = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA) {
+  froll(fun="max", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
 }
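For the non-adaptive case, `partial2adaptive` above turns a fixed window width into a vector of per-observation widths, trimmed where fewer observations are available. The sketch below restates that trimming step as a standalone function so it can be tried on its own. `trim_windows` is a hypothetical name, and the input validation and list handling of the real helper are omitted.

```r
# Hypothetical standalone restatement of the window-trimming step;
# mirrors the inner `trimn` helper but is not data.table code.
trim_windows = function(n, len, align = c("right", "left")) {
  align = match.arg(align)
  n = min(n, len)  # a window cannot be wider than the input
  if (align == "right")
    c(seq_len(n), rep(n, len - n))       # widths grow at the start
  else
    c(rep(n, len - n), rev(seq_len(n)))  # widths shrink at the end
}

right = trim_windows(3L, 5L, "right")
left  = trim_windows(3L, 5L, "left")
wide  = trim_windows(10L, 4L, "right")
print(right)
#[1] 1 2 3 3 3
print(left)
#[1] 3 3 3 2 1
print(wide)
#[1] 1 2 3 4
```

The resulting vector is what gets passed on as an adaptive window, which is why `froll` sets `adaptive = TRUE` after calling `partial2adaptive`.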