|
4 | 4 |
|
5 | 5 | ## data.table [v1.17.99](https://github.com/Rdatatable/data.table/milestone/35) (in development) |
6 | 6 |
|
| 7 | +### BREAKING CHANGE |
| 8 | + |
| 9 | +1. `dcast()` now errors when `fun.aggregate` returns length != 1 (consistent with documentation), regardless of `fill`, [#6629](https://github.com/Rdatatable/data.table/issues/6629). Previously, when `fill` was not `NULL`, `dcast` warned and returned an undefined result. This change has been planned since 1.16.0 (25 Aug 2024). |
| 10 | + |
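A minimal sketch of the behavior described in item 1, using made-up data (`DT`, `grp`, and `val` are illustrative names, not from the source). It assumes this development version is installed; the second call is only expected to error under the new behavior.

```r
library(data.table)

# Illustrative data: two rows share the cell id=1, grp="a"
DT = data.table(id = c(1L, 1L, 2L), grp = c("a", "a", "b"), val = c(1, 2, 3))

# sum() returns exactly one value per cell, as the documentation requires
wide = dcast(DT, id ~ grp, value.var = "val", fun.aggregate = sum)

# range() returns two values per cell; per the entry above this is expected
# to signal an error in this release even though `fill` is supplied
outcome = tryCatch(
  dcast(DT, id ~ grp, value.var = "val", fun.aggregate = range, fill = 0),
  error = function(e) "error"
)
```

Code that relied on the earlier warning should reduce each cell to a single value inside `fun.aggregate` instead.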
| 11 | +2. `melt()` returns an integer column for `variable` when `measure.vars` is a list of length=1, consistent with the documented behavior, [#5209](https://github.com/Rdatatable/data.table/issues/5209). Thanks to @tdhock for reporting. Any users who were relying on this behavior can change `measure.vars=list("col_name")` (output `variable` was column name, now is column index/integer) to `measure.vars="col_name"` (`variable` still is column name). This change has been planned since 1.16.0 (25 Aug 2024). |
| 12 | + |
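A short sketch of the migration suggested in item 2, with a hypothetical table `DT`. The output of the list form is deliberately not shown, since it depends on the installed version.

```r
library(data.table)

# Illustrative data with one measure column
DT = data.table(id = 1:2, col_name = c(10, 20))

# Character measure.vars: `variable` holds the column name
long_chr = melt(DT, id.vars = "id", measure.vars = "col_name")

# List measure.vars of length 1: per the entry above, `variable` is now
# a column index rather than the column name
long_lst = melt(DT, id.vars = "id", measure.vars = list("col_name"))
```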
| 13 | +3. Rolling functions `frollmean` and `frollsum` distinguish `Inf`/`-Inf` from `NA` to match the same rules as base R when `algo="fast"` (previously they were considered the same). If your input into those functions has `Inf` or `-Inf` then you will be affected by this change. As a result, the argument that controls the handling of `NA`s has been renamed from `hasNA` to `has.nf` (_has non-finite_). `hasNA` continues to work with a warning, for now. |
| 14 | + ```r |
| 15 | + ## before |
| 16 | + frollsum(c(1,2,3,Inf,5,6), 2) |
| 17 | + #[1] NA 3 5 NA NA 11 |
| 18 | + |
| 19 | + ## now |
| 20 | + frollsum(c(1,2,3,Inf,5,6), 2) |
| 21 | + #[1] NA 3 5 Inf Inf 11 |
| 22 | + ``` |
| 23 | +4. The result of `frollapply` is no longer coerced to numeric. User code may break if it depends on forced coercion of input/output to numeric type. |
| 24 | + ```r |
| 25 | + ## before |
| 26 | + frollapply(c(F,T,F,F,F,T), 2, any) |
| 27 | + #[1] NA 1 1 0 0 1 |
| 28 | +
|
| 29 | + ## now |
| 30 | + frollapply(c(F,T,F,F,F,T), 2, any) |
| 31 | + #[1] NA TRUE TRUE FALSE FALSE TRUE |
| 32 | + ``` |
| 33 | + Additionally, argument names in `frollapply` have been renamed from `x` to `X` and `n` to `N` to avoid conflicts with common argument names that may be passed to `...`, aligning with the base R API of `lapply`. `x` and `n` continue to work with a warning, for now. |
| 34 | +
|
| 35 | +5. Negative and missing values in the `n` argument of adaptive rolling functions now trigger an error. |
| 36 | +
|
7 | 37 | ### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES |
8 | 38 |
|
9 | 39 | 1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`. |
|
63 | 93 |
|
64 | 94 | 12. New `cbindlist()` and `setcbindlist()` for concatenating a `list` of data.tables column-wise, evocative of the analogous `do.call(rbind, l)` <-> `rbindlist(l)`, [#2576](https://github.com/Rdatatable/data.table/issues/2576). `setcbindlist()` does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning. |
65 | 95 |
|
| 96 | + ```r |
| 97 | + l = list( |
| 98 | + data.table(id = 1:3, a = letters[1:3]), |
| 99 | + data.table(b = 4:6, c = 7:9) |
| 100 | + ) |
| 101 | + cbindlist(l) |
| 102 | + # id a b c |
| 103 | + # 1: 1 a 4 7 |
| 104 | + # 2: 2 b 5 8 |
| 105 | + # 3: 3 c 6 9 |
| 106 | + ``` |
| 107 | +
|
66 | 108 | 13. New `mergelist()` and `setmergelist()` similarly work _a la_ `Reduce()` to recursively merge a `list` of data.tables, [#599](https://github.com/Rdatatable/data.table/issues/599). Different join modes (_left_, _inner_, _full_, _right_, _semi_, _anti_, and _cross_) are supported through the `how` argument; duplicate handling goes through the `mult` argument. `setmergelist()` carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning. |
67 | 109 |
|
| 110 | + ```r |
| 111 | + l = list( |
| 112 | + data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c")), |
| 113 | + data.table(id = c(1L, 2L, 4L), y = c("d", "e", "f")), |
| 114 | + data.table(id = c(1L, 3L, 4L), z = c("g", "h", "i")) |
| 115 | + ) |
| 116 | +
|
| 117 | + # Recursive inner join |
| 118 | + mergelist(l, on = "id", how = "inner") |
| 119 | + # id x y z |
| 120 | + # 1: 1 a d g |
| 121 | +
|
| 122 | + # Recursive left join (the default 'how') |
| 123 | + mergelist(l, on = "id", how = "left") |
| 124 | + # id x y z |
| 125 | + # 1: 1 a d g |
| 126 | + # 2: 2 b e <NA> |
| 127 | + # 3: 3 c <NA> h |
| 128 | + ``` |
| 129 | +
|
68 | 130 | 14. `fcoalesce()` and `setcoalesce()` gain `nan` argument to control whether `NaN` values should be treated as missing (`nan=NA`, the default) or non-missing (`nan=NaN`), [#4567](https://github.com/Rdatatable/data.table/issues/4567). This provides full compatibility with `nafill()` behavior. Thanks to @ethanbsmith for the feature request and @Mukulyadav2004 for the implementation. |
69 | 131 |
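A minimal sketch of the `nan` argument from item 14, assuming this development version is installed (the argument is new here). The vector `x` is illustrative.

```r
library(data.table)

x = c(1, NaN, NA)

# Default (nan=NA): NaN is treated as missing, so it is eligible for replacement
filled_default = fcoalesce(x, 0)

# nan=NaN: per the entry above, NaN counts as non-missing and only NA is replaced
filled_keep_nan = fcoalesce(x, 0, nan = NaN)
```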
|
70 | 132 | 15. New function `isoyear()` has been implemented as a complement to `isoweek()`, returning the ISO 8601 year corresponding to a given date, [#7154](https://github.com/Rdatatable/data.table/issues/7154). Thanks to @ben-schwen and @MichaelChirico for the suggestion and @venom1204 for the implementation. |
71 | 133 |
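A short sketch of `isoyear()` next to `isoweek()` around a year boundary, assuming this development version is installed. The dates are illustrative, and the base R `"%G"` format is used only as an independent cross-check.

```r
library(data.table)

# Dates straddling ISO year boundaries
d = as.IDate(c("2019-12-29", "2019-12-30", "2021-01-03"))

iso_week = isoweek(d)
iso_year = isoyear(d)

# Cross-check with base R's ISO 8601 week-based year
base_year = as.integer(format(as.Date(d), "%G"))
```

Pairing the two functions avoids mislabelling dates such as 2019-12-30, which fall in week 1 of the following ISO year.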
|
| 134 | +16. Multiple improvements have been added to rolling functions. The request came from @gpierard, who needed a left-aligned, adaptive rolling max, [#5438](https://github.com/Rdatatable/data.table/issues/5438). There was no `frollmax` function yet, adaptive rolling functions did not support `align="left"`, and `frollapply` did not support `adaptive=TRUE`. The available alternatives were base R `mapply` or a self-join using `max` and grouping `by=.EACHI`. As a follow-up to this request, the following features have been added: |
| 135 | + - new function `frollmax`, applies `max` over a rolling window. |
| 136 | + - support for `align="left"` for adaptive rolling functions. |
| 137 | + - support for `adaptive=TRUE` in `frollapply`. |
| 138 | + - `partial` argument to trim window width to available observations rather than returning `NA` whenever window is not complete. |
| 139 | + - `give.names` argument that can be used to automatically give the names based on the names of `x` and `n`. |
| 140 | + - `frollmean` and `frollsum` no longer treat `Inf` and `-Inf` as `NA`s, as they used to for `algo="fast"` (breaking change). |
| 141 | + - `hasNA` argument has been renamed to `has.nf` to convey that it is not only related to `NA/NaN` but other non-finite values (`Inf/-Inf`) as well. |
| 142 | +
|
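The features listed above can be sketched on a small vector as follows. This assumes the development version is installed; `x` and the window widths `w` are illustrative values, and outputs are omitted because they depend on that version.

```r
library(data.table)

x = c(1, 3, 2, 5, 4)

# Baseline: right-aligned rolling mean over 3 observations
m = frollmean(x, 3)

# New: rolling maximum over 3 observations
mx = frollmax(x, 3)

# New: partial=TRUE trims leading windows to the available observations
mp = frollmean(x, 3, partial = TRUE)

# New: adaptive, left-aligned windows; w[i] is the width starting at position i
w = c(2, 3, 1, 2, 1)
ml = frollmax(x, w, adaptive = TRUE, align = "left")
```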
| 143 | + Thanks to @jangorecki for the implementation and @MichaelChirico and others for work on splitting into smaller PRs and reviews. |
| 144 | + For a comprehensive description of all available features, see the `?froll` manual. |
| 145 | +
|
| 146 | + Adaptive `frollmax` has been observed to be around 80 times faster than the second-fastest solution (data.table self-join using `max` and grouping `by=.EACHI`). Note that an important factor in performance is the width of the rolling window. The code for the benchmark below was taken from [this SO answer](https://stackoverflow.com/a/73408459/2490497). |
| 147 | + ```r |
| 148 | + set.seed(108) |
| 149 | + setDTthreads(16) |
| 150 | + x = data.table( |
| 151 | + value = cumsum(rnorm(1e6, 0.1)), |
| 152 | + end_window = 1:1e6 + sample(50:500, 1e6, TRUE), |
| 153 | + row = 1:1e6 |
| 154 | + )[, "end_window" := pmin(end_window, .N) |
| 155 | + ][, "len_window" := end_window-row+1L] |
| 156 | + baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)] |
| 157 | + sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1 |
| 158 | + frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)] |
| 159 | + frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")] |
| 160 | + microbenchmark::microbenchmark( |
| 161 | + baser(x), sj(x), frmax(x), frapply(x), |
| 162 | + times=10, check="identical" |
| 163 | + ) |
| 164 | + #Unit: milliseconds |
| 165 | + # expr min lq mean median uq max neval |
| 166 | + # baser(x) 3094.88357 3097.84966 3186.74832 3163.58050 3251.66753 3370.33785 10 |
| 167 | + # sj(x) 2221.55456 2255.12083 2306.61382 2303.47883 2346.70293 2412.62975 10 |
| 168 | + # frmax(x) 17.45124 24.16809 28.10062 28.58153 32.79802 34.83941 10 |
| 169 | + # frapply(x) 272.07830 316.47060 366.94771 396.23566 416.06699 421.38701 10 |
| 170 | + ``` |
| 171 | +
|
| 172 | + As of now, adaptive rolling max has no _on-line_ implementation (`algo="fast"`); it uses a naive approach (`algo="exact"`). Therefore, a further speed-up is still possible if `algo="fast"` gets implemented. |
| 173 | +
|
| 174 | +17. Function `frollapply` has been completely rewritten. Thanks to @jangorecki for the implementation. Be sure to read the `frollapply` manual before using the function. The changes are as follows: |
| 175 | + - all basic types are now supported on input/output, not only double. Users' code could possibly break if it depends on forced coercion of input/output to double type. |
| 176 | + - new argument `by.column` allows passing a multi-column subset of a data.table into a rolling function, closes [#4887](https://github.com/Rdatatable/data.table/issues/4887). |
| 177 | + ```r |
| 178 | + x = data.table(v1=rnorm(120), v2=rnorm(120)) |
| 179 | + f = function(x) coef(lm(v2 ~ v1, data=x)) |
| 180 | + coef.fill = c("(Intercept)"=NA_real_, "v1"=NA_real_) |
| 181 | + frollapply(x, 4, f, by.column=FALSE, fill=coef.fill) |
| 182 | + # (Intercept) v1 |
| 183 | + # 1: NA NA |
| 184 | + # 2: NA NA |
| 185 | + # 3: NA NA |
| 186 | + # 4: 0.65456931 0.3138012 |
| 187 | + # 5: -1.07977441 -2.0588094 |
| 188 | + #--- |
| 189 | + #116: 0.15828417 0.3570216 |
| 190 | + #117: -0.09083424 1.5494507 |
| 191 | + #118: -0.18345878 0.6424837 |
| 192 | + #119: -0.28964772 0.6116575 |
| 193 | + #120: -0.40598313 0.6112854 |
| 194 | + ``` |
| 195 | + - uses multiple CPU threads (on a decent OS); evaluation of UDF is inherently slow so this can be a great help. |
| 196 | + ```r |
| 197 | + x = rnorm(1e5) |
| 198 | + n = 500 |
| 199 | + setDTthreads(1) |
| 200 | + system.time( |
| 201 | + th1 <- frollapply(x, n, median, simplify=unlist) |
| 202 | + ) |
| 203 | + # user system elapsed |
| 204 | + # 3.078 0.005 3.084 |
| 205 | + setDTthreads(4) |
| 206 | + system.time( |
| 207 | + th4 <- frollapply(x, n, median, simplify=unlist) |
| 208 | + ) |
| 209 | + # user system elapsed |
| 210 | + # 2.453 0.135 0.897 |
| 211 | + all.equal(th1, th4) |
| 212 | + #[1] TRUE |
| 213 | + ``` |
| 214 | + |
| 215 | +18. New helper `frolladapt` to facilitate applying rolling functions over windows of fixed calendar-time width in irregularly-spaced data sets, thereby bypassing the need to "augment" such data with placeholder rows, [#3241](https://github.com/Rdatatable/data.table/issues/3241). Thanks to @jangorecki for implementation. |
| 216 | + ```r |
| 217 | + idx = as.Date("2025-09-05") + c(0,4,7,8,9,10,12,13,17) |
| 218 | + dt = data.table(index=idx, value=seq_along(idx)) |
| 219 | + dt |
| 220 | + # index value |
| 221 | + # <Date> <int> |
| 222 | + #1: 2025-09-05 1 |
| 223 | + #2: 2025-09-09 2 |
| 224 | + #3: 2025-09-12 3 |
| 225 | + #4: 2025-09-13 4 |
| 226 | + #5: 2025-09-14 5 |
| 227 | + #6: 2025-09-15 6 |
| 228 | + #7: 2025-09-17 7 |
| 229 | + #8: 2025-09-18 8 |
| 230 | + #9: 2025-09-22 9 |
| 231 | + dt[, c("rollmean3","rollmean3days") := list( |
| 232 | + frollmean(value, 3), |
| 233 | + frollmean(value, frolladapt(index, 3), adaptive=TRUE) |
| 234 | + )] |
| 235 | + dt |
| 236 | + # index value rollmean3 rollmean3days |
| 237 | + # <Date> <int> <num> <num> |
| 238 | + #1: 2025-09-05 1 NA NA |
| 239 | + #2: 2025-09-09 2 NA 2.0 |
| 240 | + #3: 2025-09-12 3 2 3.0 |
| 241 | + #4: 2025-09-13 4 3 3.5 |
| 242 | + #5: 2025-09-14 5 4 4.0 |
| 243 | + #6: 2025-09-15 6 5 5.0 |
| 244 | + #7: 2025-09-17 7 6 6.5 |
| 245 | + #8: 2025-09-18 8 7 7.5 |
| 246 | + #9: 2025-09-22 9 8 9.0 |
| 247 | + ``` |
| 248 | + |
72 | 249 | ### BUG FIXES |
73 | 250 |
|
74 | 251 | 1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR. |
|
103 | 280 |
|
104 | 281 | 16. `between()` is now more robust with `integer64` arguments. Combining small integer `x` with certain large `integer64` bounds no longer misinterprets the bounds as `double`; if a `double` bound cannot be losslessly converted into `integer64` for comparison with `integer64` `x`, an error is signalled instead of returning a wrong answer with a warning; [#7164](https://github.com/Rdatatable/data.table/issues/7164). Thanks @aitap for the bug report and the fix. |
105 | 282 |
|
| 283 | +17. `t1 - t2`, where one is an `IDate` and the other is a `Date`, is now consistent with the case where both are `IDate` or both are `Date`, [#4749](https://github.com/Rdatatable/data.table/issues/4749). Thanks @George9000 for the report and @MichaelChirico for the fix. |
| 284 | +
|
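A small sketch of the three subtraction cases mentioned in item 17, with illustrative dates. The class of the mixed result is not asserted here, since it depends on the installed version.

```r
library(data.table)

t1 = as.IDate("2025-01-10")
t2 = as.Date("2025-01-03")

# Mixed classes: the case addressed by this fix
d_mixed = t1 - t2

# Same-class references the mixed case is now meant to agree with
d_idate = t1 - as.IDate(t2)
d_date  = as.Date(t1) - t2
```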
| 285 | +18. `fwrite` now allows `dec` to be the same as `sep` for edge cases where only one will be written, e.g. 0-row or 1-column tables, [#7227](https://github.com/Rdatatable/data.table/issues/7227). Thanks @MichaelChirico for the report and @venom1204 for the fix. |
| 286 | +
|
| 287 | +19. Ellipsis elements like `..1` are correctly excluded when searching for variables in "up-a-level" syntax inside `[`, [#5460](https://github.com/Rdatatable/data.table/issues/5460). Thanks @ggrothendieck for the report and @MichaelChirico for the fix. |
| 288 | +
|
106 | 289 | ### NOTES |
107 | 290 |
|
108 | 291 | 1. The following in-progress deprecations have proceeded: |
|