  Slow="548410d23dd74b625e8ea9aeb1a5d2e9dddd2927", # Parent of the first commit in the PR (https://github.com/Rdatatable/data.table/commit/548410d23dd74b625e8ea9aeb1a5d2e9dddd2927)
  Fast="c0b32a60466bed0e63420ec105bc75c34590865e"), # Commit in the PR (https://github.com/Rdatatable/data.table/pull/7144/commits) that uses a much faster implementation

- tests=extra.test.list)
+ # Regression introduced in #7404 (grouped by factor).
+ "DT[by] max regression fixed in #7480"=atime::atime_test(
- Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 2 NOTEs"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (non-API calls, V8 package) but ", shQuote(l)) else q("no")'
+ Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 1 NOTE"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (V8 package) but ", shQuote(l)) else q("no")'

  ## R-devel on Linux clang
  # R compiled with clang, flags removed: -flto=auto -fopenmp
@@ -206,7 +206,7 @@ test-lin-dev-clang-cran:
  - R CMD check --as-cran $(ls -1t data.table_*.tar.gz | head -n 1)
- Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 2 NOTEs"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (non-API calls, V8 package) but ", shQuote(l)) else q("no")'
+ Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 1 NOTE"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (V8 package) but ", shQuote(l)) else q("no")'
NEWS.md (10 additions & 2 deletions)
@@ -338,7 +338,7 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T

  19. Ellipsis elements like `..1` are correctly excluded when searching for variables in "up-a-level" syntax inside `[`, [#5460](https://github.com/Rdatatable/data.table/issues/5460). Thanks @ggrothendieck for the report and @MichaelChirico for the fix.

- 20. `forderv` could segfault on keys with long runs of identical bytes (e.g., many duplicate columns) because the single-group branch tail-recursed radix-by-radix until the C stack ran out, [#4300](https://github.com/Rdatatable/data.table/issues/4300). This is a major problem since sorting is extensively used in `data.table`. Thanks @quantitative-technologies for the report and @ben-schwen for the fix.
+ 20. `forderv` could segfault on keys with long runs of identical bytes because the single-group branch tail-recursed radix-by-radix until the C stack ran out. This affected both integer/numeric sorting with many duplicate columns ([#4300](https://github.com/Rdatatable/data.table/issues/4300)) and character sorting with long common prefixes ([#7462](https://github.com/Rdatatable/data.table/issues/7462)). This is a major problem since sorting is extensively used in `data.table`. Thanks @quantitative-technologies and @DavisVaughan for the reports, and @ben-schwen for the fix.

  21. `[` now preserves existing key(s) when new columns are added before them, instead of incorrectly setting a new column as key, [#7364](https://github.com/Rdatatable/data.table/issues/7364). Thanks @czeildi for the bug report and the fix.

@@ -350,7 +350,11 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T

  25. By-group operations on missing rows (e.g. `foo[c(i, NA), bar, by=grp]`) now avoid leaving in data from the previous groups, [#7442](https://github.com/Rdatatable/data.table/issues/7442). Thanks @aitap for the report and the fix.

- 26. `rbindlist()` now avoids the crash when working with many non-UTF-8 column names, [#7452](https://github.com/Rdatatable/data.table/issues/7452). Thanks @aitap for the report and the fix.
+ 26. Grouping by a factor with many groups is now fast again, fixing a timing regression introduced in [#6890](https://github.com/Rdatatable/data.table/pull/6890) where UTF-8 coercion and level remapping were performed unnecessarily, [#7404](https://github.com/Rdatatable/data.table/issues/7404). Thanks @ben-schwen for the report and fix.
+
+ 27. `dogroups()` no longer reads beyond the resized end of over-allocated data.table list columns, [#7486](https://github.com/Rdatatable/data.table/issues/7486). While this didn't crash in practice, it is now explicitly checked for in recent R versions (r89198+). Thanks @TimTaylor and @aitap for the report and @aitap for the fix.
+
+ 28. `rbindlist()` now avoids the crash when working with many non-UTF-8 column names, [#7452](https://github.com/Rdatatable/data.table/issues/7452). Thanks @aitap for the report and the fix.

  ### NOTES

@@ -379,6 +383,8 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T

  8. Retain important information in the error message about the source of the error when `i=` fails, e.g. pointing to `charToDate()` failing in `DT[date_col == "20250101"]`, [#7444](https://github.com/Rdatatable/data.table/issues/7444). Thanks @jan-swissre for the report and @MichaelChirico for the fix.

+ 9. Internal use of declared non-API R functions `SETLENGTH`, `TRUELENGTH`, `SET_TRUELENGTH`, and `SET_GROWABLE_BIT` has been eliminated. Most usages have been migrated to R's experimental resizable vectors API (thanks to @ltierney, introduced in R 4.6.0, backported for older R versions), [#7451](https://github.com/Rdatatable/data.table/pull/7451). Uses of `TRUELENGTH` for marking seen items during grouping and binding operations (aka free hash table trick) have been replaced with proper hash tables, [#6694](https://github.com/Rdatatable/data.table/pull/6694). The new hash table implementation uses linear probing with power of 2 tables and automatic resizing. Additionally, `chmatch()` now hashes the needle (`x`) instead of the haystack (`table`) when `length(table) >> length(x)`, significantly improving performance for lookups into large tables. We've benchmarked the refactored code and find the performance satisfactory, but please do report any edge case performance regressions we may have missed. Thanks to @aitap, @ben-schwen, @jangorecki and @HughParsonage for implementation and reviews.
+
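
For readers unfamiliar with the technique named in note 9, here is a minimal base R sketch of a linear-probing hash table with a power-of-2 capacity and automatic resizing. It is not data.table's C implementation: every name below (`new_table`, `find_slot`, `grow`, `insert`, `lookup`) is hypothetical, keys are assumed to be non-negative integers, and the hash is deliberately trivial.

```r
# Illustrative sketch only: all names are hypothetical, and this is not
# data.table's C code. Keys are assumed to be non-negative integers.
new_table = function(capacity = 8L) {
  stopifnot(capacity > 0L, bitwAnd(capacity, capacity - 1L) == 0L) # power of 2
  tab = new.env()
  tab$keys = rep(NA_integer_, capacity) # NA marks an empty slot
  tab$vals = rep(NA_integer_, capacity)
  tab$n = 0L
  tab
}

# Linear probing: start at hash(key) and step forward until the key or an
# empty slot is found. With a power-of-2 capacity, "mod capacity" is a bit mask.
find_slot = function(tab, key) {
  mask = length(tab$keys) - 1L
  i = bitwAnd(key, mask) # trivial hash, for illustration only
  while (!is.na(tab$keys[i + 1L]) && tab$keys[i + 1L] != key)
    i = bitwAnd(i + 1L, mask)
  i + 1L # R vectors are 1-based
}

# Double the capacity and re-insert every stored key.
grow = function(tab) {
  old_keys = tab$keys
  old_vals = tab$vals
  tab$keys = rep(NA_integer_, 2L * length(old_keys))
  tab$vals = rep(NA_integer_, 2L * length(old_vals))
  tab$n = 0L
  for (j in which(!is.na(old_keys))) insert(tab, old_keys[j], old_vals[j])
}

insert = function(tab, key, val) {
  # Keep the load factor at or below 1/2 so probing always finds an empty slot.
  if (2L * (tab$n + 1L) > length(tab$keys)) grow(tab)
  s = find_slot(tab, key)
  if (is.na(tab$keys[s])) tab$n = tab$n + 1L
  tab$keys[s] = key
  tab$vals[s] = val
  invisible(tab)
}

lookup = function(tab, key) tab$vals[find_slot(tab, key)] # NA if absent

tab = new_table()
for (k in 1:20) insert(tab, k, k * 10L)
lookup(tab, 13L)
```

The table lives in an environment so that `insert()` can modify it in place. Capping the load factor at 1/2 is a choice made for this sketch; the note does not say which threshold the real implementation uses.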
  ## data.table [v1.17.8](https://github.com/Rdatatable/data.table/milestone/41) (6 July 2025)

  1. Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, [#7070](https://github.com/Rdatatable/data.table/pull/7070).
@@ -552,6 +558,8 @@ rowwiseDT(

  22. `fread()` could fail to read Mac CSV files (with `\r` line endings) if the file contained any `\n` character, such as a final `\r\n`. This was fixed by detecting the predominant line ending in a sample of the file, [#4186](https://github.com/Rdatatable/data.table/issues/4186). Thanks to @MPagel for the report and @ben-schwen for the fix.

+ 23. By reference assignments (':=') with functions that modified the data.table by reference e.g. (`foo=function(DT){modify(DT);return(1L)}`, `DT[,a:=foo(DT)]`) returned a malformed data.table due to the modification of the targeted named column index ("a") during the j expression evaluation [#6768](https://github.com/Rdatatable/data.table/issues/6768). Thanks @AntonNM for the report and fix.
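
Item 22 describes its fix as detecting the predominant line ending in a sample of the file. The idea can be sketched in base R as below. This is only an illustration: `count_fixed()` and `detect_eol()` are hypothetical helpers, not part of data.table, and `fread()`'s actual detection is done in C and may differ in detail.

```r
# Hypothetical helpers for illustration; not fread()'s actual implementation.

# Count non-overlapping literal occurrences of `pattern` in one string.
count_fixed = function(pattern, txt) {
  hits = gregexpr(pattern, txt, fixed = TRUE)[[1L]]
  if (hits[1L] == -1L) 0L else length(hits)
}

# Return whichever of "\r\n", "\r", "\n" occurs most often in the sample.
detect_eol = function(sample_txt) {
  crlf = count_fixed("\r\n", sample_txt)
  counts = c(crlf,
             count_fixed("\r", sample_txt) - crlf, # lone \r (classic Mac)
             count_fixed("\n", sample_txt) - crlf) # lone \n (Unix)
  c("\r\n", "\r", "\n")[which.max(counts)]
}

# A Mac-style CSV whose last line happens to end in \r\n, as in the report.
mac_csv = "a,b\r1,2\r3,4\r\n"
detect_eol(mac_csv)
```

Counting lone `\r` and lone `\n` separately from `\r\n` keeps one stray `\r\n` from being read as evidence for both of the other styles.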
- # Adding new column(s). TO DO: move after the first eval in case the jsub has an error.
+ # Adding new column(s). Allocation for columns and recalculation of target cols moved after the jval = eval(jsub)
+ # in case of error or by-reference modifications to the DT
  newnames = setdiff(lhs, names_x)
  m[is.na(m)] = ncol(x)+seq_along(newnames)
  cols = as.integer(m)
  # don't pass verbose to selfrefok here -- only activated when
- # ok=-1 which will trigger setalloccol with verbose in the next
- # branch, which again calls _selfrefok and returns the message then
- if ((ok<-selfrefok(x, verbose=FALSE))==0L) # ok==0 so no warning when loaded from disk (-1) [-1 considered TRUE by R]
+ # ok=-1 which will trigger setalloccol with verbose after
+ # the jval = eval(jsub, ...)
+ if (ok==0L) # ok==0 so no warning when loaded from disk (-1) [-1 considered TRUE by R]
  if (is.data.table(x)) warningf("A shallow copy of this data.table was taken so that := can add or remove %d columns by reference. At an earlier point, this data.table was copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. It's also not unusual for data.table-agnostic packages to produce tables affected by this issue. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.", length(newnames))
- # !is.data.table for DF |> DT(,:=) tests 2212.16-19 (#5113) where a shallow copy is routine for data.frame
- if ((ok<1L) || (truelength(x) < ncol(x)+length(newnames))) {
- DT = x # in case getOption contains "ncol(DT)" as it used to. TODO: warn and then remove
- n = length(newnames) + eval(getOption("datatable.alloccol")) # TODO: warn about expressions and then drop the eval()
- # i.e. reallocate at the size as if the new columns were added followed by setalloccol().
- name = substitute(x)
- if (is.name(name) && ok && verbose) { # && NAMED(x)>0 (TO DO) # ok here includes -1 (loaded from disk)
- catf("Growing vector of column pointers from truelength %d to %d. A shallow copy has been taken, see ?setalloccol. Only a potential issue if two variables point to the same data (we can't yet detect that well) and if not you can safely ignore this. To avoid this message you could setalloccol() first, deep copy first using copy(), wrap with suppressWarnings() or increase the 'datatable.alloccol' option.\n", truelength(x), n)
- # #1729 -- copying to the wrong environment here can cause some confusion
- if (ok==-1L) catf("Note that the shallow copy will assign to the environment from which := was called. That means for example that if := was called within a function, the original table may be unaffected.\n")
-
- # Verbosity should not issue warnings, so cat rather than warning.
- # TO DO: Add option 'datatable.pedantic' to turn on warnings like this.
-
- # TO DO ... comments moved up from C ...
- # Note that the NAMED(dt)>1 doesn't work because .Call
- # always sets to 2 (see R-ints), it seems. Work around
- # may be possible but not yet working. When the NAMED test works, we can drop allocwarn argument too
- # because that's just passed in as FALSE from [<- where we know `*tmp*` isn't really NAMED=2.
- # Note also that this growing will happen for missing columns assigned NULL, too. But so rare, we
- # don't mind.
- }
- setalloccol(x, n, verbose=verbose) # always assigns to calling scope; i.e. this scope
+ (truelength(x) < ncol(x)+length(newnames)) # not enough space for new columns
+ )
+ ) {
+ DT = x # in case getOption contains "ncol(DT)" as it used to. TODO: warn and then remove
+ n = length(newnames) + eval(getOption("datatable.alloccol")) # TODO: warn about expressions and then drop the eval()
+ # i.e. reallocate at the size as if the new columns were added followed by setalloccol().
+ name = substitute(x)
+ if (is.name(name) && ok && verbose) { # && NAMED(x)>0 (TO DO) # ok here includes -1 (loaded from disk)
+ catf("Growing vector of column pointers from truelength %d to %d. A shallow copy has been taken, see ?setalloccol. Only a potential issue if two variables point to the same data (we can't yet detect that well) and if not you can safely ignore this. To avoid this message you could setalloccol() first, deep copy first using copy(), wrap with suppressWarnings() or increase the 'datatable.alloccol' option.\n", truelength(x), n)
+ # #1729 -- copying to the wrong environment here can cause some confusion
+ if (ok==-1L) catf("Note that the shallow copy will assign to the environment from which := was called. That means for example that if := was called within a function, the original table may be unaffected.\n")
+
+ # Verbosity should not issue warnings, so cat rather than warning.
+ # TO DO: Add option 'datatable.pedantic' to turn on warnings like this.
+
+ # TO DO ... comments moved up from C ...
+ # Note that the NAMED(dt)>1 doesn't work because .Call
+ # always sets to 2 (see R-ints), it seems. Work around
+ # may be possible but not yet working. When the NAMED test works, we can drop allocwarn argument too
+ # because that's just passed in as FALSE from [<- where we know `*tmp*` isn't really NAMED=2.
+ # Note also that this growing will happen for missing columns assigned NULL, too. But so rare, we
+ # don't mind.
+ }
+ setalloccol(x, n, verbose=verbose) # always assigns to calling scope; i.e. this scope
  # TODO add: if (max(len__)==nrow) stopf("There is no need to deep copy x in this case")
  # TODO move down to dogroup.c, too.
- SDenv$.SDall = .Call(CsubsetDT, x, if (length(len__)) seq_len(max(len__)) else 0L, xcols) # must be deep copy when largest group is a subset
+ SDenv$.SDall = .Call(CcopyAsGrowable, .Call(CsubsetDT, x, if (length(len__)) seq_len(max(len__)) else 0L, xcols)) # must be deep copy when largest group is a subset
  if (!is.data.table(SDenv$.SDall)) setattr(SDenv$.SDall, "class", c("data.table","data.frame")) # DF |> DT(,.SD[...],by=grp) needs .SD to be data.table, test 2022.012
  if (xdotcols) setattr(SDenv$.SDall, 'names', ansvars[xcolsAns]) # now that we allow 'x.' prefix in 'j', #2313 bug fix - [xcolsAns]