diff --git a/NEWS.0.md b/NEWS.0.md index 4cca3ddaf1..c232dca26c 100644 --- a/NEWS.0.md +++ b/NEWS.0.md @@ -1072,6 +1072,8 @@ query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added. Thanks to many including vc273 and Y T for reporting [here](https://stackoverflow.com/questions/20349159/memory-leak-in-data-table-grouped-assignment-by-reference) and [here](https://stackoverflow.com/questions/15651515/slow-memory-leak-in-data-table-when-returning-named-lists-in-j-trying-to-reshap) on SO. 2. In long running computations where data.table is called many times repetitively the following error could sometimes occur, #2647: *"Internal error: .internal.selfref prot is not itself an extptr"*. Now fixed. Thanks to theEricStone, StevieP and JasonB for (difficult) reproducible examples [here](https://stackoverflow.com/questions/15342227/getting-a-random-internal-selfref-error-in-data-table-for-r). + See `?.internal.selfref` for more info about `.internal.selfref`. + 3. If `fread` returns a data error (such as no closing quote on a quoted field) it now closes the file first rather than holding a lock open, a Windows only problem. Thanks to nigmastar for reporting [here](https://stackoverflow.com/questions/18597123/fread-data-table-locks-files) and Carl Witthoft for the hint. Tests added. diff --git a/NEWS.md b/NEWS.md index 6d0e0c97f5..1f890d5535 100644 --- a/NEWS.md +++ b/NEWS.md @@ -751,6 +751,7 @@ rowwiseDT( 28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). 
`data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`. + 29. A segfault occurred when `nrow/throttle < nthread`, [#5077](https://github.com/Rdatatable/data.table/issues/5077). With the default throttle of 1024 rows (see `?setDTthreads`), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where `data.table` uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit `setDTthreads(throttle=nrow(DT))`. 30. `fread(file=URL)` now works rather than error `does not exist or is non-readable`, [#4952](https://github.com/Rdatatable/data.table/issues/4952). `fread(URL)` and `fread(input=URL)` worked before and continue to work. 
Thanks to @pnacht for reporting and @ben-schwen for the PR. diff --git a/man/assign.Rd b/man/assign.Rd index 71e9542302..797705ce7d 100644 --- a/man/assign.Rd +++ b/man/assign.Rd @@ -92,7 +92,7 @@ Since \code{[.data.table} incurs overhead to check the existence and type of arg \value{ \code{DT} is modified by reference and returned invisibly. If you require a copy, take a \code{\link{copy}} first (using \code{DT2 = copy(DT)}). } -\seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}} +\seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}}, \code{\link{.internal.selfref}} } \examples{ DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7) diff --git a/man/copy.Rd b/man/copy.Rd index 587f216805..c864486bd2 100644 --- a/man/copy.Rd +++ b/man/copy.Rd @@ -24,7 +24,7 @@ A \code{copy()} may be required when doing \code{dt_names = names(DT)}. Due to R Returns a copy of the object. 
} \seealso{ - \code{\link{data.table}}, \code{\link{address}}, \code{\link{setkey}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}} \code{\link{:=}}, \code{\link{setorder}}, \code{\link{setattr}}, \code{\link{setnames}} + \code{\link{data.table}}, \code{\link{address}}, \code{\link{setkey}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{.internal.selfref}} } \examples{ # Type 'example(copy)' to run these at prompt and browse output diff --git a/man/data.table-class.Rd b/man/data.table-class.Rd index bce3307715..0d870e7243 100644 --- a/man/data.table-class.Rd +++ b/man/data.table-class.Rd @@ -17,9 +17,8 @@ \author{ Steve Lianoglou } \seealso{ - \code{\link{data.table}} + \code{\link{data.table}}, \code{\link{tables}}, \code{\link{J}}, \code{\link[base:order]{sort.list}}, \code{\link{copy}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{chorder}}, \code{\link{setNumericRounding}}, \code{\link{.internal.selfref}} } - \examples{ ## Used in inheritance. setClass('SuperDataTable', contains='data.table') diff --git a/man/datatable-optimize.Rd b/man/datatable-optimize.Rd index 9ce7f308fc..4f681983e6 100644 --- a/man/datatable-optimize.Rd +++ b/man/datatable-optimize.Rd @@ -102,7 +102,9 @@ Auto indexing can be switched off with the global option \code{options(datatable.auto.index = FALSE)}. To switch off using existing indices set global option \code{options(datatable.use.index = FALSE)}. 
} -\seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} } +\seealso{ + \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}}, \code{\link{.internal.selfref}} +} \examples{ \dontrun{ old = options(datatable.optimize = Inf) diff --git a/man/internal.selfref.Rd b/man/internal.selfref.Rd new file mode 100644 index 0000000000..c37a72fe36 --- /dev/null +++ b/man/internal.selfref.Rd @@ -0,0 +1,44 @@ +\name{.internal.selfref} +\alias{.internal.selfref} +\title{Internal Self-Reference Attribute in data.table} +\description{ + The \code{.internal.selfref} attribute is an internal mechanism used by \code{data.table} to optimize memory management and performance. It acts as a pointer that allows \code{data.table} objects to reference their own memory location. While the \code{.internal.selfref} attribute may appear to always point to \code{NULL} when inspected directly, this is a result of its implementation in R's memory management system. The true significance of this attribute lies in its role in supporting reference semantics, which enables efficient in-place modification of \code{data.table} objects without unnecessary copying. + + The \code{.internal.selfref} attribute is deliberately structured so that \code{identical()} checks return \code{TRUE} for two \code{data.table} objects with identical contents, even though their self-references point to different memory addresses. This behavior is achieved by storing the actual self-reference pointer in the \code{prot} part of an external pointer, wrapped in another external pointer to avoid creating visible reference loops. When a \code{data.table} is duplicated, its memory address changes, making it possible to detect the copy and handle it accordingly. +} +\details{ + The \code{.internal.selfref} attribute is a pointer that ensures that \code{data.table} objects can be modified by reference without redundant memory allocation. 
This avoids copying when performing in-place modifications such as adding, updating, or removing columns with \code{:=} or \code{set()}. + + Key details about the \code{.internal.selfref} attribute: + \itemize{ + \item \code{p=NULL} is used instead of \code{R_NilValue}, allowing \code{data.table} to detect objects loaded from disk and ensure correct behavior. + \item Wrapping the self-reference in another external pointer prevents infinite loops during \code{object.size} calculations. + \item If the attribute is removed or corrupted, the next operation involving \code{:=} triggers a warning and creates a new self-reference after copying. + } + + The \code{_selfrefok} function verifies the validity of the \code{.internal.selfref} attribute. It checks whether the attribute correctly references the current \code{data.table} object by comparing memory addresses. If the attribute is invalidated (e.g., due to duplication or corruption), \code{_selfrefok} triggers a repair mechanism to restore reference semantics, ensuring that in-place operations remain efficient. +} +\value{ + The \code{.internal.selfref} attribute is an internal implementation detail and does not produce a value that users would typically interact with. It is invisible during regular \code{data.table} operations. 
+} +\seealso{ + \code{\link{data.table}}, \code{\link{setkey}}, \code{\link{merge}}, \code{\link{[.data.table}} +} +\examples{ + # Create a data.table + dt <- data.table(A = 1:5, B = letters[1:5]) + + # Trace memory to check for reference semantics + tracemem(dt) # Returns the memory address of the data.table as a string + + # Perform an in-place operation + dt[, C := A * 2] # Add a new column in place + + # Verify no copying has occurred + # (tracemem prints a message whenever dt is copied; none should appear here) + + # copy() makes a deep copy, which gets its own valid .internal.selfref + dt_copy <- copy(dt) # Copy the data.table + .Internal(inspect(dt_copy)) # Low-level dump that includes the .internal.selfref attribute +} +\keyword{internal} diff --git a/man/setDT.Rd b/man/setDT.Rd index 9311d0e3b1..6a0797cfeb 100644 --- a/man/setDT.Rd +++ b/man/setDT.Rd @@ -25,7 +25,8 @@ setDT(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., \code{setDT(X)[, sum(B), by=A]}. If you require a copy, take a copy first (using \code{DT2 = copy(DT)}). See \code{?copy}. } -\seealso{ \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}} +\seealso{ + \code{\link[base]{transform}}, \code{\link[base:with]{within}}, \code{\link{:=}}, \code{\link{.internal.selfref}} } \examples{ diff --git a/man/setkey.Rd b/man/setkey.Rd index b8c9b1ce8f..a075d3deb1 100644 --- a/man/setkey.Rd +++ b/man/setkey.Rd @@ -107,11 +107,8 @@ reference. 
\url{https://cran.r-project.org/package=bit64}\cr \url{https://github.com/Rdatatable/data.table/wiki/Presentations} } -\seealso{ \code{\link{data.table}}, \code{\link{tables}}, \code{\link{J}}, -\code{\link[base:order]{sort.list}}, \code{\link{copy}}, \code{\link{setDT}}, -\code{\link{setDF}}, \code{\link{set}} \code{\link{:=}}, \code{\link{setorder}}, -\code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, -\code{\link{chorder}}, \code{\link{setNumericRounding}} +\seealso{ + \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setNumericRounding}}, \code{\link{.internal.selfref}} } \examples{ # Type 'example(setkey)' to run these at the prompt and browse output diff --git a/man/setorder.Rd b/man/setorder.Rd index e1cdc40bba..945c016e0d 100644 --- a/man/setorder.Rd +++ b/man/setorder.Rd @@ -111,9 +111,7 @@ If you require a copy, take a copy first (using \code{DT2 = copy(DT)}). See \url{https://medium.com/basecs/getting-to-the-root-of-sorting-with-radix-sort-f8e9240d4224} } \seealso{ - \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, - \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setDT}}, - \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setNumericRounding}} + \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}, \code{\link{.internal.selfref}} } \examples{ diff --git a/man/transform.data.table.Rd b/man/transform.data.table.Rd index 1bf6fb551d..6225b81057 100644 --- a/man/transform.data.table.Rd +++ b/man/transform.data.table.Rd @@ -33,7 +33,9 @@ columns that appear in \ldots) are not in the key of the data.table. 
\value{ The modified value of a copy of \code{data}. } -\seealso{ \code{\link[base]{transform}}, \code{\link[base:with]{within}} and \code{\link{:=}} } +\seealso{ + \code{\link{data.table}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{.internal.selfref}} +} \examples{ DT <- data.table(a=rep(1:3, each=2), b=1:6)