2 changes: 2 additions & 0 deletions NEWS.0.md
@@ -1072,6 +1072,8 @@
query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added. Thanks to many including vc273 and Y T for reporting [here](https://stackoverflow.com/questions/20349159/memory-leak-in-data-table-grouped-assignment-by-reference) and [here](https://stackoverflow.com/questions/15651515/slow-memory-leak-in-data-table-when-returning-named-lists-in-j-trying-to-reshap) on SO.
2. In long running computations where data.table is called many times repetitively the following error could sometimes occur, #2647: *"Internal error: .internal.selfref prot is not itself an extptr"*. Now fixed. Thanks to theEricStone, StevieP and JasonB for (difficult) reproducible examples [here](https://stackoverflow.com/questions/15342227/getting-a-random-internal-selfref-error-in-data-table-for-r).
Refer to [internal.selfref](../man/internal.selfref.Rd) for more information about `.internal.selfref`.
Member

Let's revert this one.

3. If `fread` returns a data error (such as no closing quote on a quoted field) it now closes the file first rather than holding a lock open, a Windows only problem.
Thanks to nigmastar for reporting [here](https://stackoverflow.com/questions/18597123/fread-data-table-locks-files) and Carl Witthoft for the hint. Tests added.
1 change: 1 addition & 0 deletions NEWS.md
@@ -744,6 +744,7 @@ rowwiseDT(
27. `as.data.frame(DT)`, `setDF(DT)` and `as.list(DT)` now remove the `"index"` attribute which contains any indices (a.k.a. secondary keys), as they already did for other `data.table`-only attributes such as the primary key stored in the `"sorted"` attribute. When indices were left intact, a subsequent subset, assign, or reorder of the `data.frame` by `data.frame`-code in base R or other packages would not update the indices, causing incorrect results if then converted back to `data.table`, [#4889](https://github.com/Rdatatable/data.table/issues/4889). Thanks @OfekShilon for the report and the PR.

28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`.
Refer to [internal.selfref](../man/internal.selfref.Rd) for additional information.
Member

This will also need to be reverted.
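
As an illustration of the `setkey()` usage that item 28 describes, here is a minimal sketch. The table, its column names (`id`, `date`, `val`) and its values are made up for the example and do not come from the NEWS entry.

```r
library(data.table)

# Hypothetical data, only for illustration
DT = data.table(
  id   = c("foo", "bar", "foo", "baz"),
  date = c(20210728L, 20210728L, 20210729L, 20210728L),
  val  = 1:4
)

# Reorder DT by reference and set the primary key on (id, date)
setkey(DT, id, date)

# Rowname-like lookup on the first key column
r1 = DT["foo", ]

# Lookup on both key columns
r2 = DT[.("foo", 20210728L), ]

# The same lookup written with explicit column names and `==`
r3 = DT[id == "foo" & date == 20210728L]

# Alternative to setkey(): return a new ordered result, leaving DT as it is
r4 = DT[order(val), ]

print(r1)
print(r2)
```

Here `r2` and `r3` select the same row; which form to prefer is mostly a matter of taste, as the NEWS item notes.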


29. A segfault occurred when `nrow/throttle < nthread`, [#5077](https://github.com/Rdatatable/data.table/issues/5077). With the default throttle of 1024 rows (see `?setDTthreads`), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where `data.table` uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit `setDTthreads(throttle=nrow(DT))`.

4 changes: 2 additions & 2 deletions man/assign.Rd
@@ -92,12 +92,12 @@ Since \code{[.data.table} incurs overhead to check the existence and type of arg
\value{
\code{DT} is modified by reference and returned invisibly. If you require a copy, take a \code{\link{copy}} first (using \code{DT2 = copy(DT)}).
}
\seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}}
\seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}},\code{\link{internal.selfref}}
Member

When linking to a help page, use the `\alias{...}` name of the help page (i.e. `.internal.selfref` here), not its file name (i.e. not `internal.selfref`). All other occurrences will need to be fixed too.

}
\examples{
DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7)
DT[, c := 8] # add a numeric column, 8 for all rows
DT[, d := 9L] # add an integer column, 9L for all rows
DT[, d := 9L] # add an integer column, 9L for all rows\code{\link{.Last.updated}}
Member

This syntax doesn't work inside `\examples`. Unfortunately, the Rd format has a lot of little details. Try rendering the help pages using `R CMD Rdconv` or installing the package and browsing its help pages before pushing commits.

Contributor Author

Thanks for the guidance. I will make all the changes and submit them by this evening or tomorrow evening.
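
One way to follow the rendering advice in this thread without leaving R is sketched below. It assumes that `tools::parse_Rd()` and `tools::Rd2txt()` are an acceptable stand-in for `R CMD Rdconv`. The `demo` help page is hypothetical and only exists in a temporary file. Its `\seealso` links by alias name, as requested earlier in the review, and its `\examples` section holds plain R code only.

```r
# Hypothetical throwaway help page, written to a temporary file
rd_lines = c(
  "\\name{demo}",
  "\\alias{demo}",
  "\\title{Demo help page}",
  "\\description{A throwaway page used to try out rendering.}",
  "\\seealso{ \\code{\\link{.internal.selfref}} }",
  "\\examples{",
  "x = 1 + 1  # plain R code; no Rd link markup in here",
  "}"
)
rd_file = tempfile(fileext = ".Rd")
writeLines(rd_lines, rd_file)

# Parse the file; syntax problems would surface here as warnings or errors
rd = tools::parse_Rd(rd_file)

# Render to plain text and show the result
out_file = tempfile(fileext = ".txt")
tools::Rd2txt(rd, out = out_file)
rendered = readLines(out_file)
cat(rendered, sep = "\n")
```

Pointing `rd_file` at a real file such as `man/assign.Rd` in a local checkout should give a quick preview of a page before committing.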

DT[, c := NULL] # remove column c
DT[2, d := -8L] # subassign by reference to d; 2nd row is -8L now
DT # DT changed by reference
2 changes: 1 addition & 1 deletion man/copy.Rd
@@ -24,7 +24,7 @@ A \code{copy()} may be required when doing \code{dt_names = names(DT)}. Due to R
Returns a copy of the object.
}
\seealso{
\code{\link{data.table}}, \code{\link{address}}, \code{\link{setkey}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}} \code{\link{:=}}, \code{\link{setorder}}, \code{\link{setattr}}, \code{\link{setnames}}
\code{\link{data.table}}, \code{\link{address}}, \code{\link{setkey}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{set}} \code{\link{:=}}, \code{\link{setorder}}, \code{\link{setattr}}, \code{\link{setnames}},\code{\link{internal.selfref}}
}
\examples{
# Type 'example(copy)' to run these at prompt and browse output
2 changes: 1 addition & 1 deletion man/data.table-class.Rd
@@ -17,7 +17,7 @@

\author{ Steve Lianoglou }
\seealso{
\code{\link{data.table}}
\code{\link{data.table}},\code{\link{internal.selfref}}
}

\examples{
2 changes: 1 addition & 1 deletion man/datatable-optimize.Rd
@@ -102,7 +102,7 @@ Auto indexing can be switched off with the global option
\code{options(datatable.auto.index = FALSE)}. To switch off using existing
indices set global option \code{options(datatable.use.index = FALSE)}.
}
\seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} }
\seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}},\code{\link{internal.selfref}} }
\examples{
\dontrun{
old = options(datatable.optimize = Inf)
37 changes: 37 additions & 0 deletions man/internal.selfref.rd
Member

For consistency with other help files, let's rename the file to internal.selfref.Rd, with capital R in the extension.

@@ -0,0 +1,37 @@
\name{.internal.selfref}
\alias{.internal.selfref}
\title{Internal Self-Reference Attribute in data.table}
\description{
The \code{.internal.selfref} attribute is an internal mechanism used by \code{data.table} to optimize memory management and performance. It acts as a pointer that allows \code{data.table} objects to reference their own memory location. While the \code{.internal.selfref} attribute may appear to always point to \code{NULL} when inspected directly, this is a result of its implementation in R's memory management system. The true significance of this attribute lies in its role in supporting reference semantics, which enables efficient in-place modification of \code{data.table} objects without unnecessary copying.
Member

It's not that the external pointer appears to point at `NULL`; that's done deliberately so that `identical()` would be `TRUE` for such attributes from two different data.tables. (Why is that needed? So that `identical()` would work on data.tables with otherwise identical contents.) The actual self-reference pointer lives in the `prot` part of the external pointer, visible using `.Internal(inspect(x))`, wrapped into another external pointer to avoid creating a reference loop visible to R. Once the data.table is duplicated, its address will be different from what is stored in the self-reference attribute, making it possible to detect such copies.

Try summarising the comments and the code of `_selfrefok` to document what it actually does.
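
The behaviour described in this comment can be observed from R with a minimal sketch like the one below. The tables `DT1` to `DT4` are made up for illustration. The sketch only looks at the attribute and at object addresses; it does not reproduce the C-level check in `_selfrefok`.

```r
library(data.table)

DT1 = data.table(a = 1:3)
DT2 = data.table(a = 1:3)

# The attribute is an external pointer on every data.table
p1 = attr(DT1, ".internal.selfref")
p2 = attr(DT2, ".internal.selfref")
is_extptr = typeof(p1) == "externalptr"

# The pointer's own address is the same in both tables, so the attribute
# does not get in the way of identical() on the tables themselves
same_attr  = identical(p1, p2)
same_table = identical(DT1, DT2)

# Low-level view of the external pointer, as suggested in the comment
.Internal(inspect(p1))

# Plain assignment binds a second name to the same object;
# copy() creates a new object at a different address
DT3 = DT1
DT4 = copy(DT1)
same_object = address(DT3) == address(DT1)
new_object  = address(DT4) != address(DT1)

print(c(is_extptr, same_attr, same_table, same_object, new_object))
```

Note that `copy()` produces a valid data.table with its own self-reference. The detection described above is aimed at duplicates made by R without data.table's involvement.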

}
\details{
The \code{.internal.selfref} attribute is a pointer that ensures that \code{data.table} objects can be modified by reference without redundant memory allocation. This avoids copying when performing in-place modifications such as adding or updating columns, filtering rows, or performing joins.
While the \code{.internal.selfref} attribute may appear to always point to \code{NULL} when inspected directly, it plays a crucial role in optimizing performance by enabling reference semantics. When the \code{.internal.selfref} is intact, operations on the \code{data.table} can be done efficiently in place. However, if the attribute is lost or corrupted (due to operations that break reference semantics), \code{data.table} reverts to default \code{data.frame}-like behavior, which can result in copying and slower performance.
Users generally do not need to interact directly with \code{.internal.selfref}, but understanding its purpose can be helpful when debugging issues related to memory usage or unexpected copying behavior.
\code{.internal.selfref} is automatically managed by \code{data.table} and is not intended to be modified by users.
}
\value{
The \code{.internal.selfref} attribute is an internal implementation detail and does not produce a value that users would typically interact with. It is invisible during regular \code{data.table} operations.
}
\seealso{
\code{\link{data.table}}, \code{\link{setkey}}, \code{\link{merge}}, \code{\link{[.data.table}}
}
\examples{
library(data.table)
Member

Let's not add library(data.table) to the examples. It's okay to assume that the package is already attached. Both R CMD check and example() will attach the package for us.

# Create a data.table
dt <- data.table(A = 1:5, B = letters[1:5])
# Trace memory to check for reference semantics
tracemem(dt) # Outputs the memory address of the data.table
# Perform an in-place operation
dt[, C := A * 2] # Add a new column in place
# Verify no copying has occurred
# (The output of tracemem should show no memory change)
}
\keyword{internal}
2 changes: 1 addition & 1 deletion man/setDT.Rd
@@ -25,7 +25,7 @@ setDT(x, keep.rownames=FALSE, key=NULL, check.names=FALSE)
The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., \code{setDT(X)[, sum(B), by=A]}. If you require a copy, take a copy first (using \code{DT2 = copy(DT)}). See \code{?copy}.
}

\seealso{ \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}
\seealso{ \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}},\code{\link{internal.selfref}}
}
\examples{

2 changes: 1 addition & 1 deletion man/setkey.Rd
@@ -111,7 +111,7 @@ reference.
\code{\link[base:order]{sort.list}}, \code{\link{copy}}, \code{\link{setDT}},
\code{\link{setDF}}, \code{\link{set}} \code{\link{:=}}, \code{\link{setorder}},
\code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}},
\code{\link{chorder}}, \code{\link{setNumericRounding}}
\code{\link{chorder}}, \code{\link{setNumericRounding}},\code{\link{internal.selfref}}
}
\examples{
# Type 'example(setkey)' to run these at the prompt and browse output
3 changes: 2 additions & 1 deletion man/setorder.Rd
@@ -113,7 +113,8 @@ If you require a copy, take a copy first (using \code{DT2 = copy(DT)}). See
\seealso{
\code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}},
\code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setDT}},
\code{\link{setDF}}, \code{\link{copy}}, \code{\link{setNumericRounding}}
\code{\link{setDF}}, \code{\link{copy}}, \code{\link{setNumericRounding}},
\code{\link{internal.selfref}}
}
\examples{

2 changes: 1 addition & 1 deletion man/transform.data.table.Rd
@@ -33,7 +33,7 @@ columns that appear in \ldots) are not in the key of the data.table.
\value{
The modified value of a copy of \code{data}.
}
\seealso{ \code{\link[base]{transform}}, \code{\link[base:with]{within}} and \code{\link{:=}} }
\seealso{ \code{\link[base]{transform}}, \code{\link[base:with]{within}} and \code{\link{:=}},\code{\link{internal.selfref}} }
\examples{
DT <- data.table(a=rep(1:3, each=2), b=1:6)
