closes #4519 added a reference page for .internal.selfref #6696
Changes from 7 commits
ec121fb
9e32585
8f469ce
5c1b813
fbe4a77
521ec12
b3efcdb
99e7898
9ccbede
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -744,6 +744,7 @@ rowwiseDT( | |
| 27. `as.data.frame(DT)`, `setDF(DT)` and `as.list(DT)` now remove the `"index"` attribute which contains any indices (a.k.a. secondary keys), as they already did for other `data.table`-only attributes such as the primary key stored in the `"sorted"` attribute. When indices were left intact, a subsequent subset, assign, or reorder of the `data.frame` by `data.frame`-code in base R or other packages would not update the indices, causing incorrect results if then converted back to `data.table`, [#4889](https://github.com/Rdatatable/data.table/issues/4889). Thanks @OfekShilon for the report and the PR. | ||
|
|
||
| 28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`. | ||
| Refer to [internal.selfref](../man/internal.selfref.Rd) for additional information. | ||
|
||
|
|
||
| 29. A segfault occurred when `nrow/throttle < nthread`, [#5077](https://github.com/Rdatatable/data.table/issues/5077). With the default throttle of 1024 rows (see `?setDTthreads`), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where `data.table` uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit `setDTthreads(throttle=nrow(DT))`. | ||
|
|
||
|
|
||
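The keyed lookups described in NEWS item 28 can be sketched as follows. This is a minimal illustration only: the table `DT` and its columns `id`, `day`, and `value` are made up for the example and do not come from the PR.

```r
library(data.table)

# Hypothetical example data; column names are assumptions for illustration
DT <- data.table(
  id    = c("foo", "bar", "foo", "baz"),
  day   = c(20210728L, 20210728L, 20210729L, 20210728L),
  value = 1:4
)

# Reorder DT by reference and set a two-column primary key
setkey(DT, id, day)

# Rowname-like lookup on the first key column
foo_rows <- DT["foo", ]

# Lookup on both key columns
foo_day <- DT[.("foo", 20210728L), ]

# Equivalent filter written with explicit column names and ==
foo_day_eq <- DT[id == "foo" & day == 20210728L]

# Alternative to setkey(): return a new ordered result, leaving DT as it is
by_day <- DT[order(day, id), ]
```

`setkey()` changes `DT` itself, whereas `DT[order(...), ]` returns a new object, so the second form suits cases where the original row order must be preserved.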
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -92,12 +92,12 @@ Since \code{[.data.table} incurs overhead to check the existence and type of arg | |
| \value{ | ||
| \code{DT} is modified by reference and returned invisibly. If you require a copy, take a \code{\link{copy}} first (using \code{DT2 = copy(DT)}). | ||
| } | ||
| \seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}} | ||
| \seealso{ \code{\link{data.table}}, \code{\link{copy}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{set}}, \code{\link{.Last.updated}}, \code{\link{.internal.selfref}} | ||
|
||
| } | ||
| \examples{ | ||
| DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7) | ||
| DT[, c := 8] # add a numeric column, 8 for all rows | ||
| DT[, d := 9L] # add an integer column, 9L for all rows | ||
| DT[, d := 9L] # add an integer column, 9L for all rows\code{\link{.Last.updated}} | ||
|
||
| DT[, c := NULL] # remove column c | ||
| DT[2, d := -8L] # subassign by reference to d; 2nd row is -8L now | ||
| DT # DT changed by reference | ||
|
|
||
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| \name{.internal.selfref} | ||
| \alias{.internal.selfref} | ||
| \title{Internal Self-Reference Attribute in data.table} | ||
| \description{ | ||
| The \code{.internal.selfref} attribute is an internal mechanism that \code{data.table} uses to support modification by reference. It is a pointer through which a \code{data.table} object refers to its own memory location. The attribute may appear to always point to \code{NULL} when inspected directly; this is a consequence of how it is implemented on top of R's memory management, not a sign that it is unused. Its purpose is to support reference semantics, so that \code{data.table} objects can be modified in place without unnecessary copying. | ||
|
||
| } | ||
| \details{ | ||
| While \code{.internal.selfref} is intact, \code{data.table} can modify the object by reference, for example when adding, updating, or removing columns with \code{:=} or \code{set}, without allocating a copy. | ||
| If the attribute is lost or invalidated, for example after code outside \code{data.table} has copied or rebuilt the object, \code{data.table} can no longer rely on in-place modification and may need to copy the object first, which is slower and uses more memory. | ||
| Users generally do not need to interact directly with \code{.internal.selfref}, but understanding its purpose can be helpful when debugging issues related to memory usage or unexpected copying behavior. | ||
| \code{.internal.selfref} is automatically managed by \code{data.table} and is not intended to be modified by users. | ||
| } | ||
| \value{ | ||
| The \code{.internal.selfref} attribute is an internal implementation detail and does not produce a value that users would typically interact with. It is invisible during regular \code{data.table} operations. | ||
| } | ||
| \seealso{ | ||
| \code{\link{data.table}}, \code{\link{setkey}}, \code{\link{merge}}, \code{\link{[.data.table}} | ||
| } | ||
| \examples{ | ||
| library(data.table) | ||
|
||
| # Create a data.table | ||
| dt <- data.table(A = 1:5, B = letters[1:5]) | ||
| # Trace memory to check for reference semantics | ||
| tracemem(dt) # Outputs the memory address of the data.table | ||
| # Perform an in-place operation | ||
| dt[, C := A * 2] # Add a new column in place | ||
| # Verify no copying has occurred | ||
| # (The output of tracemem should show no memory change) | ||
| } | ||
| \keyword{internal} | ||
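As a complement to the `tracemem()` example in the new Rd file, the following sketch checks for in-place modification using only exported `data.table` helpers (`address()` and `truelength()`) and base R's `attr()`. It is an illustration under stated assumptions, not the package's documented way of inspecting the attribute; the variable names are made up for the example.

```r
library(data.table)

dt <- data.table(A = 1:5, B = letters[1:5])

# The attribute is stored as an external pointer on the data.table
ptr <- attr(dt, ".internal.selfref")
ptr_type <- typeof(ptr)

# Column slots are over-allocated, so truelength() is normally larger than length()
slots_allocated <- truelength(dt)
slots_used <- length(dt)

# Record the address, add a column by reference, and compare
addr_before <- address(dt)
dt[, C := A * 2]
addr_after <- address(dt)

# Same address means the column was added in place, without copying dt
same_object <- identical(addr_before, addr_after)
```

Comparing `address()` before and after avoids relying on `tracemem()`, which is only available when R was built with memory profiling enabled.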
Let's revert this one.