Commit ac0aa29

Clarify when := drops keys and when to use copy()
FIXES #7409
1 parent 171e272 commit ac0aa29

2 files changed: 62 additions, 1 deletion

man/assign.Rd

Lines changed: 26 additions & 0 deletions
@@ -81,6 +81,32 @@ When \code{LHS} is a factor column and \code{RHS} is a character vector with ite
 Unlike \samp{<-} for \code{data.frame}, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given when fractional data is truncated. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then \emph{plonked} into that column slot and we call this \emph{plonk syntax}, or \emph{replace column syntax} if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening and it is clearer to readers of your code that you really do intend to change the column type; e.g., \code{DT[, colA:=as.integer(colA)]}. A plonk occurs whenever you provide a RHS value to \samp{:=} which is \code{nrow} long. When a column is \emph{plonked}, the original column is not updated by reference because that would entail updating every single element of that column whereas the plonk is just one column pointer update.
 
 \code{data.table}s are \emph{not} copied-on-change by \code{:=}, \code{setkey} or any of the other \code{set*} functions. See \code{\link{copy}}.
+
+\subsection{Side effects on keys and other attributes}{
+An important side effect to be aware of: when you modify a key column using \code{:=}, the key attribute is automatically removed. This happens because modifying the values in a key column would violate the sorted invariant that the key represents. Adding new columns (that are not part of the key) or modifying non-key columns does not affect the key.
+
+For example:
+\preformatted{
+DT = data.table(a=1:3, b=4:6, key="a")
+key(DT) # "a"
+DT[, a := a + 10] # modifies key column
+key(DT) # NULL - key was removed
+
+DT2 = data.table(a=1:3, b=4:6, key="a")
+DT2[, c := 1] # adds new column
+key(DT2) # "a" - key preserved
+}
+
+If you need to preserve the original data.table and its key attribute, use \code{\link{copy}} before the \code{:=} operation:
+\preformatted{
+DT = data.table(a=1:3, b=4:6, key="a")
+DT_modified = copy(DT)[, a := a + 10]
+key(DT) # "a" - original preserved
+key(DT_modified) # NULL - copy was modified
+}
+
+This is particularly important in contexts where the same data.table may be reused multiple times, such as in testing scenarios, function arguments, or when passing data.tables between different parts of your code. See \code{\link{copy}} and \href{../doc/datatable-reference-semantics.html}{\code{vignette("datatable-reference-semantics")}} for more details on reference semantics.
+}
 }
 
 \section{Advanced (internals):}{It is easy to see how \emph{sub-assigning} to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the \code{data.table} can be grown \emph{by reference}: the list vector of column pointers is \emph{over-allocated}, see \code{\link{truelength}}. By defining \code{:=} in \code{j} we believe update syntax is natural, and scales, but it also bypasses \code{[<-} dispatch and allows \code{:=} to update by reference with no copies of any part of memory at all.

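The "function arguments" case mentioned in the added Rd text can be sketched as follows. This is a minimal illustration, not part of the commit: the helper names `shift_a_inplace` and `shift_a_copy` are hypothetical, and the sketch only assumes the documented behaviour that `:=` on a key column drops the key and that `copy()` returns an independent table.

```r
library(data.table)

# Hypothetical helper: uses := directly on its argument, so the
# caller's table is modified by reference and loses its key.
shift_a_inplace = function(dt) {
  dt[, a := a + 10L]
  invisible(dt)
}

# Hypothetical helper: works on a copy, leaving the caller's table alone.
shift_a_copy = function(dt) {
  out = copy(dt)
  out[, a := a + 10L]
  out
}

DT = data.table(a = 1:3, b = 4:6, key = "a")

res = shift_a_copy(DT)
key_after_copy = key(DT)   # "a": the caller's key survives
key_of_result = key(res)   # NULL: only the copy lost its key

shift_a_inplace(DT)
key_after_inplace = key(DT)  # NULL: the caller's table was changed
```

Whether a function should copy its input is a design choice: copying costs memory, while modifying in place surprises callers who rely on the key.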
vignettes/datatable-reference-semantics.Rmd

Lines changed: 36 additions & 1 deletion
@@ -363,7 +363,42 @@ However we could improve this functionality further by *shallow* copying instead
 ## DT_n doesn't get updated
 DT_n
 ```
-### c) Selecting columns: `$` / `[[...]]` vs `[, col]`
+### c) Side effects on keys and attributes
+
+An important side effect to be aware of: when you modify a key column using `:=`, the key attribute is automatically removed. This happens because modifying the values in a key column would violate the sorted invariant that the key represents. Adding new columns (that are not part of the key) or modifying non-key columns does not affect the key.
+
+```{r}
+# Modifying a key column removes the key
+DT_keyed = data.table(x = c("a", "a", "b", "b"), y = 1:4, key = "x")
+key(DT_keyed)
+
+# Modify the key column
+DT_keyed[, x := toupper(x)]
+key(DT_keyed) # Key was removed!
+
+# Adding new columns preserves the key
+DT_keyed2 = data.table(x = c("a", "a", "b", "b"), y = 1:4, key = "x")
+DT_keyed2[, z := y * 2]
+key(DT_keyed2) # Key is still present
+```
+
+If you need to preserve the original data.table and its key attribute, use `copy()`:
+
+```{r}
+DT_original = data.table(x = c("a", "a", "b", "b"), y = 1:4, key = "x")
+DT_modified = copy(DT_original)[, x := toupper(x)]
+
+key(DT_original) # Original key preserved
+key(DT_modified) # Key removed on the copy
+```
+
+This is particularly important when:
+
+* Writing functions that take a data.table as input and use `:=` but shouldn't modify the original
+* Testing code where you need to run multiple tests on the same data.table
+* Working in contexts where you need to preserve the original state (e.g., in loops or when comparing before/after states)
+
+### d) Selecting columns: `$` / `[[...]]` vs `[, col]`
 
 When you extract a single column as a vector, there is a subtle but important difference between standard R methods ($ and [[...]]) and data.table's j expression. DT$col and DT[['col']] may return a reference to the column, while DT[, col] always returns a copy.

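The loop and before/after scenario from the vignette's bullet list can be sketched like this. It is an illustrative sketch rather than code from the commit: `DT_base`, `transforms`, and `keys_after` are hypothetical names, and each iteration works on `copy(DT_base)` so the shared table keeps both its values and its key.

```r
library(data.table)

# Shared, keyed table that several iterations need in its original state.
DT_base = data.table(x = c("a", "a", "b", "b"), y = 1:4, key = "x")

# Hypothetical set of transformations to try on the key column.
transforms = list(
  upper    = toupper,
  suffixed = function(v) paste0(v, "_1")
)

# Each iteration modifies its own copy; the key is dropped on the copy only.
keys_after = lapply(transforms, function(f) {
  DT_tmp = copy(DT_base)
  DT_tmp[, x := f(x)]
  key(DT_tmp)
})

key(DT_base)   # still "x"
DT_base$x      # still the original values
```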