man/assign.Rd
Lines changed: 26 additions & 0 deletions
@@ -81,6 +81,32 @@ When \code{LHS} is a factor column and \code{RHS} is a character vector with ite
Unlike \samp{<-} for \code{data.frame}, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given when fractional data is truncated. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then \emph{plonked} into that column slot and we call this \emph{plonk syntax}, or \emph{replace column syntax} if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening and it is clearer to readers of your code that you really do intend to change the column type; e.g., \code{DT[, colA:=as.integer(colA)]}. A plonk occurs whenever you provide a RHS value to \samp{:=} which is \code{nrow} long. When a column is \emph{plonked}, the original column is not updated by reference because that would entail updating every single element of that column whereas the plonk is just one column pointer update.
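The difference between sub-assigning and plonking can be sketched as follows. This is a minimal illustration, assuming the data.table package is attached; the table and its column names are made up for the example.

```r
library(data.table)

DT = data.table(colA = c(1L, 2L, 3L), colB = c("x", "y", "z"))

# Sub-assign: the RHS (a double) is coerced to the integer type of colA.
# 10 is a whole number, so no fractional data is lost.
DT[2L, colA := 10]
class(DT$colA)   # still "integer"

# Plonk: the RHS is nrow(DT) long, so it replaces the whole column
# and the column type is allowed to change.
DT[, colA := as.character(colA)]
class(DT$colA)   # now "character"
```

Only the second call changes the column type, which is why a type change always shows up in the code as a full-length RHS.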
\code{data.table}s are \emph{not} copied-on-change by \code{:=}, \code{setkey} or any of the other \code{set*} functions. See \code{\link{copy}}.
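A short sketch of what "not copied-on-change" means in practice, assuming the data.table package is attached. The names `DT_alias` and `DT_copy` are illustrative only.

```r
library(data.table)

DT = data.table(a = 1:3, b = 4:6)

DT_alias = DT          # no copy: both names refer to the same data.table
DT_copy  = copy(DT)    # explicit deep copy

DT_alias[, c := a + b] # updates DT as well, by reference

names(DT)        # "a" "b" "c"
names(DT_copy)   # "a" "b"
```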
+
+\subsection{Side effects on keys and other attributes}{
+Modifying a key column with \code{:=} has an important side effect: the key attribute is automatically removed, because the new values may no longer satisfy the sort order that the key guarantees. Adding new columns, or modifying columns that are not part of the key, does not affect the key.
+
+For example:
+\preformatted{
+DT = data.table(a=1:3, b=4:6, key="a")
+key(DT)            # "a"
+DT[, a := a + 10]  # modifies key column
+key(DT)            # NULL - key was removed
+
+DT2 = data.table(a=1:3, b=4:6, key="a")
+DT2[, c := 1]      # adds new column
+key(DT2)           # "a" - key preserved
+}
+
+If you need to preserve the original \code{data.table} and its key, apply \code{:=} to a \code{\link{copy}} instead:
+\preformatted{
+DT = data.table(a=1:3, b=4:6, key="a")
+DT_modified = copy(DT)[, a := a + 10]
+key(DT)            # "a" - original preserved
+key(DT_modified)   # NULL - copy was modified
+}
+
+This matters wherever the same \code{data.table} is reused, for example in tests, when it is passed as a function argument, or when it is shared between different parts of your code. See \code{\link{copy}} and \href{../doc/datatable-reference-semantics.html}{\code{vignette("datatable-reference-semantics")}} for more details on reference semantics.
+}
}

\section{Advanced (internals):}{It is easy to see how \emph{sub-assigning} to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the \code{data.table} can be grown \emph{by reference}: the list vector of column pointers is \emph{over-allocated}, see \code{\link{truelength}}. By defining \code{:=} in \code{j} we believe update syntax is natural, and scales, but it also bypasses \code{[<-} dispatch and allows \code{:=} to update by reference with no copies of any part of memory at all.
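Over-allocation can be observed from R. The following is a rough sketch, assuming the data.table package is attached and that the table was created with `data.table()` so that spare column slots exist; `addr_before` is a name introduced for the example.

```r
library(data.table)

DT = data.table(a = 1:3, b = 4:6)

length(DT)                    # number of columns in use
truelength(DT)                # allocated column slots, including spare ones
truelength(DT) > length(DT)   # TRUE while spare slots remain

addr_before = address(DT)
DT[, c := a + b]              # the new column goes into a spare slot
identical(addr_before, address(DT))   # compares the address before and after
```

If the address is unchanged, the list vector of column pointers was grown in place rather than reallocated.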
vignettes/datatable-reference-semantics.Rmd
Lines changed: 36 additions & 1 deletion
@@ -363,7 +363,42 @@ However we could improve this functionality further by *shallow* copying instead
## DT_n doesn't get updated
DT_n
```
-### c) Selecting columns: `$` / `[[...]]` vs `[, col]`
+### c) Side effects on keys and attributes
+
+Modifying a key column with `:=` has an important side effect: the key attribute is automatically removed, because the new values may no longer satisfy the sort order that the key guarantees. Adding new columns, or modifying columns that are not part of the key, does not affect the key.
+DT_modified = copy(DT_original)[, x := toupper(x)]
+
+key(DT_original) # Original key preserved
+key(DT_modified) # Key removed on the copy
+```
+
+This is particularly important when:
+
+* Writing functions that take a data.table as input and use `:=` but shouldn't modify the original
+* Testing code where you need to run multiple tests on the same data.table
+* Working in contexts where you need to preserve the original state (e.g., in loops or when comparing before/after states)
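The first case in the list above can be sketched like this. It is a minimal example, assuming the data.table package is attached; `shift_key_col` and its `offset` argument are hypothetical names, not part of data.table.

```r
library(data.table)

# Hypothetical helper: shifts a key column without touching the caller's table.
shift_key_col = function(dt, offset = 10L) {
  out = copy(dt)            # work on a deep copy
  out[, a := a + offset]    # modifies key column "a" of the copy only
  out[]
}

DT = data.table(a = 1:3, b = 4:6, key = "a")
res = shift_key_col(DT)

key(DT)    # "a": the original and its key are untouched
key(res)   # NULL: the key was dropped on the modified copy
```

Without the `copy()` call, the function would update the caller's `DT` by reference and drop its key as well.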
+
+### d) Selecting columns: `$` / `[[...]]` vs `[, col]`
When you extract a single column as a vector, there is a subtle but important difference between standard R methods (`$` and `[[...]]`) and data.table's `j` expression. `DT$col` and `DT[['col']]` may return a reference to the column, while `DT[, col]` always returns a copy.
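A small sketch of this difference, assuming the data.table package is attached; `x_dollar` and `x_j` are illustrative names. Whether `x_dollar` actually shares memory with the column is an implementation detail, so the example only relies on the copy.

```r
library(data.table)

DT = data.table(a = c(1L, 2L, 3L))

x_dollar = DT$a        # may share memory with the column
x_j      = DT[, a]     # a copy of the column

# Sub-assign one element by reference. (A full-length RHS would replace the
# column pointer instead, leaving previously extracted vectors untouched.)
DT[2L, a := 99L]

x_j        # 1 2 3: the copy does not see the update
x_dollar   # may or may not show 99, depending on whether memory was shared
```

If a stable snapshot of a column is needed, `DT[, col]` or an explicit `copy(DT$col)` is the safer choice.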