Clarify $ vs [, col] behavior in docs (#7381)

venom1204 · ben-schwen · web-flow · commit c26450af1d7b · 2025-10-25T14:48:06.000+02:00
* updated FAQ

* updated vignettes

---------

Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
diff --git a/vignettes/datatable-faq.Rmd b/vignettes/datatable-faq.Rmd
@@ -56,7 +56,7 @@ For consistency so that when you use data.table in functions that accept varying
 
 You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R's or data.table's. It's really really bad. Please don't do it. It's the same mantra as professional SQL developers have: never use `select *`, always explicitly select by column name to at least try to be robust to future changes.
 
-Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged.
+Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged. A key difference, however, is that `DT$col` and `DT[['col']]` may return a reference to the column, while `DT[, col]` always returns a copy. This can have important consequences and is explained in `vignette("datatable-reference-semantics", package="data.table")`.
 
 There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write `DT[, 5:10]` and `DT[,c(1,4,10)]`. However, again, it is more robust (to future changes in your data's number of and ordering of columns) to use a named range such as `DT[,columnRed:columnViolet]` or name each one `DT[,c("columnRed","columnOrange","columnYellow")]`. It is harder work up front, but you will probably thank yourself and your colleagues might thank you in the future. At least you can say you tried your best to write robust code if something does go wrong.
 
diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd
@@ -363,6 +363,31 @@ However we could improve this functionality further by *shallow* copying instead
     ## DT_n doesn't get updated
     DT_n
     ```
+### c) Selecting columns: `$` / `[[...]]` vs `[, col]`
+
+When you extract a single column as a vector, there is a subtle but important difference between standard R methods (`$` and `[[...]]`) and data.table's `j` expression. `DT$col` and `DT[['col']]` may return a reference to the column, while `DT[, col]` always returns a copy.
+
+A short example:
+```{r}
+DT = data.table(a = 1:3)
+
+# three ways to get the column
+x_ref = DT$a        # may be a reference
+y_cpy = DT[, a]     # always a copy
+z_cpy = copy(DT$a)  # forced copy
+
+# modify DT by reference
+DT[, a := a + 10L]
+
+# observe results
+x_ref   # may show 11 12 13
+y_cpy   # 1 2 3
+z_cpy   # 1 2 3
+```
+
+To select a single column as a vector, remember:
+- `DT[, mycol]` is safer as it always returns a new, independent copy.
+- `DT$mycol` is fast but may return a reference. Use `copy(DT$mycol)` to guarantee independence.
 
 ## Summary
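
A note on the example added in section c): as far as I understand `:=`, assigning a whole column without `i` can swap a new vector into the table rather than overwrite the old one, so `x_ref` there may still print `1 2 3`. The sketch below is not part of this commit. It reuses the names `DT`, `x_ref`, `y_cpy` and `z_cpy` from the diff and updates a single row instead, which should modify the existing column vector in place and make any sharing visible. `address()` is data.table's helper for comparing memory locations.

```r
library(data.table)

# Build the column from an ordinary integer vector
DT = data.table(a = c(1L, 2L, 3L))

x_ref = DT$a        # may share memory with the column
y_cpy = DT[, a]     # copy returned by the j expression
z_cpy = copy(DT$a)  # explicit copy

# TRUE here means x_ref and the column are the same object in memory
print(address(x_ref) == address(DT$a))

# Sub-assign by reference: only row 2 is touched, so the existing
# column vector is updated rather than replaced
DT[2L, a := 100L]

print(x_ref)  # follows the column if memory is shared
print(y_cpy)  # independent copy, unaffected by the update
print(z_cpy)  # independent copy, unaffected by the update
```

If `x_ref` changes along with `DT`, it was an alias of the column. Taking the vector with `DT[, a]` or wrapping it in `copy()` avoids that.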