Commit c26450a

Clarify $ vs [, col] behavior in docs (#7381)
* updated FAQ
* updated vignettes

---------

Co-authored-by: Benjamin Schwendinger <[email protected]>
1 parent 358caa2 commit c26450a

2 files changed: +26 -1 lines changed

vignettes/datatable-faq.Rmd

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ For consistency so that when you use data.table in functions that accept varying

 You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R's or data.table's. It's really really bad. Please don't do it. It's the same mantra as professional SQL developers have: never use `select *`, always explicitly select by column name to at least try to be robust to future changes.

-Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged.
+Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged. A key difference, however, is that DT$col and DT[['col']] may return a reference, while DT[, col] always returns a copy. This can have important consequences and is explained in the `vignette("datatable-reference-semantics", package="data.table")`.

 There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write `DT[, 5:10]` and `DT[,c(1,4,10)]`. However, again, it is more robust (to future changes in your data's number of and ordering of columns) to use a named range such as `DT[,columnRed:columnViolet]` or name each one `DT[,c("columnRed","columnOrange","columnYellow")]`. It is harder work up front, but you will probably thank yourself and your colleagues might thank you in the future. At least you can say you tried your best to write robust code if something does go wrong.

vignettes/datatable-reference-semantics.Rmd

Lines changed: 25 additions & 0 deletions
@@ -363,6 +363,31 @@ However we could improve this functionality further by *shallow* copying instead
 ## DT_n doesn't get updated
 DT_n
 ```
+### c) Selecting columns: `$` / `[[...]]` vs `[, col]`
+
+When you extract a single column as a vector, there is a subtle but important difference between standard R methods ($ and [[...]]) and data.table's j expression. DT$col and DT[['col']] may return a reference to the column, while DT[, col] always returns a copy.
+
+A short example:
+```{r}
+DT = data.table(a = 1:3)
+
+# three ways to get the column
+x_ref = DT$a # may be a reference
+y_cpy = DT[, a] # always a copy
+z_cpy = copy(DT$a) # forced copy
+
+# modify DT by reference
+DT[, a := a + 10L]
+
+# observe results
+x_ref # may show 11 12 13
+y_cpy # 1 2 3
+z_cpy # 1 2 3
+```
+
+To select a single column as a vector, remember:
+- `DT[, mycol]` is safer as it always returns a new, independent copy.
+- `DT$mycol` is fast but may return a reference. Use `copy(DT$mycol)` to guarantee independence.

 ## Summary
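The copy-versus-reference distinction described in the added vignette section can be inspected directly. The following is a minimal sketch, not part of the commit: it uses data.table's `address()` helper to compare memory locations, and a row-subset assignment (`DT[2L, a := 99L]`), which is assumed here to write into the existing column vector rather than replace it. The variable names mirror the vignette example.

```r
library(data.table)

DT = data.table(a = 1:3)

x_ref = DT$a          # base R extraction of the column
y_cpy = DT[, a]       # j expression, documented above as always a copy
z_cpy = copy(DT$a)    # explicit copy

# address() reports where an object lives in memory; matching addresses
# indicate that two names point at the same vector
address(x_ref) == address(DT$a)
address(y_cpy) == address(DT$a)

# assumption: sub-assigning by row overwrites the column in place
DT[2L, a := 99L]

x_ref   # may reflect the update if it still shares memory with DT$a
y_cpy   # independent of DT
z_cpy   # independent of DT
```

Comparing addresses before relying on a vector extracted with `$` is one way to tell whether a later `:=` or `set()` call could alter it.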
