Commit 579566c

updated doc
1 parent 59f966c commit 579566c

2 files changed: +32 -1 lines changed

vignettes/datatable-faq.Rmd

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ For consistency so that when you use data.table in functions that accept varying

 You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R's or data.table's. It's really really bad. Please don't do it. It's the same mantra as professional SQL developers have: never use `select *`, always explicitly select by column name to at least try to be robust to future changes.

-Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged.
+Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged. A key difference, however, is that `DT$col` may return a reference, while `DT[, col]` always returns a copy. This can have important consequences and is explained in `vignette("datatable-reference-semantics", package="data.table")`.

 There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write `DT[, 5:10]` and `DT[,c(1,4,10)]`. However, again, it is more robust (to future changes in your data's number of and ordering of columns) to use a named range such as `DT[,columnRed:columnViolet]` or name each one `DT[,c("columnRed","columnOrange","columnYellow")]`. It is harder work up front, but you will probably thank yourself and your colleagues might thank you in the future. At least you can say you tried your best to write robust code if something does go wrong.
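
The name-based selections discussed in this FAQ entry can be sketched as follows. This is a minimal illustration and not part of the commit: the table and its column names (`id`, `columnRed`, `columnOrange`, `columnYellow`, `region`) are hypothetical, chosen only to mirror the names used in the text.

```r
library(data.table)

# Hypothetical table; column names are illustrative only
DT = data.table(
  id           = 1:3,
  columnRed    = c(0.1, 0.2, 0.3),
  columnOrange = 4:6,
  columnYellow = letters[1:3],
  region       = c("N", "S", "E")   # column 5 at this point
)

# Single column as a vector, selected by name
region_vec = DT[["region"]]

# Several columns selected by name, and the same columns as a named range
by_name  = DT[, c("columnRed", "columnOrange", "columnYellow")]
by_range = DT[, columnRed:columnYellow]

# Reorder the columns, as might happen "higher up" in a program
setcolorder(DT, c("region", "id"))

# Selection by name is unaffected by the reordering ...
identical(DT[["region"]], region_vec)
#> [1] TRUE

# ... but column number 5 is now a different column
names(DT[, 5])
#> [1] "columnYellow"
```

After the reordering, any code that still asked for column 5 would silently receive `columnYellow` instead of `region`, which is the failure mode the FAQ warns about.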

vignettes/datatable-reference-semantics.Rmd

Lines changed: 31 additions & 0 deletions
@@ -363,6 +363,37 @@ However we could improve this functionality further by *shallow* copying instead
 ## DT_n doesn't get updated
 DT_n
 ```
+### c) Selecting columns: `$` vs `[, col]`
+
+When selecting a column as a vector, `DT$col` and `DT[, col]` have a key difference: `DT$col` may return a reference to the data, while `DT[, col]` always returns a copy.
+
+This means a variable created with `$` can change if the `data.table` is modified later, which can be unexpected. A single example demonstrates this behavior and how `copy()` provides a solution:
+
+```{r}
+library(data.table)
+
+DT = data.table(a = 1:3)
+
+# Create three variables using the different methods
+x_ref = DT$a        # 1. By reference with $
+y_cpy = DT[, a]     # 2. By copy with []
+z_cpy = copy(DT$a)  # 3. Forced copy with copy()
+
+# Now, update rows 2:3 in place; sub-assignment with := writes into the existing column vector
+DT[2:3, a := a + 10L]
+
+# Check the variables:
+x_ref
+#> [1]  1 12 13
+y_cpy
+#> [1] 1 2 3
+z_cpy
+#> [1] 1 2 3
+```
+
+To select a single column as a vector, remember:
+- `DT[, mycol]` is safer as it always returns a new, independent copy.
+- `DT$mycol` is fast but may return a reference. Use `copy(DT$mycol)` to guarantee independence.

 ## Summary
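
One way to check whether two vectors share memory is data.table's `address()` function, which returns an object's memory address as a string. The sketch below is an illustration, not part of the commit. It reuses the variable names from the example above and relies on the behavior described in the added section: `DT$a` hands back the column itself, while `DT[, a]` and `copy()` produce new vectors.

```r
library(data.table)

DT = data.table(a = 1:3)

x_ref = DT$a        # extracted with $
y_cpy = DT[, a]     # extracted with [
z_cpy = copy(DT$a)  # explicit copy

# Compare each vector's address with the address of the column inside DT
shares_memory = c(
  x_ref = address(x_ref) == address(DT$a),
  y_cpy = address(y_cpy) == address(DT$a),
  z_cpy = address(z_cpy) == address(DT$a)
)
shares_memory
```

A `TRUE` entry means the variable points at the same memory as the column, so in-place updates to `DT` would be visible through it. A `FALSE` entry means the variable is independent.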
