Merged
2 changes: 1 addition & 1 deletion vignettes/datatable-faq.Rmd
@@ -56,7 +56,7 @@ For consistency so that when you use data.table in functions that accept varying

You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R's or data.table's. It's really really bad. Please don't do it. It's the same mantra as professional SQL developers have: never use `select *`, always explicitly select by column name to at least try to be robust to future changes.

Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged.
Say column 5 is named `"region"` and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write `DT$region` or `DT[["region"]]`; i.e., the same as base R. Using base R's `$` and `[[` on data.table is encouraged. Not when combined with `<-` to assign (use `:=` instead for that) but just to select a single column by name they are encouraged. A key difference, however, is that `DT$col` may return a reference, while `DT[, col]` always returns a copy. This can have important consequences and is explained in `vignette("datatable-reference-semantics", package="data.table")`.

There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write `DT[, 5:10]` and `DT[,c(1,4,10)]`. However, again, it is more robust (to future changes in your data's number of and ordering of columns) to use a named range such as `DT[,columnRed:columnViolet]` or name each one `DT[,c("columnRed","columnOrange","columnYellow")]`. It is harder work up front, but you will probably thank yourself and your colleagues might thank you in the future. At least you can say you tried your best to write robust code if something does go wrong.
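
As a minimal sketch of the advice above, the following builds a small table and selects columns by name rather than by position. The table and its column names (`id`, `region`, `columnRed`, and so on) are hypothetical and chosen only to mirror the names used in the text.

```r
library(data.table)

# Hypothetical table; the column names are illustrative only
DT = data.table(
  id           = 1:3,
  region       = c("north", "south", "east"),
  columnRed    = c(0.1, 0.2, 0.3),
  columnOrange = c(1.1, 1.2, 1.3),
  columnYellow = c(2.1, 2.2, 2.3)
)

# Extract a single column as a vector, by name
r1 = DT$region
r2 = DT[["region"]]

# Select several columns by named range, or by listing each name
by_range = DT[, columnRed:columnYellow]
by_name  = DT[, c("columnRed", "columnOrange", "columnYellow")]

# Move region to the front: position 2 now holds a different column,
# but selection by name still finds the right one
setcolorder(DT, "region")
r3 = DT[["region"]]
```

After the reorder, `DT[[2]]` refers to `id` rather than `region`, which is the kind of silent change that selecting by name avoids.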

31 changes: 31 additions & 0 deletions vignettes/datatable-reference-semantics.Rmd
@@ -363,6 +363,37 @@ However we could improve this functionality further by *shallow* copying instead
## DT_n doesn't get updated
DT_n
```
### c) Selecting columns: `$` vs `[, col]`

When selecting a column as a vector, `DT$col` and `DT[, col]` have a key difference: `DT$col` may return a reference to the data, while `DT[, col]` always returns a copy.

This means a variable created with `$` can change if the `data.table` is modified later, which can be unexpected. A single example demonstrates this behavior and how `copy()` provides a solution:

```{r}
library(data.table)
# Review comment (Member) on the line above: "No need"

DT = data.table(a = 1:3)

# Create three variables using the different methods
x_ref = DT$a # 1. By reference with $
y_cpy = DT[, a] # 2. By copy with []
z_cpy = copy(DT$a) # 3. Forced copy with copy()

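# Now, modify the original data.table by reference.
# A row-subset `:=` overwrites the existing column vector in place
# (a whole-column `:=` can instead swap in a new vector, leaving x_ref unchanged).
DT[2:3, a := a + 10L]

# Check the variables:
x_ref
#> [1]  1 12 13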
y_cpy
#> [1] 1 2 3
z_cpy
#> [1] 1 2 3
```

To select a single column as a vector, remember:
- `DT[, mycol]` is safer as it always returns a new, independent copy.
- `DT$mycol` is fast but may return a reference. Use `copy(DT$mycol)` to guarantee independence.
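
One way to check which of these vectors share memory with the column is to compare addresses. The sketch below uses `address()` from `data.table`; the variable names are illustrative, and the comparison results are left for the reader to run rather than stated here.

```r
library(data.table)

DT = data.table(a = 1:3)

x_ref = DT$a          # may share memory with the column
y_cpy = DT[, a]       # independent copy
z_cpy = copy(DT$a)    # independent copy

# address() reports where a vector lives in memory;
# equal addresses mean the two names point at the same data
address(x_ref) == address(DT$a)
address(y_cpy) == address(DT$a)
address(z_cpy) == address(DT$a)
```

If the first comparison is `TRUE`, later in-place changes to `DT$a` will also be visible through `x_ref`.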

## Summary
