\item{by}{ Column names are seen as if they are variables (as in \code{j} when \code{with=TRUE}). The \code{data.table} is then grouped by the \code{by} and \code{j} is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. \code{by} accepts:

\itemize{
-\item A single unquoted column name: e.g., \code{DT[, .(sa=sum(a)), by=x]}
+\item A single unquoted column name or expression, e.g., \code{DT[, .(sa=sum(a)), by=x]} or \code{by=x\%\%2}. This is a convenience; for multiple expressions a \code{list()} is required.

-\item a \code{list()} of expressions of column names: e.g., \code{DT[, .(sa=sum(a)), by=.(x=x>0, y)]}
+\item a \code{list()} of expressions of column names, e.g., \code{DT[, .(sa=sum(a)), by=.(x>0, y)]}. Use a named list to set the names of the resulting grouping columns, e.g., \code{by=.(x_is_positive=x>0, y)}. As a concise shortcut for a \emph{single} expression, you can also use parentheses to name the output column, e.g., \code{by=(grp = x \%\% 2)}.

\item a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., \code{DT[, sum(a), by="x,y,z"]}
vignettes/datatable-programming.Rmd
39 additions & 0 deletions
@@ -46,6 +46,45 @@ subset(iris, Species == "setosa")
Here, `subset` takes the second argument and evaluates it within the scope of the `data.frame` given as its first argument. This removes the need for variable repetition, making it less prone to errors, and makes the code more readable.

+### Dynamic Grouping and Naming Syntax
+
+Besides the programmatic use with `env`, `data.table` offers concise syntax for interactive use, especially in the `by` argument.
+
+```{r by_syntax_setup_concise}
+d = data.table(x = 1:4, y = 2:5)
+```
+
+#### Grouping by Expressions in `by`
+
+For convenience, `data.table` lets you group by a single expression directly, without wrapping it in `list()` or `.()`. To name the resulting grouping column, you have two options:
+
+```{r by_syntax_naming_concise}
+# 1. The canonical way: a named list (required for multiple expressions)
+d[, sum(y), by = .(grp = x %% 2)]
+
+# 2. A concise shortcut: parentheses (for a single expression)
+d[, sum(y), by = (grp = x %% 2)]
+```
+
+The `(grp = ...)` form is ordinary base R syntax (an assignment wrapped in parentheses); `data.table` uses it to pick up the intended column name.
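
To see what the parenthesized form looks like before it is evaluated, the following base R sketch captures it with `quote()` and takes it apart. It only illustrates how R parses the expression; it is not a description of `data.table` internals, and the variable names (`e`, `inner`, `target`, `rhs`, `groups`) are chosen here for illustration.

```r
# Capture the by-style expression without evaluating it.
e <- quote((grp = x %% 2))

# The outer layer is a call to `(` with a single argument.
outer_is_paren <- is.call(e) && identical(e[[1]], as.name("("))

# That argument is an assignment made with `=`.
inner <- e[[2]]
inner_is_assign <- is.call(inner) && identical(inner[[1]], as.name("="))

# Left-hand side: the intended name. Right-hand side: the grouping expression.
target <- as.character(inner[[2]])   # "grp"
rhs <- inner[[3]]

# Evaluating the right-hand side alone yields the grouping values.
x <- 1:4
groups <- eval(rhs)
```

Because the name lives on the left-hand side of the captured assignment, code that inspects the unevaluated expression can recover it, whereas code that only sees the evaluated result cannot.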
+
+#### Important Contrast: Naming in `j` vs. `by`
+
+This parentheses shortcut for naming does **not** work in `j`. In `j`, you must use the canonical `.(new_name = ...)` syntax to create a named column.
+
+```{r by_syntax_j_concise}
+# Correct way to name a new column in `j`
+d[, .(sum_y = sum(y)), by = .(grp = x %% 2)]
+
+# This will not create a column named 'sum_y'
+d[, (sum_y = sum(y)), by = .(grp = x %% 2)]
+```
+In the second case, base R evaluates the parenthesized assignment and returns only its value, without a name. `data.table` then gives this unnamed result the default column name `V1`.
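
The loss of the name can be reproduced in base R alone. The sketch below is a minimal illustration outside of `data.table`: it uses `list()` in place of `.()` (which is an alias for `list()` inside a `data.table` call), and `res_paren` and `res_list` are names introduced here for the example.

```r
y <- 2:5

# Parenthesized assignment: assigns `sum_y` as a side effect and
# returns the bare value, with no name attached.
res_paren <- (sum_y = sum(y))
names(res_paren)   # NULL

# A named list carries the name along with the value.
res_list <- list(sum_y = sum(y))
names(res_list)    # "sum_y"
```

Both forms compute the same number; only the list form keeps the label that can become a column name.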
+
+**Takeaway:**
+* In `by`, `(name = expr)` is a valid shortcut for `.(name = expr)`.
+* In `j`, you must always use `.(name = expr)` to create a named column.
+
## Problem description
The problem with this kind of interface is that we cannot easily parameterize the code that uses it. This is because the expressions passed to those functions are substituted before being evaluated.
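
As a small base R illustration of this difficulty, consider wrapping `subset` in a function that receives the column name as a string. The helpers `filter_eq` and `filter_eq_se` are hypothetical names used only for this sketch; the second one shows one possible standard-evaluation workaround, not the approach this vignette goes on to recommend.

```r
# Hypothetical helper: tries to parameterize the column passed to subset().
filter_eq <- function(data, col, value) {
  # `col` is not a column of `data`, so it resolves to the string argument;
  # the comparison is "Species" == "setosa", which is FALSE for every row.
  subset(data, col == value)
}
n_nse <- nrow(filter_eq(iris, "Species", "setosa"))

# One workaround: look the column up by name with standard evaluation.
filter_eq_se <- function(data, col, value) {
  data[data[[col]] == value, , drop = FALSE]
}
n_se <- nrow(filter_eq_se(iris, "Species", "setosa"))

n_nse   # 0
n_se    # 50
```

The first helper runs without error but silently returns no rows, which is what makes this kind of interface awkward to use programmatically.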