Commit 463693f

clarified the case of .I
1 parent c27ec26 commit 463693f

2 files changed

+44
-1
lines changed

man/data.table.Rd

Lines changed: 8 additions & 1 deletion
@@ -111,7 +111,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
111111
\item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}
112112
}
113113
114-
\emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
114+
\emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. Note that for rows in \code{i} with no match, the group of matching rows in \code{x} is empty, so special symbols that describe the rows of the group (e.g., \code{.I} or \code{.N}) evaluate to \code{0} for such groups. By contrast, a column of \code{x} selected in \code{j} is \code{NA} for those rows, as governed by the \code{nomatch} argument. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
115115
116116
\emph{Advanced:} In the \code{X[Y, j]} form of grouping, the \code{j} expression sees variables in \code{X} first, then \code{Y}. We call this \emph{join inherited scope}. If the variable is not in \code{X} or \code{Y} then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.}
117117
@@ -320,6 +320,13 @@ DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins
320320
DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
321321
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()
322322

323+
# Why .I is 0 for non-matching rows with by=.EACHI:
324+
d1 = data.table(v = c("A", "B", "C", "A", "C"), val = 1:5)
325+
d2 = data.table(v = c("D", "A", "G", "C"))
326+
# selecting column 'val' returns NA for non-matches, per nomatch=NA
327+
d1[d2, on = .(v), .(val), by = .EACHI]
328+
d1[d2, on = .(v), .I, by = .EACHI]    # .I is 0 for non-matches: the group is empty
329+
323330
# joins as subsets
324331
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
325332
X

vignettes/datatable-joins.Rmd

Lines changed: 36 additions & 0 deletions
@@ -259,6 +259,42 @@ dt2 = ProductReceived[
259259
identical(dt1, dt2)
260260
```
261261

262+
##### Understanding `j` Evaluation with `by=.EACHI` for Non-Matches
263+
264+
A common point of confusion arises when using special symbols like `.I` in `j` with `by=.EACHI`. The behavior for non-matching rows differs from what you might expect when selecting a regular column.
265+
266+
Let's illustrate with a simple example:
267+
```{r by-eachi-special-symbols}
268+
d1 = data.table(v = c("A", "B", "C", "A", "C"), i_col = 1:5)
269+
d2 = data.table(v = c("D", "A", "G", "C"))
270+
```
271+
272+
*Case 1: Selecting a regular column*
273+
274+
When we select a column from `x` (`d1`), non-matching rows from `i` (`d2`) result in `NA`. This is the standard behavior governed by `nomatch = NA`.
275+
```{r}
276+
d1[d2, on = .(v), .(i_col), by = .EACHI]
277+
```
278+
279+
For the rows `D` and `G` in `d2`, there is no matching row in `d1`, so the value for `i_col` is missing (`NA`).
280+
281+
*Case 2: Evaluating the special symbol `.I`*
282+
283+
However, when we use the special symbol `.I`, non-matching rows evaluate to `0`.
284+
```{r}
285+
d1[d2, on = .(v), .I, by = .EACHI]
286+
```
287+
288+
The two cases differ because `j` does something different in each:
289+
- In Case 1, we are performing a value lookup. A failed lookup results in a missing value (`NA`).
290+
- In Case 2, we are performing an evaluation. The symbol `.I` is defined as "the row indices in `x` for the current group". For non-matching rows like `D`, the group of matching rows in `d1` is empty, so the set of indices for that group is `integer(0)`. `data.table` represents this zero-length result as a single `0` in the output.
291+
292+
This logic is consistent with other special symbols like `.N` (the number of rows in a group), which also correctly evaluates to `0` for non-matching groups.
293+
294+
```{r}
295+
d1[d2, on = .(v), .N, by = .EACHI]
296+
```
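
The empty-group reasoning can be reproduced without `data.table` at all. The following is a minimal base-R sketch that only imitates the idea of grouping by each row of `i`; it is not how `data.table` implements joins internally, and the names `x_keys`, `i_keys`, `idx` and `n_rows` are made up for this illustration.

```r
# Base-R imitation of "grouping by each i"; not data.table's internal code.
x_keys = c("A", "B", "C", "A", "C")   # plays the role of d1$v
i_keys = c("D", "A", "G", "C")        # plays the role of d2$v

# For each key in i, collect the matching row indices in x (the analogue of .I).
idx = lapply(i_keys, function(k) which(x_keys == k))
names(idx) = i_keys

# Number of matching rows per key (the analogue of .N).
n_rows = lengths(idx)

print(idx[["D"]])   # integer(0): the group for a non-matching key is empty
print(idx[["A"]])   # row indices 1 and 4
print(n_rows)       # zero rows for D and G, two rows each for A and C
```

A non-matching key yields a zero-length index vector, not a missing value, which is why a count of `0` is the natural result for `.N` in such groups.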
297+
262298
#### 3.1.4. Joining based on several columns
263299

264300
So far we have just joined `data.table`s based on 1 column, but it's important to know that the package can join tables matching several columns.
