Commit fb419f1

Clarified setindex vs setkey subsetting behavior (#7047)
* updated docs
* updated
* changed directory
* Update man/setkey.Rd
  Co-authored-by: Michael Chirico <[email protected]>
* minor
* setkey.rd
* simplify
* shrink diff
* restore blank line
* simplify
* unclear, and not sure it's correct
* give chunk names

---------

Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
1 parent 5bbc4d5 commit fb419f1

3 files changed, with 31 additions and 1 deletion

man/data.table.Rd

Lines changed: 3 additions & 1 deletion
@@ -174,7 +174,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
 \item For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.
 \item From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.
 }
-See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
+Note that providing \code{on} is \emph{required} for \code{X[Y]} joins when \code{X} is unkeyed. See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
 }
 
 \item{env}{ List or an environment, passed to \code{\link{substitute2}} for substitution of parameters in \code{i}, \code{j} and \code{by} (or \code{keyby}). Use \code{verbose} to preview constructed expressions. For more details see \href{../doc/datatable-programming.html}{\code{vignette("datatable-programming")}}. }
@@ -298,7 +298,9 @@ DT[, sum(v), by=x][order(x)] # same but by chaining expressions together
 
 # fast ad hoc row subsets (subsets as joins)
 DT["a", on="x"] # same as x == "a" but uses binary search (fast)
+# NB: requires DT to be keyed!
 DT["a", on=.(x)] # same, for convenience, no need to quote every column
+# NB: works regardless of whether or not DT is keyed!
 DT[.("a"), on="x"] # same
 DT[x=="a"] # same, single "==" internally optimised to use binary search (fast)
 DT[x!="b" | y!=3] # not yet optimized, currently vector scan subset
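
The ad hoc subsets in this hunk can be tried directly on a table that has no key. The following is a minimal sketch, not part of the commit: the definition of `DT` is not shown in this diff, so the table below is an assumption, and the names `quoted`, `unquoted` and `scanned` are illustrative only.

```r
library(data.table)

# Assumed example table; its definition does not appear in this diff.
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
stopifnot(!haskey(DT))           # confirm that no key has been set

quoted   <- DT["a", on = "x"]    # join column named as a string
unquoted <- DT["a", on = .(x)]   # same column named with .() syntax
scanned  <- DT[x == "a"]         # ordinary logical subset, for comparison

print(quoted)
print(unquoted)
print(scanned)
```

In this sketch all three calls select the rows where `x` is `"a"`, and `DT` itself stays unkeyed throughout.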

man/setkey.Rd

Lines changed: 8 additions & 0 deletions
@@ -74,6 +74,14 @@ The sort is \emph{stable}; i.e., the order of ties (if any) is preserved.
 For character vectors, \code{data.table} takes advantage of R's internal global string cache, also exported as \code{\link{chorder}}.
 }
 
+\section{Keys vs. Indices}{
+Setting a key (with \code{setkey}) and an index (with \code{setindex}) are similar, but have very important distinctions.
+
+Setting a key physically reorders the data in RAM.
+
+Setting an index computes the sort order, but instead of applying the reordering, simply \emph{stores} this computed ordering. That means that multiple indices can coexist, and that the original row order is preserved.
+}
+
 \section{Good practice}{
 In general, it's good practice to use column names rather than numbers. This is
 why \code{setkey} and \code{setkeyv} only accept column names.
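
The distinction drawn in the new "Keys vs. Indices" section can be checked with a small experiment. This is a hedged sketch rather than text from the commit: the table and the names `indexed` and `keyed` are made up for illustration, and it relies only on `setkey`, `setindex`, `key`, `indices` and `copy` from data.table.

```r
library(data.table)

# Hypothetical three-row table whose 'id' column starts out unsorted.
DT <- data.table(id = c(3L, 1L, 2L), grp = c("b", "a", "c"))

# Indices: the ordering is stored alongside the table, rows stay in place.
indexed <- copy(DT)
setindex(indexed, id)
setindex(indexed, grp)     # a second index coexists with the first
print(indices(indexed))    # names of the stored indices
print(key(indexed))        # NULL, since no key was set
print(indexed$id)          # row order is unchanged

# Key: the rows themselves are reordered by the key column.
keyed <- copy(DT)
setkey(keyed, id)
print(key(keyed))
print(keyed$id)            # rows are now sorted by 'id'
```

Working on copies keeps the two cases independent, since both `setkey` and `setindex` modify their argument by reference.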

vignettes/datatable-secondary-indices-and-auto-indexing.Rmd

Lines changed: 20 additions & 0 deletions
@@ -62,6 +62,26 @@ Secondary indices are similar to `keys` in *data.table*, except for two major di
 
 * There can be more than one secondary index for a data.table (as we will see below).
 
+#### Keyed vs. Indexed Subsetting
+
+While both **keys** and **indices** enable fast binary search subsetting, they differ significantly in usage:
+
+**Keyed subsetting** (implicit column matching)
+
+```{r keyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setkey(DT, a) # Set key, reordering DT
+DT[.(TRUE)] # 'on' is optional; if omitted, the key is used
+```
+
+**Indexed subsetting** (explicit column specification)
+
+```{r unkeyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setindex(DT, a) # Set index only (no reorder)
+DT[.(TRUE), on = "a"] # 'on' is required
+```
+
 ### b) Set and get secondary indices
 
 #### -- How can we set the column `origin` as a secondary index in the *data.table* `flights`?
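
The "'on' is required" comment in the indexed chunk can be probed by leaving `on` out on purpose. The sketch below is an illustration under assumptions, not part of the commit: it reuses the two-row `DT` from the added chunks, the names `with_on` and `without_on` are hypothetical, and the call without `on` is wrapped in `tryCatch` because it is expected to fail on a table that has an index but no key.

```r
library(data.table)

DT <- data.table(a = c(TRUE, FALSE), b = 1:2)
setindex(DT, a)                      # index only, DT remains unkeyed

# Naming the join column explicitly works with just an index.
with_on <- DT[.(TRUE), on = "a"]
print(with_on)

# Without 'on' there is no key to fall back on, so capture the condition.
without_on <- tryCatch(DT[.(TRUE)], error = function(e) e)
print(inherits(without_on, "error"))
```

Capturing the condition object instead of letting the error propagate keeps the sketch runnable from top to bottom.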
