Clarified setindex vs setkey subsetting behavior (#7047)

venom1204 · MichaelChirico · web-flow · commit fb419f1a672e · 2025-06-09T13:18:42.000-07:00
* updated docs

* updated

* changed directory

* Update man/setkey.Rd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* minor

* setkey.rd

* simplify

* shrink diff

* restore blank line

* simplify

* unclear, and not sure it's correct

* give chunk names

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <chiricom@google.com>
diff --git a/man/data.table.Rd b/man/data.table.Rd
@@ -174,7 +174,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
             \item For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.
             \item From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.
         }
-        See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
+        Note that providing \code{on} is \emph{required} for \code{X[Y]} joins when \code{X} is unkeyed. See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
     }
 
   \item{env}{ List or an environment, passed to \code{\link{substitute2}} for substitution of parameters in \code{i}, \code{j} and \code{by} (or \code{keyby}). Use \code{verbose} to preview constructed expressions. For more details see \href{../doc/datatable-programming.html}{\code{vignette("datatable-programming")}}. }
@@ -298,7 +298,9 @@ DT[, sum(v), by=x][order(x)]   # same but by chaining expressions together
 
 # fast ad hoc row subsets (subsets as joins)
 DT["a", on="x"]                # same as x == "a" but uses binary search (fast)
+                               #   NB: since 'on' is supplied, DT need not be keyed
 DT["a", on=.(x)]               # same, for convenience, no need to quote every column
+                               #   NB: works regardless of whether or not DT is keyed!
 DT[.("a"), on="x"]             # same
 DT[x=="a"]                     # same, single "==" internally optimised to use binary search (fast)
 DT[x!="b" | y!=3]              # not yet optimized, currently vector scan subset
diff --git a/man/setkey.Rd b/man/setkey.Rd
@@ -74,6 +74,14 @@ The sort is \emph{stable}; i.e., the order of ties (if any) is preserved.
 For character vectors, \code{data.table} takes advantage of R's internal global string cache, also exported as \code{\link{chorder}}.
 }
 
+\section{Keys vs. Indices}{
+Setting a key (with \code{setkey}) and setting an index (with \code{setindex}) are similar operations, but they differ in important ways.
+
+Setting a key physically reorders the data in RAM.
+
+Setting an index computes the sort order, but instead of applying the reordering, simply \emph{stores} this computed ordering. That means that multiple indices can coexist, and that the original row order is preserved.
+}
+
 \section{Good practice}{
 In general, it's good practice to use column names rather than numbers. This is
 why \code{setkey} and \code{setkeyv} only accept column names.
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -62,6 +62,26 @@ Secondary indices are similar to `keys` in *data.table*, except for two major di
 
 * There can be more than one secondary index for a data.table (as we will see below).
 
+#### Keyed vs. Indexed Subsetting
+
+While both **keys** and **indices** enable fast binary search subsetting, they differ significantly in usage:
+
+**Keyed subsetting** (implicit column matching)
+
+```{r keyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setkey(DT, a)                # Set key, reordering DT
+DT[.(TRUE)]                  # 'on' is optional; if omitted, the key is used
+```
+
+**Indexed subsetting** (explicit column specification)
+
+```{r unkeyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setindex(DT, a)              # Set index only (no reorder)
+DT[.(TRUE), on = "a"]        # 'on' is required
+```
+
 ### b) Set and get secondary indices
 
 #### -- How can we set the column `origin` as a secondary index in the *data.table* `flights`?
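
Note on the change: the key/index distinction documented in the hunks above can be checked with a short R sketch. This is a minimal illustration rather than part of the patch; the toy table and the names `DT`, `id`, `v`, `order_with_index`, `key_with_index`, `idx` and `unkeyed_fails` are assumptions made for the example.

```r
library(data.table)

# Hypothetical toy table (not taken from the patch).
DT = data.table(id = c("b", "a", "c", "a"), v = 1:4)

# An index only stores the computed ordering; the rows are not moved.
setindex(DT, id)
setindex(DT, v)                  # more than one index can coexist
idx = indices(DT)                # names of the indices currently set
order_with_index = DT$id         # row order as originally created
key_with_index = key(DT)         # no key has been set at this point

# Without a key, a join-style subset has to name the column via 'on'.
DT[.("a"), on = "id"]
# Omitting 'on' here is expected to fail, since DT has no key and
# i shares no column names with DT.
unkeyed_fails = inherits(try(DT[.("a")], silent = TRUE), "try-error")

# Setting a key physically sorts the rows; 'on' may then be omitted.
setkey(DT, id)
DT[.("a")]                       # matched against the key column
```

Comparing `order_with_index` with `DT$id` after `setkey` shows the reordering that only the key performs.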