Merge branch 'macroRemoval' of https://github.com/Rdatatable/data.table into macroRemoval
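
Note on the `src/forder.c` change carried by this merge: the new early return in `flush()` keeps a thread with nothing to flush from reaching a zero-length copy whose source may be a null pointer, which the NEWS entry describes as against the C standard. A minimal standalone sketch of the same guard pattern follows. It is not data.table's code: `append_block`, `buf`, `buf_n` and `buf_alloc` are hypothetical names used only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not data.table's actual data structures. */
static int *buf = NULL;
static size_t buf_n = 0, buf_alloc = 0;

/* Append n ints from src to buf.
 * Returns 0 on success, -1 if memory could not be allocated.
 * src may be NULL when n == 0, so return before memcpy() is reached:
 * the copy would be a no-op anyway, and the guard avoids handing
 * memcpy() a null pointer. */
static int append_block(const int *src, size_t n) {
  if (n == 0) return 0;                 /* nothing to copy */
  size_t newn = buf_n + n;
  if (buf_alloc < newn) {               /* grow only when needed */
    int *tmp = realloc(buf, newn * sizeof *tmp);
    if (!tmp) return -1;
    buf = tmp;
    buf_alloc = newn;
  }
  memcpy(buf + buf_n, src, n * sizeof *src);
  buf_n = newn;
  return 0;
}
```

Calling `append_block(NULL, 0)` leaves the buffer untouched, which is the situation the guard is meant to make safe.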

badasahog · badasahog · commit 4d179491c879 · 2025-06-16T06:44:45.000-04:00
diff --git a/NEWS.md b/NEWS.md
@@ -38,6 +38,8 @@
 
 9. Joins to extended data.frames, e.g. `x[i, col := x.col1 + i.col2]` where `i` is a `tbl`, can use the `x.` and `i.` prefix forms, [#6998](https://github.com/Rdatatable/data.table/issues/6998). Thanks @MichaelChirico for the bug and PR.
 
+10. On a heavily loaded machine, a `forder` thread could try to perform a zero-length copy from a null pointer, which was de facto harmless but violates the C standard and was caught by additional CRAN checks, [#7051](https://github.com/Rdatatable/data.table/issues/7051). Thanks to @helske for the report and @aitap for the PR.
+
 ### NOTES
 
 1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.
diff --git a/man/data.table.Rd b/man/data.table.Rd
@@ -174,7 +174,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
             \item For convenience during interactive scenarios, it is also possible to use \code{.()} syntax as \code{X[Y, on=.(a, b)]}.
             \item From v1.9.8, (non-equi) joins using binary operators \code{>=, >, <=, <} are also possible, e.g., \code{X[Y, on=c("x>=a", "y<=b")]}, or for interactive use as \code{X[Y, on=.(x>=a, y<=b)]}.
         }
-        See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
+        Note that providing \code{on} is \emph{required} for \code{X[Y]} joins when \code{X} is unkeyed. See examples as well as \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{\code{vignette("datatable-secondary-indices-and-auto-indexing")}}.
     }
 
   \item{env}{ List or an environment, passed to \code{\link{substitute2}} for substitution of parameters in \code{i}, \code{j} and \code{by} (or \code{keyby}). Use \code{verbose} to preview constructed expressions. For more details see \href{../doc/datatable-programming.html}{\code{vignette("datatable-programming")}}. }
@@ -298,7 +298,9 @@ DT[, sum(v), by=x][order(x)]   # same but by chaining expressions together
 
 # fast ad hoc row subsets (subsets as joins)
 DT["a", on="x"]                # same as x == "a" but uses binary search (fast)
+                               #   NB: with 'on', this works whether or not DT is keyed
 DT["a", on=.(x)]               # same, for convenience, no need to quote every column
+                               #   NB: likewise works whether or not DT is keyed
 DT[.("a"), on="x"]             # same
 DT[x=="a"]                     # same, single "==" internally optimised to use binary search (fast)
 DT[x!="b" | y!=3]              # not yet optimized, currently vector scan subset
diff --git a/man/last.Rd b/man/last.Rd
@@ -18,6 +18,9 @@ of \code{xts::first} is deployed. }
 \item{\dots}{ Not applicable for \code{data.table} first/last. Any arguments here
 are passed through to \code{xts}'s first/last. }
 }
+\details{
+Note: For zero-length vectors, \code{first(x)} and \code{last(x)} mimic \code{head(x, 1)} and \code{tail(x, 1)} by returning an empty vector instead of \code{NA}. However, unlike \code{head()}/\code{tail()} and base R subsetting (e.g., \code{x[1]}), they do not preserve attributes like names.
+}
 \value{
 If no other arguments are supplied it depends on the type of \code{x}. The first/last item
 of a vector or list. The first/last row of a \code{data.frame} or \code{data.table}.
diff --git a/man/setkey.Rd b/man/setkey.Rd
@@ -74,6 +74,14 @@ The sort is \emph{stable}; i.e., the order of ties (if any) is preserved.
 For character vectors, \code{data.table} takes advantage of R's internal global string cache, also exported as \code{\link{chorder}}.
 }
 
+\section{Keys vs. Indices}{
+Setting a key (with \code{setkey}) and setting an index (with \code{setindex}) are similar operations, but they differ in important ways.
+
+Setting a key physically reorders the data in RAM.
+
+Setting an index computes the sort order, but instead of applying the reordering, simply \emph{stores} this computed ordering. That means that multiple indices can coexist, and that the original row order is preserved.
+}
+
 \section{Good practice}{
 In general, it's good practice to use column names rather than numbers. This is
 why \code{setkey} and \code{setkeyv} only accept column names.
diff --git a/src/forder.c b/src/forder.c
@@ -128,6 +128,8 @@ static void flush(void) {
   if (!retgrp) return;
   int me = omp_get_thread_num();
   int n = gs_thread_n[me];
+  // normally doesn't happen, can be encountered under heavy load, #7051
+  if (!n) return; // # nocov
   int newn = gs_n + n;
   if (gs_alloc < newn) {
     gs_alloc = (newn < nrow/3) ? (1+(newn*2)/4096)*4096 : nrow;
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -62,6 +62,26 @@ Secondary indices are similar to `keys` in *data.table*, except for two major di
 
 * There can be more than one secondary index for a data.table (as we will see below).
 
+#### Keyed vs. Indexed Subsetting
+
+While both **keys** and **indices** enable fast binary search subsetting, they differ significantly in usage:
+
+**Keyed subsetting** (implicit column matching)
+
+```{r keyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setkey(DT, a)                # Set key, reordering DT
+DT[.(TRUE)]                  # 'on' is optional; if omitted, the key is used
+```
+
+**Indexed subsetting** (explicit column specification)
+
+```{r unkeyed_operations}
+DT = data.table(a = c(TRUE, FALSE), b = 1:2)
+setindex(DT, a)              # Set index only (no reorder)
+DT[.(TRUE), on = "a"]        # 'on' is required
+```
+
 ### b) Set and get secondary indices
 
 #### -- How can we set the column `origin` as a secondary index in the *data.table* `flights`?