Merge remote-tracking branch 'origin/master' into consistentcolreplacement

joshhwuu · joshhwuu · commit ac8ce38e1c90 · 2024-06-19T17:31:52.000-07:00
diff --git a/NEWS.md b/NEWS.md
@@ -98,7 +98,13 @@
 
 15. `dcast()` now issues a warning when `fun.aggregate` is used but not provided by the user. `fun.aggregate` defaults to `length` in this case. Previously, only a message was issued. However, relying on this default often signals unexpected duplicates in the data. Therefore, a stricter class of signal was deemed more appropriate, [#5386](https://github.com/Rdatatable/data.table/issues/5386). The warning is classed as `dt_missing_fun_aggregate_warning`, allowing for more targeted handling in user code. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.
 
-16. Assigning `list(NULL)` to a list column now replaces the column with `list(NULL)`, instead of deleting the column [#5558](https://github.com/Rdatatable/data.table/issues/5558). This behavior is now consistent with base `data.frame`. Thanks @tdhock for the report and @joshhwuu for the fix.
+16. `print.data.table` gains a new argument `show.indices` and a new option `datatable.show.indices`, which allow the user to print a `data.table`'s indices as columns without having to modify the `data.table` itself. Thanks @MichaelChirico for the report and @joshhwuu for the PR.
+
+17. Assigning `list(NULL)` to a list column now replaces the column with `list(NULL)`, instead of deleting the column [#5558](https://github.com/Rdatatable/data.table/issues/5558). This behavior is now consistent with base `data.frame`. Thanks @tdhock for the report and @joshhwuu for the fix.
+
+## TRANSLATIONS
+
+1. Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, [#6172](https://github.com/Rdatatable/data.table/issues/6172). Thanks @trafficfan for the report and @MichaelChirico for the fix.
 
 # data.table [v1.15.0](https://github.com/Rdatatable/data.table/milestone/29)  (30 Jan 2024)
 
diff --git a/R/onLoad.R b/R/onLoad.R
@@ -79,6 +79,7 @@
        "datatable.print.colnames"="'auto'",    # for print.data.table
        "datatable.print.keys"="TRUE",          # for print.data.table
        "datatable.print.trunc.cols"="FALSE",   # for print.data.table
+       "datatable.show.indices"="FALSE",       # for print.data.table
        "datatable.allow.cartesian"="FALSE",    # datatable.<argument name>
        "datatable.dfdispatchwarn"="TRUE",                   # not a function argument
        "datatable.warnredundantby"="TRUE",                  # not a function argument
diff --git a/R/print.data.table.R b/R/print.data.table.R
@@ -7,6 +7,7 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
                col.names=getOption("datatable.print.colnames"),
                print.keys=getOption("datatable.print.keys"),
                trunc.cols=getOption("datatable.print.trunc.cols"),
+               show.indices=getOption("datatable.show.indices"),
                quote=FALSE,
                na.print=NULL,
                timezone=FALSE, ...) {
@@ -64,15 +65,28 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
     }
     return(invisible(x))
   }
+  if (show.indices) {
+    if (is.null(indices(x))) {
+      show.indices = FALSE
+    } else {
+      index_dt <- as.data.table(attributes(attr(x, 'index')))
+      print_names <- paste0("index", if (ncol(index_dt) > 1L) seq_len(ncol(index_dt)) else "", ":", sub("^__", "", names(index_dt)))
+      setnames(index_dt, print_names)
+    }
+  }
   n_x = nrow(x)
   if ((topn*2L+1L)<n_x && (n_x>nrows || !topnmiss)) {
     toprint = rbindlist(list(head(x, topn), tail(x, topn)), use.names=FALSE)  # no need to match names because head and tail of same x, and #3306
     rn = c(seq_len(topn), seq.int(to=n_x, length.out=topn))
     printdots = TRUE
+    idx = c(seq_len(topn), seq(to=nrow(x), length.out=topn))
+    toprint = x[idx, ]
+    if (show.indices) toprint = cbind(toprint, index_dt[idx, ])
   } else {
     toprint = x
     rn = seq_len(n_x)
     printdots = FALSE
+    if (show.indices) toprint = cbind(toprint, index_dt)
   }
   toprint=format.data.table(toprint, na.encode=FALSE, timezone = timezone, ...)  # na.encode=FALSE so that NA in character cols print as <NA>
   require_bit64_if_needed(x)
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
diff --git a/man/assign.Rd b/man/assign.Rd
@@ -87,7 +87,7 @@ While in most cases standard and functional form of \code{:=} are interchangeabl
 Since \code{[.data.table} incurs overhead to check the existence and type of arguments (for example), \code{set()} provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a \code{for} loop. See examples. \code{:=} is more powerful and flexible than \code{set()} because \code{:=} is intended to be combined with \code{i} and \code{by} in single queries on large datasets.
 }
 \note{
-    \code{DT[a > 4, b := c]} is different from \code{DT[a > 4][, b := c]}. The first expression updates (or adds) column \code{b} with the value \code{c} on those rows where \code{a > 4} evaluates to \code{TRUE}. \code{X} is updated \emph{by reference}, therefore no assignment needed.
+    \code{DT[a > 4, b := c]} is different from \code{DT[a > 4][, b := c]}. The first expression updates (or adds) column \code{b} with the value \code{c} on those rows where \code{a > 4} evaluates to \code{TRUE}. \code{DT} is updated \emph{by reference}, therefore no assignment is needed. Note that this does not apply when \code{i} is missing, i.e. \code{DT[]}.
 
     The second expression on the other hand updates a \emph{new} \code{data.table} that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: \code{ans <- DT[a > 4][, b := c]}.
 }
@@ -142,6 +142,21 @@ sq_col_idx = grep('d$', names(DT))
 DT[ , (sq_col_idx) := lapply(.SD, dnorm),
    .SDcols = sq_col_idx]
 
+# Examples using `set` function
+## Set value for single cell
+set(DT, 1L, "b", 10L)
+## Set values for multiple columns in a specific row
+set(DT, 2L, c("b", "d"), list(20L, 30L))
+## Set values by column indices
+set(DT, 3L, c(2L, 4L), list(40L, 50L))
+## Set value for an entire column without specifying rows
+set(DT, j = "b", value = 100L)
+set(DT, NULL, "b", 100L) # equivalent
+## Set values for multiple columns without specifying rows
+set(DT, j = c("b", "d"), value = list(200L, 300L))
+## Set values for multiple columns with multiple specified rows
+set(DT, c(1L, 3L), c("b", "d"), value = list(500L, 800L))
+
 \dontrun{
 # Speed example:
 
diff --git a/man/print.data.table.Rd b/man/print.data.table.Rd
@@ -25,6 +25,7 @@
     col.names=getOption("datatable.print.colnames"),    # default: "auto"
     print.keys=getOption("datatable.print.keys"),       # default: TRUE
     trunc.cols=getOption("datatable.print.trunc.cols"), # default: FALSE
+    show.indices=getOption("datatable.show.indices"),   # default: FALSE
     quote=FALSE,
     na.print=NULL,
     timezone=FALSE, \dots)
@@ -46,6 +47,7 @@
   \item{col.names}{ One of three flavours for controlling the display of column names in output. \code{"auto"} includes column names above the data, as well as below the table if \code{nrow(x) > 20}. \code{"top"} excludes this lower register when applicable, and \code{"none"} suppresses column names altogether (as well as column classes if \code{class = TRUE}. }
   \item{print.keys}{ If \code{TRUE}, any \code{\link{key}} and/or \code{\link[=indices]{index}} currently assigned to \code{x} will be printed prior to the preview of the data. }
   \item{trunc.cols}{ If \code{TRUE}, only the columns that can be printed in the console without wrapping the columns to new lines will be printed (similar to \code{tibbles}). }
+  \item{show.indices}{ If \code{TRUE}, indices will be printed as columns alongside \code{x}. }
   \item{quote}{ If \code{TRUE}, all output will appear in quotes, as in \code{print.default}. }
   \item{timezone}{ If \code{TRUE}, time columns of class POSIXct or POSIXlt will be printed with their timezones (if attribute is available). }
   \item{na.print}{ The string to be printed in place of \code{NA} values, as in \code{print.default}. }
@@ -116,6 +118,19 @@
   x = data.table(z = c(1 + 3i, 2 - 1i, pi + 2.718i))
   print(x)
 
+  old = options(datatable.show.indices=TRUE)
+  NN = 200
+  set.seed(2024)
+  DT = data.table(
+    grp1 = sample(100, NN, TRUE),
+    grp2 = sample(90, NN, TRUE),
+    grp3 = sample(80, NN, TRUE)
+  )
+  setkey(DT, grp1, grp2)
+  setindex(DT, grp1, grp3)
+  print(DT)
+  options(old)
+
   iris = as.data.table(iris)
   iris_agg = iris[ , .(reg = list(lm(Sepal.Length ~ Petal.Length))), by = Species]
   format_list_item.lm = function(x, ...) sprintf('<lm:\%s>', format(x$call$formula))
diff --git a/po/R-zh_CN.po b/po/R-zh_CN.po
@@ -371,7 +371,7 @@ msgstr "变量 '%s' 并没有存在于调用环境中。之所以在调用环境
 msgid ""
 "Both '%1$s' and '..%1$s' exist in calling scope. Please remove the '..%1$s' "
 "variable in calling scope for clarity."
-msgstr "'%1%s'和'..%1$s'均在当前调用环境中。为清晰起见，请移除在调用环境中名为"
+msgstr "'%1$s'和'..%1$s'均在当前调用环境中。为清晰起见，请移除在调用环境中名为"
 "..%1$s' 的变量。"
 
 #: data.table.R:288
diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd
@@ -250,6 +250,49 @@ The function `length()` requires an input argument. We just need to compute the
 
 This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it.
 
+### g) Handle non-existing elements in `i`
+
+#### -- What happens when querying for non-existing elements?
+
+When querying a `data.table` for elements that do not exist, the behavior differs based on the method used.
+
+```r
+setkeyv(flights, "origin")
+```
+
+* **Key-based subsetting: `dt["d"]`**
+
+  This performs a right join on the key column (`x` in the generic form above, `origin` in the example below), returning a row with the queried value (`"d"`) in the key column and `NA` in all other columns. `setkeyv` sorts the table by the specified keys and marks it as sorted, which enables binary search for efficient subsetting.
+
+  ```r
+  flights["XYZ"]
+  # Returns:
+  #    origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum ...
+  # 1:    XYZ   NA    NA  NA       NA             NA        NA       NA             NA        NA      NA     NA      NA ...
+  ```
+
+* **Logical subsetting: `dt[x == "d"]`**
+  
+  This performs a standard subset operation that does not find any matching rows and thus returns an empty `data.table`.
+
+  ```r
+  flights[origin == "XYZ"]
+  # Returns:
+  # Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
+  ```
+
+* **Exact match using `nomatch=NULL`**
+  
+  To return only the rows that actually match, with no `NA` row for non-existing elements, use `nomatch=NULL`:
+
+  ```r
+  flights["XYZ", nomatch=NULL]
+  # Returns:
+  # Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
+  ```
+
+Understanding these behaviors can help prevent confusion when dealing with non-existing elements in your data.
+
 #### Special symbol `.N`: {#special-N}
 
 `.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
@@ -269,7 +312,7 @@ ans
 
 We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.
 
-### g) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
+### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
 
 If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).
 
diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd
@@ -78,6 +78,8 @@ A *shallow* copy is just a copy of the vector of column pointers (corresponding
 
 A *deep* copy on the other hand copies the entire data to another location in memory.
 
+When subsetting a *data.table* using `i` (e.g., `DT[1:10]`), a *deep* copy is made. However, when `i` is not provided or equals `TRUE`, a *shallow* copy is made.
+
 #
 With *data.table's* `:=` operator, absolutely no copies are made in *both* (1) and (2), irrespective of R version you are using. This is because `:=` operator updates *data.table* columns *in-place* (by reference).
 
