Commit ac8ce38

Merge remote-tracking branch 'origin/master' into consistentcolreplacement
2 parents: ee0a462 + 642d51b

9 files changed: +264 -84 lines changed

NEWS.md

Lines changed: 7 additions & 1 deletion
@@ -98,7 +98,13 @@

 15. `dcast()` now issues a warning when `fun.aggregate` is used but not provided by the user. `fun.aggregate` defaults to `length` in this case. Previously, only a message was issued. However, relying on this default often signals unexpected duplicates in the data. Therefore, a stricter class of signal was deemed more appropriate, [#5386](https://github.com/Rdatatable/data.table/issues/5386). The warning is classed as `dt_missing_fun_aggregate_warning`, allowing for more targeted handling in user code. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.

-16. Assigning `list(NULL)` to a list column now replaces the column with `list(NULL)`, instead of deleting the column [#5558](https://github.com/Rdatatable/data.table/issues/5558). This behavior is now consistent with base `data.frame`. Thanks @tdhock for the report and @joshhwuu for the fix.
+16. `print.data.table` gains new argument `show.indices` and option `datatable.show.indices` that allow the user to print a `data.table`'s indices as columns without having to modify the `data.table` itself. Thanks @MichaelChirico for the report and @joshhwuu for the PR.
+
+17. Assigning `list(NULL)` to a list column now replaces the column with `list(NULL)`, instead of deleting the column [#5558](https://github.com/Rdatatable/data.table/issues/5558). This behavior is now consistent with base `data.frame`. Thanks @tdhock for the report and @joshhwuu for the fix.
+
+## TRANSLATIONS
+
+1. Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, [#6172](https://github.com/Rdatatable/data.table/issues/6172). Thanks @trafficfan for the report and @MichaelChirico for the fix.

 # data.table [v1.15.0](https://github.com/Rdatatable/data.table/milestone/29) (30 Jan 2024)

R/onLoad.R

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@
 "datatable.print.colnames"="'auto'", # for print.data.table
 "datatable.print.keys"="TRUE", # for print.data.table
 "datatable.print.trunc.cols"="FALSE", # for print.data.table
+"datatable.show.indices"="FALSE", # for print.data.table
 "datatable.allow.cartesian"="FALSE", # datatable.<argument name>
 "datatable.dfdispatchwarn"="TRUE", # not a function argument
 "datatable.warnredundantby"="TRUE", # not a function argument

R/print.data.table.R

Lines changed: 14 additions & 0 deletions
@@ -7,6 +7,7 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
 col.names=getOption("datatable.print.colnames"),
 print.keys=getOption("datatable.print.keys"),
 trunc.cols=getOption("datatable.print.trunc.cols"),
+show.indices=getOption("datatable.show.indices"),
 quote=FALSE,
 na.print=NULL,
 timezone=FALSE, ...) {
@@ -64,15 +65,28 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
     }
     return(invisible(x))
   }
+  if (show.indices) {
+    if (is.null(indices(x))) {
+      show.indices = FALSE
+    } else {
+      index_dt <- as.data.table(attributes(attr(x, 'index')))
+      print_names <- paste0("index", if (ncol(index_dt) > 1L) seq_len(ncol(index_dt)) else "", ":", sub("^__", "", names(index_dt)))
+      setnames(index_dt, print_names)
+    }
+  }
   n_x = nrow(x)
   if ((topn*2L+1L)<n_x && (n_x>nrows || !topnmiss)) {
     toprint = rbindlist(list(head(x, topn), tail(x, topn)), use.names=FALSE) # no need to match names because head and tail of same x, and #3306
     rn = c(seq_len(topn), seq.int(to=n_x, length.out=topn))
     printdots = TRUE
+    idx = c(seq_len(topn), seq(to=nrow(x), length.out=topn))
+    toprint = x[idx, ]
+    if (show.indices) toprint = cbind(toprint, index_dt[idx, ])
   } else {
     toprint = x
     rn = seq_len(n_x)
     printdots = FALSE
+    if (show.indices) toprint = cbind(toprint, index_dt)
   }
   toprint=format.data.table(toprint, na.encode=FALSE, timezone = timezone, ...) # na.encode=FALSE so that NA in character cols print as <NA>
   require_bit64_if_needed(x)

inst/tests/tests.Rraw

Lines changed: 164 additions & 80 deletions
Large diffs are not rendered by default.

man/assign.Rd

Lines changed: 16 additions & 1 deletion
@@ -87,7 +87,7 @@ While in most cases standard and functional form of \code{:=} are interchangeabl
 Since \code{[.data.table} incurs overhead to check the existence and type of arguments (for example), \code{set()} provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a \code{for} loop. See examples. \code{:=} is more powerful and flexible than \code{set()} because \code{:=} is intended to be combined with \code{i} and \code{by} in single queries on large datasets.
 }
 \note{
-\code{DT[a > 4, b := c]} is different from \code{DT[a > 4][, b := c]}. The first expression updates (or adds) column \code{b} with the value \code{c} on those rows where \code{a > 4} evaluates to \code{TRUE}. \code{X} is updated \emph{by reference}, therefore no assignment needed.
+\code{DT[a > 4, b := c]} is different from \code{DT[a > 4][, b := c]}. The first expression updates (or adds) column \code{b} with the value \code{c} on those rows where \code{a > 4} evaluates to \code{TRUE}. \code{X} is updated \emph{by reference}, therefore no assignment needed. Note that this does not apply when \code{i} is missing, i.e. \code{DT[]}.

 The second expression on the other hand updates a \emph{new} \code{data.table} that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: \code{ans <- DT[a > 4][, b := c]}.
 }
@@ -142,6 +142,21 @@ sq_col_idx = grep('d$', names(DT))
 DT[ , (sq_col_idx) := lapply(.SD, dnorm),
     .SDcols = sq_col_idx]

+# Examples using `set` function
+## Set value for single cell
+set(DT, 1L, "b", 10L)
+## Set values for multiple columns in a specific row
+set(DT, 2L, c("b", "d"), list(20L, 30L))
+## Set values by column indices
+set(DT, 3L, c(2L, 4L), list(40L, 50L))
+## Set value for an entire column without specifying rows
+set(DT, j = "b", value = 100L)
+set(DT, NULL, "b", 100L) # equivalent
+## Set values for multiple columns without specifying rows
+set(DT, j = c("b", "d"), value = list(200L, 300L))
+## Set values for multiple columns with multiple specified rows.
+set(DT, c(1L, 3L), c("b", "d"), value = list(500L, 800L))
+
 \dontrun{
 # Speed example:

man/print.data.table.Rd

Lines changed: 15 additions & 0 deletions
@@ -25,6 +25,7 @@
 col.names=getOption("datatable.print.colnames"), # default: "auto"
 print.keys=getOption("datatable.print.keys"), # default: TRUE
 trunc.cols=getOption("datatable.print.trunc.cols"), # default: FALSE
+show.indices=getOption("datatable.show.indices"), # default: FALSE
 quote=FALSE,
 na.print=NULL,
 timezone=FALSE, \dots)
@@ -46,6 +47,7 @@
 \item{col.names}{ One of three flavours for controlling the display of column names in output. \code{"auto"} includes column names above the data, as well as below the table if \code{nrow(x) > 20}. \code{"top"} excludes this lower register when applicable, and \code{"none"} suppresses column names altogether (as well as column classes if \code{class = TRUE}). }
 \item{print.keys}{ If \code{TRUE}, any \code{\link{key}} and/or \code{\link[=indices]{index}} currently assigned to \code{x} will be printed prior to the preview of the data. }
 \item{trunc.cols}{ If \code{TRUE}, only the columns that can be printed in the console without wrapping the columns to new lines will be printed (similar to \code{tibbles}). }
+\item{show.indices}{ If \code{TRUE}, indices will be printed as columns alongside \code{x}. }
 \item{quote}{ If \code{TRUE}, all output will appear in quotes, as in \code{print.default}. }
 \item{timezone}{ If \code{TRUE}, time columns of class POSIXct or POSIXlt will be printed with their timezones (if attribute is available). }
 \item{na.print}{ The string to be printed in place of \code{NA} values, as in \code{print.default}. }
@@ -116,6 +118,19 @@
 x = data.table(z = c(1 + 3i, 2 - 1i, pi + 2.718i))
 print(x)

+old = options(datatable.show.indices=TRUE)
+NN = 200
+set.seed(2024)
+DT = data.table(
+  grp1 = sample(100, NN, TRUE),
+  grp2 = sample(90, NN, TRUE),
+  grp3 = sample(80, NN, TRUE)
+)
+setkey(DT, grp1, grp2)
+setindex(DT, grp1, grp3)
+print(DT)
+options(old)
+
 iris = as.data.table(iris)
 iris_agg = iris[ , .(reg = list(lm(Sepal.Length ~ Petal.Length))), by = Species]
 format_list_item.lm = function(x, ...) sprintf('<lm:\%s>', format(x$call$formula))

po/R-zh_CN.po

Lines changed: 1 addition & 1 deletion
@@ -371,7 +371,7 @@ msgstr "变量 '%s' 并没有存在于调用环境中。之所以在调用环境
 msgid ""
 "Both '%1$s' and '..%1$s' exist in calling scope. Please remove the '..%1$s' "
 "variable in calling scope for clarity."
-msgstr "'%1%s'和'..%1$s'均在当前调用环境中。为清晰起见,请移除在调用环境中名为"
+msgstr "'%1$s'和'..%1$s'均在当前调用环境中。为清晰起见,请移除在调用环境中名为"
 "..%1$s' 的变量。"

 #: data.table.R:288

vignettes/datatable-intro.Rmd

Lines changed: 44 additions & 1 deletion
@@ -250,6 +250,49 @@ The function `length()` requires an input argument. We just need to compute the

 This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it.

+### g) Handle non-existing elements in `i`
+
+#### -- What happens when querying for non-existing elements?
+
+When querying a `data.table` for elements that do not exist, the behavior differs based on the method used.
+
+```r
+setkeyv(flights, "origin")
+```
+
+* **Key-based subsetting: `dt["d"]`**
+
+This performs a right join on the key column `x`, returning one row that holds `d` in the key column and `NA` in every other column. `setkeyv` sorts the table by the specified keys and marks it as sorted, which enables binary search for efficient subsetting.
+
+```r
+flights["XYZ"]
+# Returns:
+# origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum ...
+# 1: XYZ NA NA NA NA NA NA NA NA NA NA NA NA ...
+```
+
+* **Logical subsetting: `dt[x == "d"]`**
+
+This performs a standard subset operation that does not find any matching rows and thus returns an empty `data.table`.
+
+```r
+flights[origin == "XYZ"]
+# Returns:
+# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
+```
+
+* **Exact match using `nomatch=NULL`**
+
+To drop non-matching elements instead of returning a row of `NA`s, use `nomatch=NULL`:
+
+```r
+flights["XYZ", nomatch=NULL]
+# Returns:
+# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...
+```
+
+Understanding these behaviors can help prevent confusion when dealing with non-existing elements in your data.
+
 #### Special symbol `.N`: {#special-N}

 `.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
@@ -269,7 +312,7 @@ ans

 We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.

-### g) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
+### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}

 If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).

vignettes/datatable-reference-semantics.Rmd

Lines changed: 2 additions & 0 deletions
@@ -78,6 +78,8 @@ A *shallow* copy is just a copy of the vector of column pointers (corresponding

 A *deep* copy on the other hand copies the entire data to another location in memory.

+When subsetting a *data.table* using `i` (e.g., `DT[1:10]`), a *deep* copy is made. However, when `i` is not provided or equals `TRUE`, a *shallow* copy is made.
+
 #
 With *data.table's* `:=` operator, absolutely no copies are made in *both* (1) and (2), irrespective of R version you are using. This is because `:=` operator updates *data.table* columns *in-place* (by reference).
