Merge branch 'master' into litedown

ben-schwen · ben-schwen · commit e1329ddd4406 · 2025-04-21T19:45:54.000+02:00
diff --git a/.github/CODE_OF_CONDUCT.md b/.github/CODE_OF_CONDUCT.md
@@ -24,7 +24,7 @@ Project members with the Committer role or the CRAN Maintainer role are pledged
 
 Those who prefer to report in a way that is independent of the current Committers and Maintainer may instead contact the Community Engagement Coordinator by e-mailing [r.data.table\@gmail.com](mailto:r.data.table@gmail.com). Messages sent to this e-mail address will be visible only to the current Community Engagement Coordinator, a position always held by an individual who is not a Committer or CRAN Maintainer of the package.
 
-The current Committers are Toby Dylan Hocking (@tdhock), Matt Dowle (@mattdowle), Arun Srinivasan (@arunsrinivasan), Jan Gorecki (@jangorecki), Michael Chirico (@MichaelChirico), and Benjamin Schwendinger (@ben-schwen).
+The current Committers are Toby Dylan Hocking (@tdhock), Matt Dowle (@mattdowle), Arun Srinivasan (@arunsrinivasan), Jan Gorecki (@jangorecki), Michael Chirico (@MichaelChirico), Benjamin Schwendinger (@ben-schwen), and Ivan Krylov (@aitap).
 
 The current CRAN Maintainer is Tyson Barrett (@tysonstanley).
 
diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml
@@ -27,8 +27,9 @@ jobs:
           # GHA does run these jobs concurrently but even so reducing the load seems like a good idea.
           - {os: windows-latest, r: 'devel'}
           # - {os: macOS-latest, r: 'release'}    # test-coverage.yaml uses macOS
-          - {os: ubuntu-20.04, r: 'release', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"}
-          # - {os: ubuntu-20.04,   r: 'devel', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest", http-user-agent: "R/4.1.0 (ubuntu-20.04) R (4.1.0 x86_64-pc-linux-gnu x86_64 linux-gnu) on GitHub Actions" }
+          # TODO(remotes>2.5.0): Use 24.04[noble?]
+          - {os: ubuntu-22.04, r: 'release', rspm: "https://packagemanager.rstudio.com/cran/__linux__/jammy/latest"}
+          # - {os: ubuntu-22.04,   r: 'devel', rspm: "https://packagemanager.rstudio.com/cran/__linux__/jammy/latest", http-user-agent: "R/4.1.0 (ubuntu-22.04) R (4.1.0 x86_64-pc-linux-gnu x86_64 linux-gnu) on GitHub Actions" }
           #   GLCI covers R-devel; no need to delay contributors in dev due to changes in R-devel in recent days
 
     env:
@@ -64,7 +65,7 @@ jobs:
           while read -r cmd
           do
             eval sudo $cmd
-          done < <(Rscript -e 'writeLines(remotes::system_requirements("ubuntu", "20.04"))')
+          done < <(Rscript -e 'writeLines(remotes::system_requirements("ubuntu", "22.04"))')
 
       - name: Install dependencies
         run: |
diff --git a/CODEOWNERS b/CODEOWNERS
@@ -23,6 +23,10 @@
 /src/programming.c @jangorecki
 /vignettes/datatable-programming.Rmd @jangorecki
 
+# roll-up & setops
+/R/groupingsets.R @jangorecki
+/R/setops.R @jangorecki
+
 # GForce groupby
 /src/gsumm.c @ben-schwen
 # datetime classes
diff --git a/NEWS.md b/NEWS.md
@@ -6,6 +6,10 @@
 
 1. New `sort_by()` method for data.tables, [#6662](https://github.com/Rdatatable/data.table/issues/6662). It uses `forder()` to improve upon the data.frame method and also match `DT[order(...)]` behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
 
+2. `melt()` now supports using `patterns()` with `id.vars`, [#6867](https://github.com/Rdatatable/data.table/issues/6867). Thanks to Toby Dylan Hocking for the suggestion and PR.
+
+3. `print.data.table()` now also shows column classes at the bottom of tables with more than 20 rows when `class=TRUE` and `col.names="auto"` (the default), [#6902](https://github.com/Rdatatable/data.table/issues/6902). This follows the same behavior as column names at the bottom, making it easier to see column types for large tables without scrolling back to the top. Thanks to @TimTaylor for the suggestion and @Mukulyadav2004 for the PR.
+
 ## BUG FIXES
 
 1. Custom binary operators from the `lubridate` package now work with objects of class `IDate` as with a `Date` subclass, [#6839](https://github.com/Rdatatable/data.table/issues/6839). Thanks @emallickhossain for the report and @aitap for the fix.
@@ -14,10 +18,15 @@
 
 3. `fread(keepLeadingZeros=TRUE)` now correctly parses dates with components with leading zeros as dates instead of strings, [#6851](https://github.com/Rdatatable/data.table/issues/6851). Thanks @TurnaevEvgeny for the report and @ben-schwen for the fix.
 
-4. `as.data.table()` now properly handles keys: specifying keys sets them, omitting keys preserves existing ones, and setting `key=NULL` clears them, [#6859](https://github.com/Rdatatable/data.table/issues/6859). Thanks @brookslogan for the report and @Mukulyadav2004 for the fix.
+4. `as.data.table()` now properly handles keys: specifying keys sets them, omitting keys preserves existing ones, and setting `key=NULL` clears them. Additionally, `keep.rownames` is now consistently passed to `as.data.table(x, keep.rownames)`, [#6859](https://github.com/Rdatatable/data.table/issues/6859). Thanks @brookslogan for the report and @Mukulyadav2004 for the fix.
 
 5. `as.data.table()` on `x` avoids an infinite loop if the output of the corresponding `as.data.frame()` method has the same class as the input, [#6874](https://github.com/Rdatatable/data.table/issues/6874). Concretely, we had `class(x) = c('foo', 'data.frame')` and `class(as.data.frame(x)) = c('foo', 'data.frame')`, so `as.data.frame.foo` wound up getting called repeatedly. Thanks @matschmitz for the report and @ben-schwen for the fix.
 
+6. By-reference sub-assignments to factor columns now match the levels in UTF-8, preventing their duplication when the same level exists in different encodings, [#6886](https://github.com/Rdatatable/data.table/issues/6886). Thanks @iagogv3 for the report and @aitap for the fix.
+
+7. `fwrite()` now avoids a crash when translating strings into a different encoding, [#6883](https://github.com/Rdatatable/data.table/issues/6883). Thanks @filipemsc for the report and @aitap for the fix.
+
+
 ## NOTES
 
 1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.
diff --git a/R/fmelt.R b/R/fmelt.R
@@ -182,13 +182,17 @@ melt.data.table = function(data, id.vars, measure.vars, variable.name = "variabl
        value.name = "value", ..., na.rm = FALSE, variable.factor = TRUE, value.factor = FALSE,
        verbose = getOption("datatable.verbose")) {
   if (!is.data.table(data)) stopf("'data' must be a data.table")
-  if (missing(id.vars)) id.vars=NULL
-  if (missing(measure.vars)) measure.vars = NULL
-  measure.sub = substitute(measure.vars)
-  if (is.call(measure.sub)) {
-    eval.result = eval_with_cols(measure.sub, names(data))
-    if (!is.null(eval.result)) {
-      measure.vars = eval.result
+  for(type.vars in c("id.vars","measure.vars")){
+    sub.lang <- substitute({
+      if (missing(VAR)) VAR=NULL
+      substitute(VAR)
+    }, list(VAR=as.symbol(type.vars)))
+    sub.result = eval(sub.lang)
+    if (is.call(sub.result)) {
+      eval.result = eval_with_cols(sub.result, names(data))
+      if (!is.null(eval.result)) {
+        assign(type.vars, eval.result)
+      }
     }
   }
   if (is.list(measure.vars)) {
diff --git a/R/fwrite.R b/R/fwrite.R
@@ -111,6 +111,15 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
   }
   # nocov end
   file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
+  # pre-encode any strings or factor levels to avoid translateChar trying to allocate from OpenMP threads
+  if (encoding %chin% c("UTF-8", "native")) {
+    enc = switch(encoding, "UTF-8" = enc2utf8, "native" = enc2native)
+    x = lapply(x, function(x) {
+      if (is.character(x)) x = enc(x)
+      if (is.factor(x)) levels(x) = enc(levels(x))
+      x
+    })
+  }
   .Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
         row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
         showProgress, is_gzip, compressLevel, bom, yaml, verbose, encoding)
diff --git a/R/print.data.table.R b/R/print.data.table.R
@@ -142,7 +142,11 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
   if (nrow(toprint)>20L && col.names == "auto")
     # repeat colnames at the bottom if over 20 rows so you don't have to scroll up to see them
     #   option to shut this off per request of Oleg Bondar on SO, #1482
-    toprint = rbind(toprint, matrix(if (quote) old else colnames(toprint), nrow=1L)) # fixes bug #97
+    toprint = rbind(
+      toprint,
+      matrix(if (quote) old else colnames(toprint), nrow=1L), # see #97
+      if (isTRUE(class)) matrix(abbs, nrow=1L) # #6902
+    )
   print_default(toprint)
   invisible(x)
 }
diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -2,6 +2,7 @@ url: https://rdatatable.gitlab.io/data.table
 
 template:
   bootstrap: 5
+  light-switch: true
 
 development:
   version_tooltip: "Development version"
@@ -18,7 +19,7 @@ home:
 navbar:
   structure:
     left:  [home, introduction, articles, news, benchmarks, presentations, communityarticles, reference]
-    right: [github]
+    right: [search, github, lightswitch]
   components:
     home:
       icon: fas fa-home fa-lg
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -2832,9 +2832,8 @@ test(944.1, DT[, foo:=NULL], DT, warning="Tried to assign NULL to column 'foo',
 test(944.2, DT[,a:=1L], data.table(a=1L))  # can now add columns to an empty data.table from v1.12.2
 test(944.3, DT[,aa:=NULL], data.table(a=1L), warning="Tried to assign NULL to column 'aa', but this column does not exist to remove")
 test(944.4, DT[,a:=NULL], data.table(NULL))
-if (base::getRversion() >= "3.4.0") {
-  test(944.5, typeof(structure(NULL, class=c("data.table","data.frame"))), 'list', warning="deprecated, as NULL cannot have attributes")  # R warns which is good and we like
-}
+# 944.5 used to test base R behaviour regarding structure(NULL, ...), which changed from warning to error in 4.6.0 and isn't used in data.table.
+
 DT = data.table(a=numeric())
 test(945, DT[,b:=a+1], data.table(a=numeric(),b=numeric()))
 
@@ -3221,6 +3220,10 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x))
   test(1035.12, attr(melt(DT, id.vars=1:2)$x, "foo"), "bla1")
   test(1035.13, attr(melt(DT, id.vars=1:2)$y, "bar"), 1:4)
 
+  # issue #6867 - id.vars=patterns().
+  DT=data.table(x_long=0, x_short=0, z=0, y1=1, y2=2)
+  test(1035.131, melt(DT, measure.vars=patterns("y"), id.vars=patterns("x")), data.table(x_long=0, x_short=0, variable=factor(c("y1","y2")), value=c(1,2)))
+
   # bug #699 - melt segfaults when vars are not in dt; was test 1316
   x = data.table(a=c(1,2),b=c(2,3),c=c(3,4))
   test(1035.14, melt(x, id.vars="d"), error="One or more values")
@@ -21105,7 +21108,36 @@ test(2309.06, key(as.data.table(DT, key="a")), "a")
 test(2309.07, key(as.data.table(DT)), NULL)          
 test(2309.08, key(as.data.table(DT, key=NULL)), NULL) 
 
+# as.data.table(x, keep.rownames=TRUE) keeps rownames for class(x)==c("*", "data.frame")
+df = structure(list(i = 1:2), class = c("tbl", "data.frame"), row.names = c("a","b"))
+test(2309.09, as.data.table(df, keep.rownames=TRUE), data.table(rn = c("a","b"), i=1:2))
+
 # as.data.frame(x) does not reset class(x) to "data.frame" #6874
 as.data.frame.no.reset = function(x) x
 DF = structure(list(a = 1:2), class = c("data.frame", "no.reset"), row.names = c(NA, -2L))
 test(2310.01, as.data.table(DF), data.table(a=1:2))
+
+# memrecycle() did not consider string encodings for factor levels #6886
+DT = data.table(factor(rep("\uf8", 3)))
+# identical() to V1's only level but stored in a different CHARSXP
+samelevel = iconv(levels(DT$V1), from = "UTF-8", to = "latin1")
+DT[1, V1 := samelevel]
+test(2311.1, nlevels(DT$V1), 1L) # used to be 2
+DT[1, V1 := factor("a", levels = c("a", samelevel))]
+test(2311.2, nlevels(DT$V1), 2L) # used to be 3
+
+# avoid translateChar*() in OpenMP threads, #6883
+DF = list(rep(iconv("\uf8", from = "UTF-8", to = "latin1"), 2e6))
+test(2312, fwrite(DF, nullfile(), encoding = "UTF-8", nThread = 2L), NULL)
+
+# avoid memcpy of 0-length inputs
+test(2313,
+     melt(data.table(a=numeric(), b=numeric(), c=numeric()), id.vars=c('a', 'b')),
+     data.table(a=numeric(), b=numeric(), variable=factor(levels='c'), value=numeric()))
+
+# Testing column footer display with col.names options in print.data.table #6902
+dt = data.table(id = 1:25)
+# Test with class=TRUE shows classes at bottom with default col.names="auto"
+test(2314.1, any(grepl("<int>", tail(capture.output(print(dt, class = TRUE)), 2))), TRUE)
+# Test that class=TRUE with col.names="top" doesn't show classes at bottom
+test(2314.2, !any(grepl("<int>", tail(capture.output(print(dt, class = TRUE, col.names = "top")), 2))), TRUE)
diff --git a/man/melt.data.table.Rd b/man/melt.data.table.Rd
@@ -19,7 +19,7 @@ multiple columns simultaneously.
 \arguments{
 \item{data}{ A \code{data.table} object to melt.}
 \item{id.vars}{vector of id variables. Can be integer (corresponding id
-column numbers) or character (id column names) vector. If missing, all
+column numbers) or character (id column names) vector, perhaps created using \code{patterns()}. If missing, all
 non-measure columns will be assigned to it. If integer, must be positive; see Details. }
 \item{measure.vars}{Measure variables for \code{melt}ing. Can be missing, vector, list, or pattern-based.
 
@@ -131,6 +131,7 @@ melt(DT, id.vars=1, measure.vars=c("c_1", "c_2"), na.rm=TRUE) # remove NA
 # melt "f_1,f_2" and "d_1,d_2" simultaneously, retain 'factor' attribute
 # convenient way using internal function patterns()
 melt(DT, id.vars=1:2, measure.vars=patterns("^f_", "^d_"), value.factor=TRUE)
+melt(DT, id.vars=patterns("[in]"), measure.vars=patterns("^f_", "^d_"), value.factor=TRUE)
 # same as above, but provide list of columns directly by column names or indices
 melt(DT, id.vars=1:2, measure.vars=list(3:4, c("d_1", "d_2")), value.factor=TRUE)
 # same as above, but provide names directly:
diff --git a/man/print.data.table.Rd b/man/print.data.table.Rd
@@ -41,9 +41,9 @@
   \item{x}{ A \code{data.table}. }
   \item{topn}{ The number of rows to be printed from the beginning and end of tables with more than \code{nrows} rows. }
   \item{nrows}{ The number of rows which will be printed before truncation is enforced. }
-  \item{class}{ If \code{TRUE}, the resulting output will include above each column its storage class (or a self-evident abbreviation thereof). }
+  \item{class}{ If \code{TRUE}, the resulting output will include above each column its storage class (or a self-evident abbreviation thereof). When \code{col.names="auto"} and the table has more than 20 rows, the classes are also repeated at the bottom. }
   \item{row.names}{ If \code{TRUE}, row indices will be printed alongside \code{x}. }
-  \item{col.names}{ One of three flavours for controlling the display of column names in output. \code{"auto"} includes column names above the data, as well as below the table if \code{nrow(x) > 20}. \code{"top"} excludes this lower register when applicable, and \code{"none"} suppresses column names altogether (as well as column classes if \code{class = TRUE}. }
+  \item{col.names}{ One of three flavours for controlling the display of column names in output. \code{"auto"} includes column names above the data, as well as below the table if \code{nrow(x) > 20} (when \code{class=TRUE}, column classes will also appear at the bottom). \code{"top"} excludes this lower register when applicable, and \code{"none"} suppresses column names altogether (as well as column classes if \code{class = TRUE}). }
   \item{print.keys}{ If \code{TRUE}, any \code{\link{key}} and/or \code{\link[=indices]{index}} currently assigned to \code{x} will be printed prior to the preview of the data. }
   \item{trunc.cols}{ If \code{TRUE}, only the columns that can be printed in the console without wrapping the columns to new lines will be printed (similar to \code{tibbles}). }
   \item{show.indices}{ If \code{TRUE}, indices will be printed as columns alongside \code{x}. }
diff --git a/man/setDT.Rd b/man/setDT.Rd
@@ -28,7 +28,8 @@ setDT(x, keep.rownames=FALSE, key=NULL, check.names=FALSE)
 }
 
 \seealso{
-  \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}
+  \code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{setcolorder}}, \code{\link{setattr}}, \code{\link{setnames}}, \code{\link{set}}, \code{\link{:=}}, \code{\link{setorder}}.
+  See also the FAQ vignette: \code{vignette("datatable-faq", package = "data.table")}.
 }
 \examples{
 
@@ -58,6 +59,14 @@ setDT(X, key="a")[]
 X = list(a=1:5, a=6:10)
 setDT(X, check.names=TRUE)[]
 
+# Example demonstrating setDT after loading from RDS
+rds_file = tempfile(fileext = ".rds")
+X = data.table(a = 1:5, b = letters[1:5])
+saveRDS(X, rds_file)
+X_loaded = readRDS(rds_file)
+setDT(X_loaded)  # restore internal data.table attributes
+print(X_loaded)
+unlink(rds_file)
 }
 \keyword{ data }
 
diff --git a/man/truelength.Rd b/man/truelength.Rd
diff --git a/src/assign.c b/src/assign.c
diff --git a/src/dogroups.c b/src/dogroups.c
diff --git a/src/fmelt.c b/src/fmelt.c
diff --git a/src/forder.c b/src/forder.c
diff --git a/vignettes/datatable-faq.Rmd b/vignettes/datatable-faq.Rmd