Add an illustrative example to ?GForce when sorting locale matters (#5342)

MichaelChirico · OfekShilon · Anirban166 · web-flow · commit cde7333938a5 · 2024-01-13T01:10:42.000+08:00
* improve documentation for GForce where sorting affects the result
* link issue
* tests
* typo
* mention Sys.setlocale
* obsolete comment
* 1.15.0 on CRAN. Bump to 1.15.99
* Fix transform slowness (#5493)
* Fix 5492 by limiting the costly deparse to `nlines=1`
* Implementing PR feedbacks
* Added inside
* Fix typo in name
* Idiomatic use of inside
* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>

* Improvements to the introductory vignette (#5836)
* Added my improvements to the intro vignette
* Removed two lines I added extra as a mistake earlier
* Requested changes
* Vignette typo patch (#5402)
* fix typos and grammatical mistakes
* fix typos and punctuation
* remove double spaces where it wasn't necessary
* fix typos and adhere to British English spelling
* fix typos
* fix typos
* add missing closing bracket
* fix typos
* review fixes
* Update vignettes/datatable-benchmarking.Rmd
  Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
* Update vignettes/datatable-benchmarking.Rmd
  Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
* Apply suggestions from code review benchmarking
  Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
* remove unnecessary [ ] from datatable-keys-fast-subset.Rmd
* Update vignettes/datatable-programming.Rmd
  Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
* Update vignettes/datatable-reshape.Rmd
  Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
* One last batch of fine-tuning

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <chiricom@google.com>

* fix bad merge
* Improved handling of list columns with NULL entries (#4250)
* Updated documentation for rbindlist(fill=TRUE)
* Print NULL entries of list as NULL
* Added news item
* edit NEWS, use '[NULL]' not 'NULL'
* fix test
* split NEWS item
* add example

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>

* clarify that list input->unnamed list output (#5383)
* clarify that list input->unnamed list output
* Add example where make.names is used
* mention role of make.names
* revert from next release branch
* manual merge NEWS
* manual rebase tests
* manual rebase data.table.R
* clarify 0 turns off everything

---------

Co-authored-by: Ofek <ofekshilon@gmail.com>
Co-authored-by: Ani <bloodraven166@gmail.com>
Co-authored-by: David Budzynski <56514985+davidbudzynski@users.noreply.github.com>
Co-authored-by: Scott Ritchie <sritchie73@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>
diff --git a/NEWS.md b/NEWS.md
@@ -617,4 +617,6 @@
 
 19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'.
 
+20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.
+
 # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
diff --git a/R/data.table.R b/R/data.table.R
@@ -1733,7 +1733,7 @@ replace_dot_alias = function(e) {
         GForce = FALSE
         if ( ((is.name(jsub) && jsub==".N") || (jsub %iscall% 'list' && length(jsub)==2L && jsub[[2L]]==".N")) && !length(lhs) ) {
           GForce = TRUE
-          if (verbose) catf("GForce optimized j to '%s'\n",deparse(jsub, width.cutoff=200L, nlines=1L))
+          if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n",deparse(jsub, width.cutoff=200L, nlines=1L))
         }
       } else if (length(lhs) && is.symbol(jsub)) { # turn off GForce for the combination of := and .N
         GForce = FALSE
@@ -1818,8 +1818,8 @@ replace_dot_alias = function(e) {
             if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L
             jsub = gforce_jsub(jsub, names_x)
           }
-          if (verbose) catf("GForce optimized j to '%s'\n", deparse(jsub, width.cutoff=200L, nlines=1L))
-        } else if (verbose) catf("GForce is on, left j unchanged\n");
+          if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L))
+        } else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n");
       }
     }
     if (!GForce && !is.name(jsub)) {
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -14918,7 +14918,7 @@ test(2041.2, DT[, median(time), by=g], DT[c(2,5),.(g=g, V1=time)])
 # They could run in level 1 with level 2 off, but output= would need to be changed and there's no need.
 test(2042.1, DT[ , as.character(mean(date)), by=g, verbose=TRUE ],
              data.table(g=c("a","b"), V1=c("2018-01-04","2018-01-21")),
-     output=msg<-"GForce is on, left j unchanged.*Old mean optimization is on, left j unchanged")
+     output=msg<-"GForce is on, but not activated.*Old mean optimization is on, left j unchanged")
 # Since %b is e.g. "janv." in LC_TIME=fr_FR.UTF-8 locale, we need to
 # have the target/y value in these tests depend on the locale as well, #3450.
 Jan.2018 = format(strptime("2018-01-01", "%Y-%m-%d"), "%b-%Y")
diff --git a/man/datatable-optimize.Rd b/man/datatable-optimize.Rd
@@ -21,6 +21,10 @@ of these optimisations. They happen automatically.
 Run the code under the \emph{example} section to get a feel for the performance
 benefits from these optimisations.
 
+Note that for all optimizations involving efficient sorts, the caveat mentioned
+in \code{\link{setorder}} applies -- whenever data.table does the sorting,
+it does so in "C-locale". This has some subtle implications; see Examples.
+
 }
 \details{
 \code{data.table} reads the global option \code{datatable.optimize} to figure
@@ -101,6 +105,8 @@ indices set global option \code{options(datatable.use.index = FALSE)}.
 \seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} }
 \examples{
 \dontrun{
+old = options(datatable.optimize = Inf)
+
 # Generate a big data.table with a relatively many columns
 set.seed(1L)
 DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
@@ -151,6 +157,24 @@ system.time(ans1 <- DT[id == 100L]) # index + binary search subset
 system.time(ans2 <- DT[id == 100L]) # only binary search subset
 system.time(DT[id \%in\% 100:500])    # only binary search subset again
 
+# sensitivity to collate order
+old_lc_collate = Sys.getlocale("LC_COLLATE")
+
+if (old_lc_collate == "C") {
+  Sys.setlocale("LC_COLLATE", "")
+}
+DT = data.table(
+  grp = rep(1:2, each = 4L),
+  var = c("A", "a", "0", "1", "B", "b", "0", "1")
+)
+options(datatable.optimize = Inf)
+DT[, .(max(var), min(var)), by=grp]
+# GForce is deactivated because of the ad-hoc column 'tolower(var)',
+#   through which the result for 'max(var)' may also change
+DT[, .(max(var), min(tolower(var))), by=grp]
+
+Sys.setlocale("LC_COLLATE", old_lc_collate)
+options(old)
 }}
 \keyword{ data }
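
A minimal sketch of workaround (3) from the NEWS item above, i.e. pinning `LC_COLLATE` to "C" so that base R and data.table's own sorting agree. It reuses group 1 of the data from the new `?GForce` example. `with_c_collate` is a hypothetical helper name introduced here for illustration; it is not part of data.table or base R. The data.table part only runs if the package is installed, and no particular result is claimed for it.

```r
# Group 1 of the example added to ?GForce in this commit.
var <- c("A", "a", "0", "1")

# Hypothetical helper (not part of data.table): evaluate `expr` with
# LC_COLLATE temporarily set to "C", then restore the previous setting.
with_c_collate <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)  # the lazy argument is evaluated here, under "C" collation
}

locale_max <- max(var)                  # depends on the session's LC_COLLATE
c_max      <- with_c_collate(max(var))  # byte order "0" < "1" < "A" < "a", so "a"

# Base R's radix sort always collates in C locale, whatever the session locale.
radix_sorted <- sort(var, method = "radix")

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- data.table(
    grp = rep(1:2, each = 4L),
    var = c("A", "a", "0", "1", "B", "b", "0", "1")
  )
  # Plain max(): eligible for GForce, which compares in C locale.
  res_gforce <- DT[, .(mx = max(var)), by = grp]
  # base::max() deactivates GForce (workaround 1 in the NEWS item);
  # wrapping it in with_c_collate() keeps the collation the same anyway.
  res_base <- with_c_collate(DT[, .(mx = base::max(var)), by = grp])
  print(identical(res_gforce, res_base))
}
```

Whether `locale_max` differs from `c_max` depends on the platform and the session locale, which is exactly the inconsistency the NEWS item describes. Passing `verbose=TRUE` to either query shows whether GForce was activated.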
 
