Commit cde7333

MichaelChirico, OfekShilon, Anirban166, davidbudzynski, sritchie73 authored
Add an illustrative example to ?GForce when sorting locale matters (#5342)
* improve documentation for GForce where sorting affects the result
* link issue
* tests
* typo
* mention Sys.setlocale
* obsolete comment
* 1.15.0 on CRAN. Bump to 1.15.99
* Fix transform slowness (#5493)
* Fix 5492 by limiting the costly deparse to `nlines=1`
* Implementing PR feedbacks
* Added inside
* Fix typo in name
* Idiomatic use of inside
* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <[email protected]>

* Improvements to the introductory vignette (#5836)
* Added my improvements to the intro vignette
* Removed two lines I added extra as a mistake earlier
* Requested changes
* Vignette typo patch (#5402)
* fix typos and grammatical mistakes
* fix typos and punctuation
* remove double spaces where it wasn't necessary
* fix typos and adhere to British English spelling
* fix typos
* fix typos
* add missing closing bracket
* fix typos
* review fixes
* Update vignettes/datatable-benchmarking.Rmd Co-authored-by: Michael Chirico <[email protected]>
* Update vignettes/datatable-benchmarking.Rmd Co-authored-by: Michael Chirico <[email protected]>
* Apply suggestions from code review benchmarking Co-authored-by: Michael Chirico <[email protected]>
* remove unnecessary [ ] from datatable-keys-fast-subset.Rmd
* Update vignettes/datatable-programming.Rmd Co-authored-by: Michael Chirico <[email protected]>
* Update vignettes/datatable-reshape.Rmd Co-authored-by: Michael Chirico <[email protected]>
* One last batch of fine-tuning

---------

Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>

* fix bad merge
* Improved handling of list columns with NULL entries (#4250)
* Updated documentation for rbindlist(fill=TRUE)
* Print NULL entries of list as NULL
* Added news item
* edit NEWS, use '[NULL]' not 'NULL'
* fix test
* split NEWS item
* add example

---------

Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Benjamin Schwendinger <[email protected]>

* clarify that list input->unnamed list output (#5383)
* clarify that list input->unnamed list output
* Add example where make.names is used
* mention role of make.names
* revert from next release branch
* manual merge NEWS
* manual rebase tests
* manual rebase data.table.R
* clarify 0 turns off everything

---------

Co-authored-by: Ofek <[email protected]>
Co-authored-by: Ani <[email protected]>
Co-authored-by: David Budzynski <[email protected]>
Co-authored-by: Scott Ritchie <[email protected]>
Co-authored-by: Benjamin Schwendinger <[email protected]>
1 parent 64e2041 commit cde7333

4 files changed: +30 -4 lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -617,4 +617,6 @@

 19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'.

+20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.
+
 # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
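
As a rough illustration of the three options listed in NEWS item 20, the sketch below assumes the `data.table` package is installed and reuses the small table from the new `?GForce` example. The result names (`res_gforce`, `res_base`, `res_noopt`, `res_c`) are made up for this illustration and do not appear in the commit.

```r
library(data.table)

DT = data.table(
  grp = rep(1:2, each = 4L),
  var = c("A", "a", "0", "1", "B", "b", "0", "1")
)

# GForce-eligible query: data.table computes max() itself, in C-locale order
res_gforce = DT[, .(mx = max(var)), by = grp]

# Option 1: namespace qualification keeps GForce from being activated,
# so base::max() runs and follows the session's LC_COLLATE
res_base = DT[, .(mx = base::max(var)), by = grp]

# Option 2: turn off all optimizations, then restore the previous setting
old = options(datatable.optimize = 0)
res_noopt = DT[, .(mx = max(var)), by = grp]
options(old)

# Option 3: make base R collate in C-locale as well, then restore
old_lc_collate = Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
res_c = DT[, .(mx = base::max(var)), by = grp]
invisible(Sys.setlocale("LC_COLLATE", old_lc_collate))
```

Under a non-C collation such as `en_US.UTF-8`, `res_base` and `res_noopt` may differ from `res_gforce`; with `LC_COLLATE` set to `"C"`, the base R result should agree with the GForce one.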

R/data.table.R

Lines changed: 3 additions & 3 deletions
@@ -1733,7 +1733,7 @@ replace_dot_alias = function(e) {
 GForce = FALSE
 if ( ((is.name(jsub) && jsub==".N") || (jsub %iscall% 'list' && length(jsub)==2L && jsub[[2L]]==".N")) && !length(lhs) ) {
 GForce = TRUE
-if (verbose) catf("GForce optimized j to '%s'\n",deparse(jsub, width.cutoff=200L, nlines=1L))
+if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n",deparse(jsub, width.cutoff=200L, nlines=1L))
 }
 } else if (length(lhs) && is.symbol(jsub)) { # turn off GForce for the combination of := and .N
 GForce = FALSE
@@ -1818,8 +1818,8 @@ replace_dot_alias = function(e) {
 if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L
 jsub = gforce_jsub(jsub, names_x)
 }
-if (verbose) catf("GForce optimized j to '%s'\n", deparse(jsub, width.cutoff=200L, nlines=1L))
-} else if (verbose) catf("GForce is on, left j unchanged\n");
+if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L))
+} else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n");
 }
 }
 if (!GForce && !is.name(jsub)) {
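
To see which of these verbose messages a given query triggers, one can capture the output of a `verbose=TRUE` call. This is a minimal sketch, assuming `data.table` is installed with the default `datatable.optimize` setting. The names `msgs_on` and `msgs_off` are illustrative, and the patterns match only the parts of the messages that this commit leaves unchanged.

```r
library(data.table)

DT = data.table(
  grp = rep(1:2, each = 4L),
  var = c("A", "a", "0", "1", "B", "b", "0", "1")
)

# A plain max(var) by group is eligible for GForce
msgs_on = capture.output(DT[, .(max(var)), by = grp, verbose = TRUE])

# The ad-hoc column tolower(var) keeps GForce from being activated
msgs_off = capture.output(DT[, .(max(tolower(var))), by = grp, verbose = TRUE])

# Look for the relevant verbose lines among the captured output
gforce_used = any(grepl("GForce optimized j to", msgs_on, fixed = TRUE))
gforce_skipped = any(grepl("left j unchanged", msgs_off, fixed = TRUE))
```

Matching on a stable substring rather than the full message mirrors the updated test 2042.1, which checks for a partial pattern in the output.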

inst/tests/tests.Rraw

Lines changed: 1 addition & 1 deletion
@@ -14918,7 +14918,7 @@ test(2041.2, DT[, median(time), by=g], DT[c(2,5),.(g=g, V1=time)])
 # They could run in level 1 with level 2 off, but output= would need to be changed and there's no need.
 test(2042.1, DT[ , as.character(mean(date)), by=g, verbose=TRUE ],
 data.table(g=c("a","b"), V1=c("2018-01-04","2018-01-21")),
-output=msg<-"GForce is on, left j unchanged.*Old mean optimization is on, left j unchanged")
+output=msg<-"GForce is on, but not activated.*Old mean optimization is on, left j unchanged")
 # Since %b is e.g. "janv." in LC_TIME=fr_FR.UTF-8 locale, we need to
 # have the target/y value in these tests depend on the locale as well, #3450.
 Jan.2018 = format(strptime("2018-01-01", "%Y-%m-%d"), "%b-%Y")

man/datatable-optimize.Rd

Lines changed: 24 additions & 0 deletions
@@ -21,6 +21,10 @@ of these optimisations. They happen automatically.
 Run the code under the \emph{example} section to get a feel for the performance
 benefits from these optimisations.

+Note that for all optimizations involving efficient sorts, the caveat mentioned
+in \code{\link{setorder}} applies -- whenever data.table does the sorting,
+it does so in "C-locale". This has some subtle implications; see Examples.
+
 }
 \details{
 \code{data.table} reads the global option \code{datatable.optimize} to figure
@@ -101,6 +105,8 @@ indices set global option \code{options(datatable.use.index = FALSE)}.
 \seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} }
 \examples{
 \dontrun{
+old = options(datatable.optimize = Inf)
+
 # Generate a big data.table with a relatively many columns
 set.seed(1L)
 DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
@@ -151,6 +157,24 @@ system.time(ans1 <- DT[id == 100L]) # index + binary search subset
 system.time(ans2 <- DT[id == 100L]) # only binary search subset
 system.time(DT[id \%in\% 100:500]) # only binary search subset again

+# sensitivity to collate order
+old_lc_collate = Sys.getlocale("LC_COLLATE")
+
+if (old_lc_collate == "C") {
+Sys.setlocale("LC_COLLATE", "")
+}
+DT = data.table(
+grp = rep(1:2, each = 4L),
+var = c("A", "a", "0", "1", "B", "b", "0", "1")
+)
+options(datatable.optimize = Inf)
+DT[, .(max(var), min(var)), by=grp]
+# GForce is deactivated because of the ad-hoc column 'tolower(var)',
+# through which the result for 'max(var)' may also change
+DT[, .(max(var), min(tolower(var))), by=grp]
+
+Sys.setlocale("LC_COLLATE", old_lc_collate)
+options(old)
 }}
 \keyword{ data }
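
The C-locale caveat added to the description can be seen directly with `setorder()`. The sketch below assumes `data.table` is installed; it compares the result with base R's `sort(method = "radix")`, which is documented to sort character vectors in C-locale regardless of the session's collation.

```r
library(data.table)

x = c("A", "a", "0", "1", "B", "b")
DT = data.table(x = x)

# data.table sorts in C-locale: digits, then uppercase, then lowercase
setorder(DT, x)
dt_sorted = DT$x

# Base R's radix method also ignores LC_COLLATE for character vectors
radix_sorted = sort(x, method = "radix")

# The default base sort follows the session locale and may order
# "a" before "A" under collations such as en_US.UTF-8
locale_sorted = sort(x)
```

Comparing `dt_sorted` with `locale_sorted` is a quick way to check whether the current session's collation differs from the order data.table uses internally.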
