
Commit 5921b46

Merge branch 'master' into omp_limits
2 parents 7404bcd + a8ad00f commit 5921b46

7 files changed, 57 insertions(+), 9 deletions(-)

DESCRIPTION

Lines changed: 2 additions & 1 deletion
@@ -104,5 +104,6 @@ Authors@R: c(
 person("Reino", "Bruner", role="ctb"),
 person(given="@badasahog", role="ctb", comment="GitHub user"),
 person("Vinit", "Thakur", role="ctb"),
-person("Mukul", "Kumar", role="ctb")
+person("Mukul", "Kumar", role="ctb"),
+person("Ildikó", "Czeller", role="ctb")
 )

NEWS.md

Lines changed: 5 additions & 1 deletion
@@ -340,7 +340,9 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
 
 20. `forderv` could segfault on keys with long runs of identical bytes (e.g., many duplicate columns) because the single-group branch tail-recursed radix-by-radix until the C stack ran out, [#4300](https://github.com/Rdatatable/data.table/issues/4300). This is a major problem since sorting is extensively used in `data.table`. Thanks @quantitative-technologies for the report and @ben-schwen for the fix.
 
-21. `setDTthreads(percent=)` and `setDTthreads(threads=)` now respect `OMP_NUM_THREADS` and `omp_get_max_threads()`, ensuring consistency with `setDTthreads()` (no arguments) when OpenMP environment variables are set, [#7165](https://github.com/Rdatatable/data.table/issues/7165). Previously, explicitly setting a thread count or percentage would ignore these OpenMP limits, potentially exceeding the user's intended thread cap. Thanks to @bastistician for the report and @ben-schwen for the fix.
+21. `[` now preserves existing key(s) when new columns are added before them, instead of incorrectly setting a new column as key, [#7364](https://github.com/Rdatatable/data.table/issues/7364). Thanks @czeildi for the bug report and the fix.
+
+22. `setDTthreads(percent=)` and `setDTthreads(threads=)` now respect `OMP_NUM_THREADS` and `omp_get_max_threads()`, ensuring consistency with `setDTthreads()` (no arguments) when OpenMP environment variables are set, [#7165](https://github.com/Rdatatable/data.table/issues/7165). Previously, explicitly setting a thread count or percentage would ignore these OpenMP limits, potentially exceeding the user's intended thread cap. Thanks to @bastistician for the report and @ben-schwen for the fix.
 
 ### NOTES
 
@@ -538,6 +540,8 @@ rowwiseDT(
 
 21. `setDT(get0('var'))` now correctly modifies `var` by reference, consistent with the long-standing behavior of `setDT(get('var'))`, [#6864](https://github.com/Rdatatable/data.table/issues/6864). Thanks to @rikivillalba for the report and @venom1204 for the fix.
 
+22. `fread()` could fail to read Mac CSV files (with `\r` line endings) if the file contained any `\n` character, such as a final `\r\n`. This was fixed by detecting the predominant line ending in a sample of the file, [#4186](https://github.com/Rdatatable/data.table/issues/4186). Thanks to @MPagel for the report and @ben-schwen for the fix.
+
 ### NOTES
 
 1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).

R/data.table.R

Lines changed: 1 addition & 1 deletion
@@ -1448,7 +1448,7 @@ replace_dot_alias = function(e) {
 if (SD_only)
   jvnames = jnames = sdvars
 else
-  jnames = as.character(Filter(is.name, jsub)[-1L])
+  jnames = vapply_1c(jsub, function(x) if (is.name(x)) as.character(x) else NA_character_)[-1L]
 key_idx = chmatch(key, jnames)
 missing_keys = which(is.na(key_idx))
 if (length(missing_keys) && missing_keys[1L] == 1L) return(NULL)

inst/tests/tests.Rraw

Lines changed: 20 additions & 0 deletions
@@ -11952,6 +11952,8 @@ tt = setDT(read.csv(f, stringsAsFactors=FALSE))
 tt[2, B:=gsub("\n","\r",B)] # base R changes the \r to a \n, so restore that
 test(1778.4, tt, DT)
 unlink(f)
+# fread has problems with mixed \r and \r\n #4186
+test(1778.5, fread(text="Col1;Col2\r1;data1\r2;data2\r3;data3\r\n", verbose=TRUE), data.table(Col1=1:3, Col2=c("data1","data2","data3")), output="An \\r")
 
 # #1392 IDate ITime new methods for faster conversion
 # conversion in-out match for UTC
@@ -21835,3 +21837,21 @@ DT[, V1000 := 20:1]
 test(2343.1, forderv(DT, by=names(DT), sort=FALSE, retGrp=TRUE), forderv(DT, by=c("V1", "V1000"), sort=FALSE, retGrp=TRUE))
 x = c(rep(0, 7e5), 1e6)
 test(2343.2, forderv(list(x)), integer(0))
+
+# Keep key when new column added before existing key in j
+# Incorrect key can lead to incorrect join result #7364
+DT = data.table(V1 = 1:2, key = "V1")
+test(2344.00, key(DT[, .(V2 = c("b", "a"), V1)]), "V1")
+test(2344.01, key(DT[, .(V2 = -V1, V1)]), "V1")
+
+d1 = data.table(V1 = c(1L, 0L, 1L), V2 = c("a", "a", "b"), key = "V2")
+d2 = d1[, .(V1, label = c("one", "zero", "one"), V2)]
+r = d2[data.table(label = "one"), on = "label", allow.cartesian = TRUE]
+test(2344.02, nrow(r), 2L)
+# join result of keyed input is the same as unkeyed input
+test(2344.03, setkey(d1[, .(V1, label = c("one", "zero", "one"), V2)][data.table(label = "one"), on = "label", allow.cartesian = TRUE], NULL),
+  setkey(d1, NULL)[, .(V1, label = c("one", "zero", "one"), V2)][data.table(label = "one"), on = "label", allow.cartesian = TRUE])
+
+# keep sub-key in case of multiple keys, even with new columns and changing column order
+DT = data.table(V1 = 1:2, V2 = 3:4, V3 = 5:6, key = c("V1", "V2", "V3"))
+test(2344.04, key(DT[, .(V4 = c("b", "a"), V2, V5 = c("y", "x"), V1)]), c("V1", "V2"))

man/measure.Rd

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ melt(who, measure.vars = measure(diagnosis, gender, ages, pattern="new_?(.*)_(.)
 print(melt(who, measure.vars = measure(
   diagnosis, gender, ages,
   ymin=as.numeric,
-  ymax=function(y)ifelse(y=="", Inf, as.numeric(y)),
+  ymax=function(y)ifelse(nzchar(y), as.numeric(y), Inf),
   pattern="new_?(.*)_(.)(([0-9]{2})([0-9]{0,2}))"
 )), class=TRUE)
 }

src/fread.c

Lines changed: 27 additions & 4 deletions
@@ -1628,12 +1628,35 @@ int freadMain(freadMainArgs _args)
   if (verbose) DTPRINT(_("[04] Arrange mmap to be \\0 terminated\n"));
 
   // First, set 'eol_one_r' for use by eol() to know if \r-only line ending is allowed, #2371
+  // Count different line ending types to handle mixed endings (e.g. Mac CSV with mostly \r and final \r\n) #4186
+  int count_r_only = 0; // \r not followed by \n
+  int count_with_n = 0; // \n with or without \r
   ch = sof;
-  while (ch < eof && *ch != '\n') ch++;
-  eol_one_r = (ch == eof);
+  const char *sample_end = eof;
+  if ((size_t)(eof - sof) > 100000) sample_end = sof + 100000; // Sample first 100KB or whole file if smaller
+  while (ch < sample_end) {
+    if (*ch == '\r') {
+      // Skip consecutive \r to avoid miscounting \r\r\n as multiple line endings
+      while (ch < sample_end && *ch == '\r') ch++;
+      if (ch < sample_end && *ch == '\n') {
+        count_with_n++;
+        ch++;
+      } else {
+        count_r_only++;
+      }
+    } else if (*ch == '\n') {
+      count_with_n++;
+      ch++;
+    } else {
+      ch++;
+    }
+  }
+  // If file has mostly \r-only line endings, treat \r as line ending
+  eol_one_r = (count_r_only > count_with_n);
   if (verbose) DTPRINT(eol_one_r ?
-    _(" No \\n exists in the file at all, so single \\r (if any) will be taken as one line ending. This is unusual but will happen normally when there is no \\r either; e.g. a single line missing its end of line.\n") :
-    _(" \\n has been found in the input and different lines can end with different line endings (e.g. mixed \\n and \\r\\n in one file). This is common and ideal.\n"));
+    _(" An \\r by itself will be taken as one line ending (counts: %d \\r by themselves vs %d [\\r]*\\n). This happens with old Mac CSV or when there is no \\r at all.\n") :
+    _(" \\n has been found in the input (counts: %d \\r by themselves vs %d [\\r]*\\n) and different lines can end with different line endings (e.g. mixed \\n and \\r\\n in one file). This is common and ideal.\n"),
+    count_r_only, count_with_n);
 
   bool lastEOLreplaced = false;
   if (args.filename) {
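
The sampling logic in this hunk can be tried outside `fread.c` with a small standalone sketch. It is an illustration only: `eol_counts`, `count_eols` and `prefers_lone_cr` are hypothetical names that do not exist in data.table, and the 100000-byte cap is copied from the hunk rather than taken from any documented setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical names throughout; none of these exist in fread.c.
typedef struct {
  int r_only;  // runs of \r that are not followed by \n
  int with_n;  // \n, optionally preceded by one or more \r
} eol_counts;

// Count line-ending styles in at most `limit` bytes starting at `sof`.
eol_counts count_eols(const char *sof, const char *eof, size_t limit) {
  assert(sof <= eof);
  eol_counts c = {0, 0};
  const char *end = ((size_t)(eof - sof) > limit) ? sof + limit : eof;
  const char *ch = sof;
  while (ch < end) {
    if (*ch == '\r') {
      // Treat a run such as \r\r\n as a single line ending.
      while (ch < end && *ch == '\r') ch++;
      if (ch < end && *ch == '\n') {
        c.with_n++;
        ch++;
      } else {
        c.r_only++;
      }
    } else if (*ch == '\n') {
      c.with_n++;
      ch++;
    } else {
      ch++;
    }
  }
  return c;
}

// True when lone \r endings outnumber endings containing \n in the sample.
bool prefers_lone_cr(const char *text) {
  eol_counts c = count_eols(text, text + strlen(text), 100000);
  return c.r_only > c.with_n;
}
```

With the input from test 1778.5, `"Col1;Col2\r1;data1\r2;data2\r3;data3\r\n"`, the sketch counts three lone `\r` against one `\r\n`, so `prefers_lone_cr` returns true. An input with no line endings at all ties at zero, and the strict `>` comparison then leaves it classified as not `\r`-only.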

vignettes/datatable-reshape.Rmd

Lines changed: 1 addition & 1 deletion
@@ -279,7 +279,7 @@ melt(who, measure.vars = measure(
 
 When using the `pattern` argument, it must be a Perl-compatible
 regular expression containing the same number of capture groups
-(parenthesized sub-expressions) as the number other arguments (group
+(parenthesized sub-expressions) as the number of other arguments (group
 names). The code below shows how to use a more complex regex with five
 groups, two numeric output columns, and an anonymous type conversion
 function,
