
Commit df7fa80

fix sep detection for single column quoted file (#7370)
* fix sep detection for single column quoted file
* restore old comment
* remove unnecessary check
* simplify quote scan
* Renumber NEWS
1 parent 1685a3b commit df7fa80


3 files changed, +26 -6 lines changed


NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -344,6 +344,8 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
 
 22. `setDTthreads(percent=)` and `setDTthreads(threads=)` now respect `OMP_NUM_THREADS` and `omp_get_max_threads()`, ensuring consistency with `setDTthreads()` (no arguments) when OpenMP environment variables are set, [#7165](https://github.com/Rdatatable/data.table/issues/7165). Previously, explicitly setting a thread count or percentage would ignore these OpenMP limits, potentially exceeding the user's intended thread cap. Thanks to @bastistician for the report and @ben-schwen for the fix.
 
+23. `fread()` auto-detects separators for single-column files consisting solely of quoted values (e.g. `"this_that"\n"2025-01-01 00:00:01"`), [#7366](https://github.com/Rdatatable/data.table/issues/7366). Thanks @arunsrinivasan for the report and @ben-schwen for the fix.
+
 ### NOTES
 
 1. The following in-progress deprecations have proceeded:

inst/tests/tests.Rraw

Lines changed: 3 additions & 0 deletions
@@ -21855,3 +21855,6 @@ test(2344.03, setkey(d1[, .(V1, label = c("one", "zero", "one"), V2)][data.table
 # keep sub-key in case of multiple keys, even with new columns and changing column order
 DT = data.table(V1 = 1:2, V2 = 3:4, V3 = 5:6, key = c("V1", "V2", "V3"))
 test(2344.04, key(DT[, .(V4 = c("b", "a"), V2, V5 = c("y", "x"), V1)]), c("V1", "V2"))
+
+# fread with quotes and single column #7366
+test(2345, fread('"this_that"\n"2025-01-01 00:00:01"'), data.table(this_that = as.POSIXct("2025-01-01 00:00:01", tz="UTC")))
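The input in test 2345 is a two-line block in which every field is quoted and the only space sits inside the second quoted value. As a rough illustration of the quote scan that the `src/fread.c` change runs over such a block, here is a minimal standalone sketch. The helper name `block_has_quote` and its signature are hypothetical and not part of `fread.c`; in the real code the scan runs inline between `thisBlockStart` and `ch`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper, not part of fread.c: reports whether any byte in
// the half-open range [start, end) equals the quote character.
// A quote of '\0' means quoting is disabled, mirroring the `quote &&` guard
// in the patch, so the scan is skipped and false is returned.
static bool block_has_quote(const char *start, const char *end, char quote)
{
  assert(start != NULL && end != NULL && start <= end);
  if (!quote) return false;
  for (const char *scan = start; scan < end; scan++) {
    if (*scan == quote) return true;  // stop at the first quote found
  }
  return false;
}
```

Called on the bytes of `"this_that"\n"2025-01-01 00:00:01"` with `'"'` as the quote character, the helper finds a quote at the first byte, which is the condition the patch uses to treat a one-column block as a genuine single-column candidate.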

src/fread.c

Lines changed: 21 additions & 6 deletions
@@ -1899,14 +1899,29 @@ int freadMain(freadMainArgs _args)
           thisBlockStart = lineStart;
         }
       }
-      if ((thisBlockLines > topNumLines && lastncol > 1) || // more lines wins even with fewer fields, so long as number of fields >= 2
-          (thisBlockLines == topNumLines &&
-           lastncol > topNumFields && // when number of lines is tied, choose the sep which separates it into more columns
-           (quoteRule < QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED || quoteRule <= topQuoteRule) && // for test 1834 where every line contains a correctly quoted field contain sep
-           (topNumFields <= 1 || sep != ' '))) {
+      bool blockHasQuote = false;
+      if (quote && lastncol == 1) {
+        for (const char *scan = thisBlockStart; scan < ch; scan++) {
+          if (*scan == quote) {
+            blockHasQuote = true;
+            break;
+          }
+        }
+      }
+      bool singleColumnCandidate = (lastncol == 1 && thisBlockLines >= 2 && blockHasQuote && quoteRule < QUOTE_RULE_IGNORE_QUOTES);
+      // more contiguous rows than the current best; only allow 1-column wins while we still have no multi-column pick
+      bool betterLines = thisBlockLines > topNumLines && (lastncol > 1 || (singleColumnCandidate && topNumFields <= 1));
+      // first multi-column candidate after only single-column options so far
+      bool promoteOverSingle = (topNumFields <= 1 && lastncol > topNumFields && thisBlockLines >= 2);
+      // more lines wins even with fewer fields, so long as number of fields >= 2
+      bool betterTie = (thisBlockLines == topNumLines &&
+                        lastncol > topNumFields && // when number of lines is tied, choose the sep which separates it into more columns
+                        (quoteRule < QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED || quoteRule <= topQuoteRule) && // for test 1834 where every line contains a correctly quoted field contain sep
+                        (topNumFields <= 1 || sep != ' '));
+      if (betterLines || promoteOverSingle || betterTie) {
         topNumLines = thisBlockLines;
         topNumFields = lastncol;
-        topSep = sep;
+        topSep = singleColumnCandidate ? 127 : sep; // treat consistent single-column quoted blocks as single-column input (#7366)
         topQuoteRule = quoteRule;
         firstJumpEnd = ch;
         topStart = thisBlockStart;
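To see how the three conditions interact, the selection step can be modelled outside of `freadMain`. The sketch below is a simplified standalone model under stated assumptions, not the code in `fread.c`: the `Candidate` and `Best` structs and the `consider` function are hypothetical names, the numeric quote rule values are assumed to follow the ordering implied by the comparisons in the patch, and the starting state used in the usage note (no lines seen, one field, separator 127) is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Assumed numeric values for two of the quote rules; only their relative
// order matters for the comparisons below.
enum { RULE_EMBEDDED_QUOTES_NOT_ESCAPED = 2, RULE_IGNORE_QUOTES = 3 };

// Hypothetical summary of one (sep, quoteRule) trial over a block of lines.
typedef struct {
  int lines;      // stands in for thisBlockLines
  int ncol;       // stands in for lastncol
  int quoteRule;
  char sep;
  bool hasQuote;  // stands in for blockHasQuote
} Candidate;

// Hypothetical record of the best trial seen so far (the top* variables).
typedef struct {
  int numLines;
  int numFields;
  int quoteRule;
  char sep;
} Best;

// Applies the patched acceptance test; returns true if `c` replaced `top`.
static bool consider(Best *top, Candidate c)
{
  assert(top != NULL);
  bool singleColumnCandidate = c.ncol == 1 && c.lines >= 2 && c.hasQuote &&
                               c.quoteRule < RULE_IGNORE_QUOTES;
  // more rows than the best so far; 1-column wins only while no multi-column pick exists
  bool betterLines = c.lines > top->numLines &&
                     (c.ncol > 1 || (singleColumnCandidate && top->numFields <= 1));
  // first multi-column candidate after only single-column options
  bool promoteOverSingle = top->numFields <= 1 && c.ncol > top->numFields && c.lines >= 2;
  // same number of rows: prefer the separator that yields more columns
  bool betterTie = c.lines == top->numLines && c.ncol > top->numFields &&
                   (c.quoteRule < RULE_EMBEDDED_QUOTES_NOT_ESCAPED || c.quoteRule <= top->quoteRule) &&
                   (top->numFields <= 1 || c.sep != ' ');
  if (!(betterLines || promoteOverSingle || betterTie)) return false;
  top->numLines = c.lines;
  top->numFields = c.ncol;
  top->sep = singleColumnCandidate ? 127 : c.sep;  // 127 marks single-column input
  top->quoteRule = c.quoteRule;
  return true;
}
```

Starting from `Best top = {0, 1, -1, 127}`, a two-line, one-column quoted block is accepted through `betterLines` and recorded with separator 127. A later two-line, three-column candidate still displaces it through `promoteOverSingle`, and after that a longer one-column quoted block is rejected because `topNumFields` is no longer 1.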
