Commit 19d9a0f

fix sep detection for single column quoted file
1 parent 55b0de6 commit 19d9a0f

3 files changed: 29 additions, 6 deletions

NEWS.md

Lines changed: 3 additions & 0 deletions
@@ -332,6 +332,9 @@
 
 19. Ellipsis elements like `..1` are correctly excluded when searching for variables in "up-a-level" syntax inside `[`, [#5460](https://github.com/Rdatatable/data.table/issues/5460). Thanks @ggrothendieck for the report and @MichaelChirico for the fix.
 
+20. `fread()` auto-detects separators for single-column files consisting solely of quoted values (e.g. `"this_that"\n"2025-01-01 00:00:01"`), [#7366](https://github.com/Rdatatable/data.table/issues/7366). Thanks @arunsrinivasan
+for the report and @ben-schwen for the fix.
+
 ### NOTES
 
 1. The following in-progress deprecations have proceeded:

inst/tests/tests.Rraw

Lines changed: 4 additions & 0 deletions
@@ -21685,3 +21685,7 @@ d3 = unserialize(serialize(d2, NULL))
 test(2340.05, .selfref.ok(d3), FALSE)
 setDT(d3)
 test(2340.06, .selfref.ok(d3), TRUE)
+
+# fread with quotes and single column #7366
+str = '"this_that"\n"2025-01-01 00:00:01"'
+test(2341, fread(str), data.table(this_that = as.POSIXct("2025-01-01 00:00:01", tz="UTC")))

src/fread.c

Lines changed: 22 additions & 6 deletions
@@ -1812,14 +1812,30 @@ int freadMain(freadMainArgs _args)
             thisBlockStart = lineStart;
           }
         }
-        if ((thisBlockLines > topNumLines && lastncol > 1) || // more lines wins even with fewer fields, so long as number of fields >= 2
-            (thisBlockLines == topNumLines &&
-             lastncol > topNumFields && // when number of lines is tied, choose the sep which separates it into more columns
-             (quoteRule < QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED || quoteRule <= topQuoteRule) && // for test 1834 where every line contains a correctly quoted field contain sep
-             (topNumFields <= 1 || sep != ' '))) {
+        bool blockHasQuote = false;
+        if (quote && lastncol == 1) {
+          for (const char *scan = thisBlockStart; scan < ch; scan++) {
+            if (*scan == quote) {
+              blockHasQuote = true;
+              break;
+            }
+            if (*scan == '\n' || *scan == '\r') continue;
+          }
+        }
+        bool singleColumnCandidate = (lastncol == 1 && thisBlockLines >= 2 && blockHasQuote && quoteRule < QUOTE_RULE_IGNORE_QUOTES);
+        // more contiguous rows than the current best; only allow 1-column wins while we still have no multi-column pick
+        bool betterLines = thisBlockLines > topNumLines && (lastncol > 1 || (singleColumnCandidate && topNumFields <= 1));
+        // first multi-column candidate after only single-column options so far
+        bool promoteOverSingle = (topNumFields <= 1 && lastncol > topNumFields && thisBlockLines >= 2);
+        // same number of rows as current best but more fields (legacy tie-breaker)
+        bool betterTie = (thisBlockLines == topNumLines &&
+                          lastncol > topNumFields && // when number of lines is tied, choose the sep which separates it into more columns
+                          (quoteRule < QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED || quoteRule <= topQuoteRule) && // for test 1834 where every line contains a correctly quoted field contain sep
+                          (topNumFields <= 1 || sep != ' '));
+        if (betterLines || promoteOverSingle || betterTie) {
           topNumLines = thisBlockLines;
           topNumFields = lastncol;
-          topSep = sep;
+          topSep = (singleColumnCandidate && lastncol == 1) ? 127 : sep; // treat consistent single-column quoted blocks as single-column input (#7366)
           topQuoteRule = quoteRule;
           firstJumpEnd = ch;
           topStart = thisBlockStart;
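
The three predicates in the hunk above can be read in isolation from the rest of freadMain. The following is a minimal standalone sketch of that decision, not the code in src/fread.c. The `Candidate` and `Best` structs, `best_init()`, `block_has_quote()`, `consider()` and `SINGLE_COLUMN_SEP` are hypothetical names introduced for illustration. The numeric values of the quote-rule constants and the starting state (no lines seen, one field, separator 127) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed numeric values; only their relative order matters in this sketch. */
enum {
  QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED = 2,
  QUOTE_RULE_IGNORE_QUOTES = 3
};

/* 127 is the value the diff assigns to topSep for single-column input. */
#define SINGLE_COLUMN_SEP 127

/* Hypothetical: one (sep, quoteRule) trial over a block of lines. */
typedef struct {
  int lines;      /* thisBlockLines in the diff */
  int ncol;       /* lastncol in the diff */
  int quoteRule;
  char sep;
  bool hasQuote;  /* blockHasQuote in the diff */
} Candidate;

/* Hypothetical: the best pick so far (the top* variables in the diff). */
typedef struct {
  int numLines;
  int numFields;
  int quoteRule;
  char sep;
} Best;

/* Assumed starting state: nothing seen yet, single-column fallback. */
static Best best_init(void) {
  Best b = { .numLines = 0, .numFields = 1, .quoteRule = -1, .sep = SINGLE_COLUMN_SEP };
  return b;
}

/* True if the quote character occurs anywhere in [start, end). */
static bool block_has_quote(const char *start, const char *end, char quote) {
  return quote != 0 && memchr(start, quote, (size_t)(end - start)) != NULL;
}

/* Applies the three predicates from the diff; returns true if c replaces top. */
static bool consider(Best *top, Candidate c) {
  assert(top != NULL);
  bool single = c.ncol == 1 && c.lines >= 2 && c.hasQuote &&
                c.quoteRule < QUOTE_RULE_IGNORE_QUOTES;
  /* More rows than the best; a 1-column win is allowed only while the best is still 1-column. */
  bool betterLines = c.lines > top->numLines &&
                     (c.ncol > 1 || (single && top->numFields <= 1));
  /* First multi-column candidate after only single-column options. */
  bool promoteOverSingle = top->numFields <= 1 && c.ncol > top->numFields && c.lines >= 2;
  /* Same number of rows as the best but more fields. */
  bool betterTie = c.lines == top->numLines && c.ncol > top->numFields &&
                   (c.quoteRule < QUOTE_RULE_EMBEDDED_QUOTES_NOT_ESCAPED ||
                    c.quoteRule <= top->quoteRule) &&
                   (top->numFields <= 1 || c.sep != ' ');
  if (!(betterLines || promoteOverSingle || betterTie)) return false;
  top->numLines = c.lines;
  top->numFields = c.ncol;
  top->sep = single ? SINGLE_COLUMN_SEP : c.sep;
  top->quoteRule = c.quoteRule;
  return true;
}
```

With the input from test 2341, a two-line, one-column candidate that contains a quote is accepted and recorded with separator 127. A later one-line, two-column candidate is rejected because it has neither more rows nor the two rows that `promoteOverSingle` requires.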
