Commit ea8f12b

committed
add sampling of line endings
1 parent 67db7f7 commit ea8f12b

3 files changed: 30 additions, 4 deletions

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -529,6 +529,8 @@ rowwiseDT(
 
 21. `setDT(get0('var'))` now correctly modifies `var` by reference, consistent with the long-standing behavior of `setDT(get('var'))`, [#6864](https://github.com/Rdatatable/data.table/issues/6864). Thanks to @rikivillalba for the report and @venom1204 for the fix.
 
+22. `fread()` could fail to read Mac CSV files (with `\r` line endings) if the file contained any `\n` character, such as a final `\r\n`. This was fixed by detecting the predominant line ending in a sample of the file, [#4186](https://github.com/Rdatatable/data.table/issues/4186). Thanks to @MPagel for the report and @ben-schwen for the fix.
+
 ### NOTES
 
 1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).

inst/tests/tests.Rraw

Lines changed: 2 additions & 0 deletions
@@ -11952,6 +11952,8 @@ tt = setDT(read.csv(f, stringsAsFactors=FALSE))
 tt[2, B:=gsub("\n","\r",B)] # base R changes the \r to a \n, so restore that
 test(1778.4, tt, DT)
 unlink(f)
+# fread has problems with mixed \r and \r\n #4186
+test(1778.5, fread(text="Col1;Col2\r1;data1\r2;data2\r3;data3\r\n"), data.table(Col1=1:3, Col2=c("data1","data2","data3")))
 
 # #1392 IDate ITime new methods for faster conversion
 # conversion in-out match for UTC

src/fread.c

Lines changed: 26 additions & 4 deletions
@@ -1628,12 +1628,34 @@ int freadMain(freadMainArgs _args)
   if (verbose) DTPRINT(_("[04] Arrange mmap to be \\0 terminated\n"));
 
   // First, set 'eol_one_r' for use by eol() to know if \r-only line ending is allowed, #2371
+  // Count different line ending types to handle mixed endings (e.g. Mac CSV with mostly \r and final \r\n) #4186
+  int count_r_only = 0; // \r not followed by \n
+  int count_with_n = 0; // \n with or without \r
   ch = sof;
-  while (ch < eof && *ch != '\n') ch++;
-  eol_one_r = (ch == eof);
+  const char *sample_end = eof;
+  if ((size_t)(eof - sof) > 100000) sample_end = sof + 100000; // Sample first 100KB or whole file if smaller
+  while (ch < sample_end) {
+    if (*ch == '\r') {
+      if (ch + 1 < sample_end && ch[1] == '\n') {
+        count_with_n++;
+        ch += 2; // skip \r\n
+      } else {
+        count_r_only++;
+        ch++;
+      }
+    } else if (*ch == '\n') {
+      count_with_n++;
+      ch++;
+    } else {
+      ch++;
+    }
+  }
+  // If file has mostly \r-only line endings, treat \r as line ending
+  eol_one_r = (count_r_only > count_with_n);
   if (verbose) DTPRINT(eol_one_r ?
-    _(" No \\n exists in the file at all, so single \\r (if any) will be taken as one line ending. This is unusual but will happen normally when there is no \\r either; e.g. a single line missing its end of line.\n") :
-    _(" \\n has been found in the input and different lines can end with different line endings (e.g. mixed \\n and \\r\\n in one file). This is common and ideal.\n"));
+    _(" Single \\r (if any) will be taken as one line ending (count: %d \\r vs %d \\n). This happens with old Mac CSV or when there is no \\r either.\n") :
+    _(" \\n has been found in the input (count: %d \\r vs %d \\n) and different lines can end with different line endings (e.g. mixed \\n and \\r\\n in one file). This is common and ideal.\n"),
+    count_r_only, count_with_n);
 
   bool lastEOLreplaced = false;
   if (args.filename) {
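
To make the behavioral change in the `src/fread.c` hunk easier to check in isolation, here is a minimal standalone sketch of the two detection rules. The function names `eol_one_r_old` and `eol_one_r_new` and the macro `EOL_SAMPLE_BYTES` are hypothetical and do not appear in data.table; the logic is copied from the removed and added lines of the diff, under the assumption that `sof` and `eof` delimit the input buffer as they do in `freadMain`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical name for the 100000-byte sample cap used in the diff. */
#define EOL_SAMPLE_BYTES 100000

/* Rule removed by the commit: a lone \r counts as a line ending only
 * when no \n exists anywhere in [sof, eof). */
static bool eol_one_r_old(const char *sof, const char *eof)
{
  assert(sof != NULL && eof != NULL && sof <= eof);
  const char *ch = sof;
  while (ch < eof && *ch != '\n') ch++;
  return ch == eof;
}

/* Rule added by the commit: count line endings in the first
 * EOL_SAMPLE_BYTES bytes and treat \r as a line ending only when
 * \r-only endings strictly outnumber endings that contain \n. */
static bool eol_one_r_new(const char *sof, const char *eof)
{
  assert(sof != NULL && eof != NULL && sof <= eof);
  int count_r_only = 0; /* \r not followed by \n */
  int count_with_n = 0; /* \n, with or without a preceding \r */
  const char *sample_end = eof;
  if ((size_t)(eof - sof) > EOL_SAMPLE_BYTES) sample_end = sof + EOL_SAMPLE_BYTES;
  for (const char *ch = sof; ch < sample_end; ) {
    if (*ch == '\r') {
      if (ch + 1 < sample_end && ch[1] == '\n') {
        count_with_n++;
        ch += 2; /* consume \r\n as a single ending */
      } else {
        count_r_only++;
        ch++;
      }
    } else {
      if (*ch == '\n') count_with_n++;
      ch++;
    }
  }
  return count_r_only > count_with_n;
}
```

On the input from test 1778.5, the old rule stops at the final `\n` and returns false, while the new rule counts three `\r`-only endings against one `\r\n` and returns true. One consequence of the strict comparison in this sketch: an input with no line endings at all gives 0 versus 0, so the new rule returns false where the old rule returned true.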
