Commit e6937f1

fread: use fill with integer as ncol guess (#5119)
* fread: turn off sampling for fill
* fixed stop
* add stopf
* fread: turn off sampling for fill
* added coverage
* coverage
* revert additional argument
* fill upperbound
* integer as fill argument
* fix typo
* fix L
* add NEWS
* update verbose
* undo verbose
* init cleanup
* fix typo news
* renum NEWS
* add proper cleanup of overallocated columns
* add tests and coverage
* fix tests
* add tests
* cleanup
* update NEWS
* update tests
* Refine NEWS
* use integer for fill
  Co-authored-by: Michael Chirico <[email protected]>
* refine warning
  Co-authored-by: Michael Chirico <[email protected]>
* wording
  Co-authored-by: Michael Chirico <[email protected]>
* test readability
* small tweak to NEWS

---------

Co-authored-by: Michael Chirico <[email protected]>
1 parent 7cab6f1 commit e6937f1

7 files changed: +94 -14 lines changed

NEWS.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@

 5. `transpose` gains `list.cols=` argument, [#5639](https://github.com/Rdatatable/data.table/issues/5639). Use this to return output with list columns and avoids type promotion (an exception is `factor` columns which are promoted to `character` for consistency between `list.cols=TRUE` and `list.cols=FALSE`). This is convenient for creating a row-major representation of a table. Thanks to @MLopez-Ibanez for the request, and Benjamin Schwendinger for the PR.

-4. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works, [#795](https://github.com/Rdatatable/data.table/issues/795) -- one of our [most-requested issues (see #3189)](https://github.com/Rdatatable/data.table/issues/3189). Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for PR.
+6. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works, [#795](https://github.com/Rdatatable/data.table/issues/795) -- one of our [most-requested issues (see #3189)](https://github.com/Rdatatable/data.table/issues/3189). Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for PR.
+
+7. `fread`'s `fill` argument now also accepts an `integer` in addition to boolean values. `fread` always guesses the number of columns based on reading a sample of rows in the file. When `fill=TRUE`, `fread` stops reading and ignores subsequent rows when this estimate winds up too low, e.g. when the sampled rows happen to exclude some rows that are even wider, [#2727](https://github.com/Rdatatable/data.table/issues/2727) [#2691](https://github.com/Rdatatable/data.table/issues/2691) [#4130](https://github.com/Rdatatable/data.table/issues/4130) [#3436](https://github.com/Rdatatable/data.table/issues/3436). Providing an `integer` as argument for `fill` allows for a manual estimate of the number of columns instead, [#1812](https://github.com/Rdatatable/data.table/issues/1812) [#5378](https://github.com/Rdatatable/data.table/issues/5378). Thanks to @jangorecki, @christellacaze, @Yiguan, @alexdthomas, @ibombonato, @Befrancesco, @TobiasGold for reporting/requesting, and Benjamin Schwendinger for the PR.

 ## BUG FIXES
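
Note on the NEWS entry: the column-count guess comes from a sample of rows, so it can miss wider rows that were not sampled. A minimal C sketch of that effect, assuming a simplified sampler that looks only at the first `nsample` lines and ignores quoting. `count_fields` and `guess_ncol` are hypothetical names for illustration, not functions from `src/fread.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Count separator-delimited fields on one line.
   Simplification: no quoting rules, an empty line counts as one field. */
int count_fields(const char *line, char sep) {
    int n = 1;
    for (const char *p = line; *p != '\0' && *p != '\n'; ++p)
        if (*p == sep) ++n;
    return n;
}

/* Widest row among the first `nsample` lines: a stand-in for a sampled guess. */
int guess_ncol(const char *const *lines, size_t nlines, size_t nsample, char sep) {
    assert(lines != NULL);
    int widest = 0;
    for (size_t i = 0; i < nlines && i < nsample; ++i) {
        int n = count_fields(lines[i], sep);
        if (n > widest) widest = n;
    }
    return widest;
}
```

With rows `1,2`, `1,2`, `1,2,3`, a two-line sample yields 2 although the widest row has 3 fields. Passing `fill=3` supplies that missing information by hand.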

R/fread.R

Lines changed: 2 additions & 1 deletion
@@ -22,11 +22,12 @@ yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="UTC")
     stopf("Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'.")
   }
   stopifnot(
-    isTRUEorFALSE(strip.white), isTRUEorFALSE(blank.lines.skip), isTRUEorFALSE(fill), isTRUEorFALSE(showProgress),
+    isTRUEorFALSE(strip.white), isTRUEorFALSE(blank.lines.skip), isTRUEorFALSE(fill) || is.numeric(fill) && length(fill)==1L && fill >= 0L, isTRUEorFALSE(showProgress),
     isTRUEorFALSE(verbose), isTRUEorFALSE(check.names), isTRUEorFALSE(logical01), isTRUEorFALSE(keepLeadingZeros), isTRUEorFALSE(yaml),
     isTRUEorFALSE(stringsAsFactors) || (is.double(stringsAsFactors) && length(stringsAsFactors)==1L && 0.0<=stringsAsFactors && stringsAsFactors<=1.0),
     is.numeric(nrows), length(nrows)==1L
   )
+  fill=as.integer(fill)
   nrows=as.double(nrows) #4686
   if (is.na(nrows) || nrows<0) nrows=Inf # accept -1 to mean Inf, as read.table does
   if (identical(header,"auto")) header=NA

inst/tests/tests.Rraw

Lines changed: 30 additions & 0 deletions
@@ -18389,3 +18389,33 @@ test(2250.12, dt[, names(.SD) := lapply(.SD, \(x) x + b), .SDcols = "a"], data.t

 dt = data.table(a = 1L, b = 2L, c = 3L, d = 4L, e = 5L, f = 6L)
 test(2250.13, dt[, names(.SD)[1:5] := sum(.SD)], data.table(a = 21L, b = 21L, c = 21L, d = 21L, e = 21L, f = 6L))
+
+# fread(...,fill) can also be used to specify a guess on the maximum number of columns #2691 #1812 #4130 #3436 #2727
+dt_str = paste(rep(c("1,2\n", "1,2,3\n"), each=100), collapse="")
+ans = data.table(1L, 2L, rep(c(NA, 3L), each=100L))
+test(2251.01, fread(text = dt_str, fill=FALSE), ans[1:100, -3L], warning=".*Consider fill=TRUE.*")
+test(2251.02, fread(text = dt_str, fill=TRUE), ans[1:100, -3L], warning=".*Consider fill=3.*")
+test(2251.03, fread(text = dt_str, fill=2L), ans[1:100, -3L], warning=".*Consider fill=3.*")
+test(2251.04, fread(text = dt_str, fill=3L), ans)
+test(2251.05, fread(text = dt_str, fill=5L, verbose=TRUE), ans, output="Provided number of fill columns: 5 but only found 3\n Dropping 2 overallocated columns") # user guess slightly too big
+test(2251.06, fread(text = dt_str, fill=1000L), ans) # user guess much too big
+lines = c(
+  "12223, University",
+  "12227, bridge, Sky",
+  "12828, Sunset",
+  "13801, Ground",
+  "14853, Tranceamerica",
+  "14854, San Francisco",
+  "15595, shibuya, Shrine",
+  "16126, fog, San Francisco",
+  "16520, California, ocean, summer, golden gate, beach, San Francisco",
+  "")
+text = paste(lines, collapse="\n")
+test(2251.07, dim(fread(text)), c(6L, 3L), warning=c("fill=TRUE", "Discarded"))
+test(2251.08, dim(fread(text, fill=TRUE)), c(9L, 9L))
+text = paste(lines[c(1:5, 9L, 6:8, 10L)], collapse="\n")
+test(2251.09, dim(fread(text)), c(3L, 3L), warning=c("fill=TRUE", "fill=7"))
+test(2251.10, dim(fread(text, fill=TRUE)), c(9L, 9L))
+test(2251.11, dim(fread(text, fill=7)), c(9L, 9L))
+test(2251.12, dim(fread(text, fill=9)), c(9L, 9L))
+test(2251.13, dim(fread(text, fill=20)), c(9L, 20L)) # clean up currently only kicks in if sep!=' '
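
Note on the expected values in tests 2251.01 to 2251.06: they rest on what "fill" means for a short row. A minimal C sketch of that behaviour, assuming integer-only fields and a comma separator. `parse_row_filled` and `NA_INT` are hypothetical names; the only fact borrowed from R is that `NA_integer_` is stored as `INT_MIN`.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define NA_INT INT_MIN  /* same value R uses to store NA_integer_ */

/* Parse one comma-separated line of integers into out[0..ncol-1].
   Missing trailing fields become NA_INT. Returns the number of fields
   actually present, or -1 if the line is wider than ncol. */
int parse_row_filled(const char *line, int ncol, int *out) {
    assert(line != NULL && out != NULL);
    int found = 0;
    const char *p = line;
    while (*p != '\0' && *p != '\n') {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p) break;              /* not a number: this sketch stops here */
        if (found == ncol) return -1;     /* row is wider than the allocated columns */
        out[found++] = (int)v;
        p = (*end == ',') ? end + 1 : end;
    }
    for (int j = found; j < ncol; ++j) out[j] = NA_INT;  /* fill on the right */
    return found;
}
```

With `ncol` set to 3, the row `1,2` becomes `1, 2, NA`, which matches the shape of `ans` in test 2251.04. With `ncol` set to 2, the row `1,2,3` is reported as too wide, loosely mirroring the "Stopped early" case in tests 2251.02 and 2251.03.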

man/fread.Rd

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="UTC"
   \item{encoding}{ default is \code{"unknown"}. Other possible options are \code{"UTF-8"} and \code{"Latin-1"}. Note: it is not used to re-encode the input, rather enables handling of encoded strings in their native encoding. }
   \item{quote}{ By default (\code{"\""}), if a field starts with a double quote, \code{fread} handles embedded quotes robustly as explained under \code{Details}. If it fails, then another attempt is made to read the field \emph{as is}, i.e., as if quotes are disabled. By setting \code{quote=""}, the field is always read as if quotes are disabled. It is not expected to ever need to pass anything other than \"\" to quote; i.e., to turn it off. }
   \item{strip.white}{ default is \code{TRUE}. Strips leading and trailing whitespaces of unquoted fields. If \code{FALSE}, only header trailing spaces are removed. }
-  \item{fill}{logical (default is \code{FALSE}). If \code{TRUE} then in case the rows have unequal length, blank fields are implicitly filled.}
+  \item{fill}{logical or integer (default is \code{FALSE}). If \code{TRUE} then in case the rows have unequal length, number of columns is estimated and blank fields are implicitly filled. If an integer is provided it is used as an upper bound for the number of columns. }
   \item{blank.lines.skip}{\code{logical}, default is \code{FALSE}. If \code{TRUE} blank lines in the input are ignored.}
   \item{key}{Character vector of one or more column names which is passed to \code{\link{setkey}}. It may be a single comma separated string such as \code{key="x,y,z"}, or a vector of names such as \code{key=c("x","y","z")}. Only valid when argument \code{data.table=TRUE}. Where applicable, this should refer to column names given in \code{col.names}. }
   \item{index}{ Character vector or list of character vectors of one or more column names which is passed to \code{\link{setindexv}}. As with \code{key}, comma-separated notation like \code{index="x,y,z"} is accepted for convenience. Only valid when argument \code{data.table=TRUE}. Where applicable, this should refer to column names given in \code{col.names}. }

src/fread.c

Lines changed: 36 additions & 6 deletions
@@ -55,7 +55,9 @@ static const char* const* NAstrings;
 static bool any_number_like_NAstrings=false;
 static bool blank_is_a_NAstring=false;
 static bool stripWhite=true; // only applies to character columns; numeric fields always stripped
-static bool skipEmptyLines=false, fill=false;
+static bool skipEmptyLines=false;
+static int fill=0;
+static int *dropFill = NULL;

 static double NA_FLOAT64; // takes fread.h:NA_FLOAT64_VALUE

@@ -141,6 +143,7 @@ bool freadCleanup(void)
   free(tmpType); tmpType = NULL;
   free(size); size = NULL;
   free(colNames); colNames = NULL;
+  free(dropFill); dropFill = NULL;
   if (mmp != NULL) {
     // Important to unmap as OS keeps internal reference open on file. Process is not exiting as
     // we're a .so/.dll here. If this was a process exiting we wouldn't need to unmap.
@@ -171,7 +174,7 @@ bool freadCleanup(void)
   stripWhite = true;
   skipEmptyLines = false;
   eol_one_r = false;
-  fill = false;
+  fill = 0;
   // following are borrowed references: do not free
   sof = eof = NULL;
   NAstrings = NULL;
@@ -1618,7 +1621,7 @@ int freadMain(freadMainArgs _args) {
         if (eol(&ch)) ch++;
       }
       firstJumpEnd = ch; // size of first 100 lines in bytes is used later for nrow estimate
-      fill = true; // so that blank lines are read as empty
+      fill = 1; // so that blank lines are read as empty
       ch = pos;
     } else {
       int nseps;
@@ -1750,7 +1753,7 @@ int freadMain(freadMainArgs _args) {
     }
     sep = topSep;
     whiteChar = (sep==' ' ? '\t' : (sep=='\t' ? ' ' : 0));
-    ncol = topNumFields;
+    ncol = fill > topNumFields ? fill : topNumFields; // overwrite user guess if estimated number is higher
     if (fill || sep==127) {
       // leave pos on the first populated line; that is start of data
       ch = pos;
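
Note on the `ncol` assignment in the hunk above: the sampled estimate and the user-supplied `fill` are combined by taking the larger of the two. A minimal C sketch of that rule in isolation; `choose_ncol` is a hypothetical helper, and the encoding of `fill` (0 for `FALSE`, 1 for `TRUE`, larger values for a user guess) is inferred from the `as.integer(fill)` call in `R/fread.R`.

```c
#include <assert.h>

/* fill: 0 = FALSE, 1 = TRUE, >1 = user guess for the number of columns.
   top_num_fields: the estimate obtained from the sampled rows. */
int choose_ncol(int fill, int top_num_fields) {
    assert(fill >= 0);  /* the R wrapper rejects negative values */
    return fill > top_num_fields ? fill : top_num_fields;
}
```

If the sampled estimate is 2, both `fill=TRUE` and `fill=2L` leave `ncol` at 2, which would explain why tests 2251.02 and 2251.03 expect the same truncated result, while `fill=3L` raises it to 3.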
@@ -2125,6 +2128,7 @@ int freadMain(freadMainArgs _args) {
   int nTypeBump=0, nTypeBumpCols=0;
   double tRead=0, tReread=0;
   double thRead=0, thPush=0; // reductions of timings within the parallel region
+  int max_col=0;
   char *typeBumpMsg=NULL; size_t typeBumpMsgSize=0;
   int typeCounts[NUMTYPE]; // used for verbose output; needs populating after first read and before reread (if any) -- see later comment
   #define internalErrSize 1000
@@ -2218,7 +2222,7 @@ int freadMain(freadMainArgs _args) {
     }
     prepareThreadContext(&ctx);

-    #pragma omp for ordered schedule(dynamic) reduction(+:thRead,thPush)
+    #pragma omp for ordered schedule(dynamic) reduction(+:thRead,thPush) reduction(max:max_col)
     for (int jump = jump0; jump < nJumps; jump++) {
       if (stopTeam) continue; // must continue and not break. We desire not to depend on (relatively new) omp cancel directive, yet
       double tLast = 0.0; // thread local wallclock time at last measuring point for verbose mode only.
@@ -2299,6 +2303,7 @@ int freadMain(freadMainArgs _args) {
             tch++;
             j++;
           }
+          if (j > max_col) max_col = j;
           //*** END HOT. START TEPID ***//
           if (tch==tLineStart) {
             skip_white(&tch); // skips \0 before eof
@@ -2310,6 +2315,7 @@ int freadMain(freadMainArgs _args) {
               int8_t thisSize = size[j];
               if (thisSize) ((char **) targets)[thisSize] += thisSize;
               j++;
+              if (j > max_col) max_col = j;
               if (j==ncol) { tch++; myNrow++; continue; } // next line. Back up to while (tch<nextJumpStart). Usually happens, fastest path
             }
             else {
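
Note on `reduction(max:max_col)`: each thread tracks the widest row it has seen in a private copy, and OpenMP merges the copies with `max` when the loop ends. A minimal C sketch of the same pattern on a plain array of per-row field counts. `widest_row` is a hypothetical function, and it uses a combined `parallel for` where the diff adds the clause to an `omp for` inside an existing parallel region.

```c
#include <assert.h>

/* Return the largest per-row field count. With OpenMP enabled, every thread
   keeps a private max_col and the runtime combines them with max at the end.
   Without OpenMP the pragma is ignored and the loop simply runs serially. */
int widest_row(const int *fields_per_row, int nrows) {
    assert(nrows >= 0);
    int max_col = 0;
    #pragma omp parallel for reduction(max:max_col)
    for (int i = 0; i < nrows; ++i) {
        if (fields_per_row[i] > max_col) max_col = fields_per_row[i];
    }
    return max_col;
}
```

The `max` reduction operator for C requires OpenMP 3.1 or later. The result is the same with or without threads, since `max` is associative and commutative.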
@@ -2509,6 +2515,25 @@ int freadMain(freadMainArgs _args) {
   }
   //-- end parallel ------------------

+  // cleanup since fill argument for number of columns was too high
+  if (fill>1 && max_col<ncol && max_col>0) {
+    int ndropFill = ncol - max_col;
+    if (verbose) {
+      DTPRINT(_(" Provided number of fill columns: %d but only found %d\n"), ncol, max_col);
+      DTPRINT(_(" Dropping %d overallocated columns\n"), ndropFill);
+    }
+    dropFill = (int *)malloc((size_t)ndropFill * sizeof(int));
+    int i=0;
+    for (int j=max_col; j<ncol; ++j) {
+      type[j] = CT_DROP;
+      size[j] = 0;
+      ndrop++;
+      nNonStringCols--;
+      dropFill[i++] = j;
+    }
+    dropFilledCols(dropFill, ndropFill);
+  }
+
   if (stopTeam) {
     if (internalErr[0]!='\0') {
       STOP("%s", internalErr); // # nocov
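
Note on the cleanup block above: when the user guess exceeds the widest row actually read, the trailing columns `max_col .. ncol-1` are marked for dropping. A minimal C sketch of just the index computation, under the same guard condition. `overallocated_cols` is a hypothetical helper, and the `NULL` check on `malloc` is an addition of this sketch that does not appear in the hunk.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a malloc'd array holding the column indices in [max_col, ncol)
   and stores its length in *ndrop. Returns NULL (with *ndrop == 0) when
   nothing needs dropping or allocation fails. The caller frees the result. */
int *overallocated_cols(int fill, int ncol, int max_col, int *ndrop) {
    assert(ndrop != NULL);
    *ndrop = 0;
    if (!(fill > 1 && max_col < ncol && max_col > 0)) return NULL;
    int n = ncol - max_col;
    int *drop = malloc((size_t)n * sizeof *drop);
    if (drop == NULL) return NULL;
    for (int j = max_col; j < ncol; ++j) drop[j - max_col] = j;
    *ndrop = n;
    return drop;
}
```

For `fill=5` with only 3 columns found, this yields indices 3 and 4, in line with the verbose output expected by test 2251.05 ("Dropping 2 overallocated columns").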
@@ -2611,8 +2636,13 @@ int freadMain(freadMainArgs _args) {
       else {
         ch = headPos;
         int tt = countfields(&ch);
-        DTWARN(_("Stopped early on line %"PRIu64". Expected %d fields but found %d. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<%s>>"),
+        if (fill>0) {
+          DTWARN(_("Stopped early on line %"PRIu64". Expected %d fields but found %d. Consider fill=%d or even more based on your knowledge of the input file. First discarded non-empty line: <<%s>>"),
+          (uint64_t)DTi+row1line, ncol, tt, tt, strlim(skippedFooter,500));
+        } else {
+          DTWARN(_("Stopped early on line %"PRIu64". Expected %d fields but found %d. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<%s>>"),
         (uint64_t)DTi+row1line, ncol, tt, strlim(skippedFooter,500));
+        }
       }
     }
   }
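
Note on the two warning branches above: once `fill` is already on, suggesting `fill=TRUE` again would not help, so the message proposes the field count of the first discarded line as a concrete value. A minimal C sketch of that selection, with the message text shortened. `format_stop_hint` is a hypothetical helper and uses `snprintf` instead of the `DTWARN` macro.

```c
#include <assert.h>
#include <stdio.h>

/* Write a shortened version of the "Stopped early" hint into buf.
   fill > 0: suggest the observed field count as an explicit fill value.
   fill == 0: suggest turning fill on. Returns snprintf's result. */
int format_stop_hint(char *buf, size_t size, int fill, int ncol, int found) {
    assert(buf != NULL && size > 0);
    if (fill > 0) {
        return snprintf(buf, size,
            "Expected %d fields but found %d. Consider fill=%d or even more.",
            ncol, found, found);
    }
    return snprintf(buf, size,
        "Expected %d fields but found %d. Consider fill=TRUE and comment.char=.",
        ncol, found);
}
```

The suggested number is only a lower bound taken from one line, which is why the real message adds "or even more based on your knowledge of the input file".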

src/fread.h

Lines changed: 9 additions & 2 deletions
@@ -124,8 +124,10 @@ typedef struct freadMainArgs
   bool skipEmptyLines;

   // If True, then rows are allowed to have variable number of columns, and
-  // all ragged rows will be filled with NAs on the right.
-  bool fill;
+  // all ragged rows will be filled with NAs on the right. Supplying integer
+  // argument > 1 results in setting an upper bound estimate for the number
+  // of columns.
+  int fill;

   // If True, then emit progress messages during the parsing.
   bool showProgress;
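
Note on the widened `int fill` field: judging from this commit, the R wrapper calls `as.integer(fill)`, so `FALSE` arrives as 0 and `TRUE` as 1, and the C code tests `fill>0` for the warning text and `fill>1` for the column cleanup. A minimal C sketch of that three-way reading; the enum and `classify_fill` are hypothetical names and do not exist in `fread.h`.

```c
#include <assert.h>

typedef enum {
    FILL_OFF,         /* fill=FALSE (or 0): ragged rows stop the read */
    FILL_ESTIMATED,   /* fill=TRUE (or 1): pad rows, column count from sampling */
    FILL_USER_BOUND   /* fill=n, n > 1: pad rows, n used as the column guess */
} fill_mode;

fill_mode classify_fill(int fill) {
    assert(fill >= 0);  /* negative values are rejected at the R level */
    if (fill == 0) return FILL_OFF;
    if (fill == 1) return FILL_ESTIMATED;
    return FILL_USER_BOUND;
}
```

One consequence of this encoding is that `fill=1L` cannot be told apart from `fill=TRUE`, and `fill=0L` behaves like `fill=FALSE`.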
@@ -348,6 +350,11 @@ void pushBuffer(ThreadLocalFreadParsingContext *ctx);
 void setFinalNrow(size_t nrows);


+/**
+ * Called at the end to delete columns added due to too high user guess for fill.
+ */
+void dropFilledCols(int* dropArg, int ndrop);
+
 /**
  * Free any srtuctures associated with the thread-local parsing context.
  */

src/freadR.c

Lines changed: 13 additions & 3 deletions
@@ -45,7 +45,7 @@ static int64_t dtnrows = 0;
 static bool verbose = false;
 static bool warningsAreErrors = false;
 static bool oldNoDateTime = false;
-
+static int *dropFill;

 SEXP freadR(
   // params passed to freadMain
@@ -82,7 +82,7 @@ SEXP freadR(
   freadMainArgs args;
   ncol = 0;
   dtnrows = 0;
-
+
   if (!isString(inputArg) || LENGTH(inputArg)!=1)
     error(_("Internal error: freadR input not a single character string: a filename or the data itself. Should have been caught at R level.")); // # nocov
   const char *ch = (const char *)CHAR(STRING_ELT(inputArg,0));
@@ -152,7 +152,7 @@ SEXP freadR(
   // here we use bool and rely on fread at R level to check these do not contain NA_LOGICAL
   args.stripWhite = LOGICAL(stripWhiteArg)[0];
   args.skipEmptyLines = LOGICAL(skipEmptyLinesArg)[0];
-  args.fill = LOGICAL(fillArg)[0];
+  args.fill = INTEGER(fillArg)[0];
   args.showProgress = LOGICAL(showProgressArg)[0];
   if (INTEGER(nThreadArg)[0]<1) error(_("nThread(%d)<1"), INTEGER(nThreadArg)[0]);
   args.nth = (uint32_t)INTEGER(nThreadArg)[0];
@@ -533,6 +533,16 @@ void setFinalNrow(size_t nrow) {
   R_FlushConsole(); // # 2481. Just a convenient place; nothing per se to do with setFinalNrow()
 }

+void dropFilledCols(int* dropArg, int ndelete) {
+  dropFill = dropArg;
+  int ndt=length(DT);
+  for (int i=0; i<ndelete; ++i) {
+    SET_VECTOR_ELT(DT, dropFill[i], R_NilValue);
+    SET_STRING_ELT(colNamesSxp, dropFill[i], NA_STRING);
+  }
+  SETLENGTH(DT, ndt-ndelete);
+  SETLENGTH(colNamesSxp, ndt-ndelete);
+}

 void pushBuffer(ThreadLocalFreadParsingContext *ctx)
 {
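
Note on `dropFilledCols`: it clears the listed slots and then shortens both the column list and the names vector by `ndelete`. A plain-C analogue of that shape, without the R API, assuming the dropped indices are exactly the trailing positions (shortening the length would otherwise discard the wrong entries). The `table` struct and `drop_trailing` are hypothetical and only stand in for `DT` and `colNamesSxp`.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the R list of columns plus its names vector. */
typedef struct {
    void **cols;          /* column payloads */
    const char **names;   /* column names */
    int len;              /* logical length of both arrays */
} table;

/* Clear the slots listed in drop[] and shrink the logical length.
   Assumes drop[] holds the last ndrop indices of the table. */
void drop_trailing(table *t, const int *drop, int ndrop) {
    assert(t != NULL && ndrop >= 0 && ndrop <= t->len);
    for (int i = 0; i < ndrop; ++i) {
        t->cols[drop[i]] = NULL;
        t->names[drop[i]] = NULL;
    }
    t->len -= ndrop;
}
```

The memory behind the arrays is not reallocated here; only the logical length changes, which is roughly what a length-shrinking call does for the R vectors in the hunk.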
