Skip to content

Commit f92aee6

Browse files
tdhockshrektanMichaelChirico
authored
dcast only computes default fill if necessary (#5549)
* delete old commented code * new test for no warning fails * only compute default fill if missing cells present * any_NA_int helper * bugfix #5512 * Update src/fcast.c Co-authored-by: Xianying Tan <[email protected]> * Update src/fcast.c Co-authored-by: Xianying Tan <[email protected]> * mention warning text * const int args * add back ithiscol * get pointer before for loop * add test case from Michael * test min(dbl) and no warning when fill specified * Revert "delete old commented code" This reverts commit 2886c4f. * use suggestions from Michael * rm inline any_NA_int since that causes install to fail * clarify comment * link 5390 * mymin test fails * compute some_fill using anyNA in R then pass to C * Update R/fcast.R Co-authored-by: Michael Chirico <[email protected]> * Update R/fcast.R Co-authored-by: Michael Chirico <[email protected]> * dat_for_default_fill is zero-row dt * !length instead of length==0 * new dcast tests with fill=character * dat_for_default_fill is dat again, not 0-row, because that causes some test failure --------- Co-authored-by: Xianying Tan <[email protected]> Co-authored-by: Michael Chirico <[email protected]>
1 parent dbcb656 commit f92aee6

File tree

6 files changed

+49
-22
lines changed

6 files changed

+49
-22
lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,8 @@
2828

2929
3. Optimized `shift` per group produced wrong results when simultaneously subsetting, for example, `DT[i==1L, shift(x), by=group]`, [#5962](https://github.com/Rdatatable/data.table/issues/5962). Thanks to @renkun-ken for the report and Benjamin Schwendinger for the fix.
3030

31+
4. `dcast(fill=NULL)` only computes default fill value if necessary, which eliminates some previous warnings (for example, when fun.aggregate=min or max, warning was NAs introduced by coercion to integer range) which were potentially confusing, [#5512](https://github.com/Rdatatable/data.table/issues/5512), [#5390](https://github.com/Rdatatable/data.table/issues/5390). Thanks to Toby Dylan Hocking for the fix.
32+
3133
## NOTES
3234

3335
1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.

R/fcast.R

Lines changed: 13 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -152,23 +152,22 @@ dcast.data.table = function(data, formula, fun.aggregate = NULL, sep = "_", ...,
152152
dat = .Call(CsubsetDT, dat, idx, seq_along(dat))
153153
}
154154
fun.call = m[["fun.aggregate"]]
155-
fill.default = NULL
156155
if (is.null(fun.call)) {
157156
oo = forderv(dat, by=varnames, retGrp=TRUE)
158157
if (attr(oo, 'maxgrpn', exact=TRUE) > 1L) {
159158
messagef("'fun.aggregate' is NULL, but found duplicate row/column combinations, so defaulting to length(). That is, the variables %s used in 'formula' do not uniquely identify rows in the input 'data'. In such cases, 'fun.aggregate' is used to derive a single representative value for each combination in the output data.table, for example by summing or averaging (fun.aggregate=sum or fun.aggregate=mean, respectively). Check the resulting table for values larger than 1 to see which combinations were not unique. See ?dcast.data.table for more details.", brackify(varnames))
160159
fun.call = quote(length)
161160
}
162161
}
163-
if (!is.null(fun.call)) {
162+
dat_for_default_fill = dat
163+
run_agg_funs = !is.null(fun.call)
164+
if (run_agg_funs) {
164165
fun.call = aggregate_funs(fun.call, lvals, sep, ...)
165-
errmsg = gettext("Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.")
166-
if (is.null(fill)) {
167-
fill.default = suppressWarnings(dat[0L][, eval(fun.call)])
168-
# tryCatch(fill.default <- dat[0L][, eval(fun.call)], error = function(x) stopf(errmsg))
169-
if (nrow(fill.default) != 1L) stopf(errmsg)
166+
maybe_err = function(list.of.columns) {
167+
if (any(lengths(list.of.columns) != 1L)) stopf("Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.")
168+
list.of.columns
170169
}
171-
dat = dat[, eval(fun.call), by=c(varnames)]
170+
dat = dat[, maybe_err(eval(fun.call)), by=c(varnames)]
172171
}
173172
order_ = function(x) {
174173
o = forderv(x, retGrp=TRUE, sort=TRUE)
@@ -211,7 +210,12 @@ dcast.data.table = function(data, formula, fun.aggregate = NULL, sep = "_", ...,
211210
}
212211
maplen = vapply_1i(mapunique, length)
213212
idx = do.call("CJ", mapunique)[map, 'I' := .I][["I"]] # TO DO: move this to C and avoid materialising the Cross Join.
214-
ans = .Call(Cfcast, lhs, val, maplen[[1L]], maplen[[2L]], idx, fill, fill.default, is.null(fun.call))
213+
some_fill = anyNA(idx)
214+
fill.default = if (run_agg_funs && is.null(fill) && some_fill) dat_for_default_fill[, maybe_err(eval(fun.call))]
215+
if (run_agg_funs && is.null(fill) && some_fill) {
216+
fill.default = dat_for_default_fill[0L][, maybe_err(eval(fun.call))]
217+
}
218+
ans = .Call(Cfcast, lhs, val, maplen[[1L]], maplen[[2L]], idx, fill, fill.default, is.null(fun.call), some_fill)
215219
allcols = do.call("paste", c(rhs, sep=sep))
216220
if (length(valnames) > 1L)
217221
allcols = do.call("paste", if (identical(".", allcols)) list(valnames, sep=sep)

inst/tests/tests.Rraw

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3729,6 +3729,21 @@ test(1100, dt1[dt2,roll=-Inf,rollends=c(FALSE,TRUE)]$ind, INT(NA,NA,1,2,2,2,2,2,
37293729
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE), z=sample(letters[1:2],20,TRUE), d1=runif(20), d2=1L)
37303730
test(1102.38, names(dcast(DT, x ~ y + z, fun.aggregate=length, value.var = "d2", sep=".")),
37313731
c("x", "1.a", "1.b", "2.a", "2.b"))
3732+
3733+
# test for #5512, only compute default fill if needed.
3734+
DT = data.table(chr=c("a","b","b"), int=1:3, dbl=as.double(4:6))
3735+
mymin <- function(x){
3736+
if (!length(x)) stop("calling mymin on vector of length 0")
3737+
min(x)
3738+
}
3739+
test(1102.39, dcast(DT, . ~ chr, mymin, value.var="int"), data.table(.=".",a=1L,b=2L,key=".")) # fill not used in output, so default fill not computed.
3740+
ans <- data.table(int=1:3, a=c(1L,NA,NA), b=c(NA,2L,3L), key="int")
3741+
test(1102.40, dcast(DT, int ~ chr, min, value.var="int"), ans, warning=c("no non-missing arguments to min; returning Inf", "inf (type 'double') at RHS position 1 out-of-range(NA) or truncated (precision lost) when assigning to type 'integer' (target vector)")) # warning emitted when coercing default fill since as.integer(min(integer()) is Inf) is NA.
3742+
test(1102.41, dcast(DT, int ~ chr, mymin, value.var="int", fill=NA), ans) # because fill=NA is provided by user, no need to call mymin(integer()).
3743+
test(1102.42, dcast(DT, int ~ chr, min, value.var="dbl"), data.table(int=1:3, a=c(4,Inf,Inf), b=c(Inf,5,6), key="int"), warning="no non-missing arguments to min; returning Inf") # only one warning, because no coercion.
3744+
test(1102.43, dcast(DT, int ~ chr, min, value.var="dbl", fill="coerced to NA"), data.table(int=1:3, a=c(4,NA,NA), b=c(NA,5,6), key="int"), warning=c("Coercing 'character' RHS to 'double' to match the type of target vector.", "NAs introduced by coercion"))
3745+
test(1102.44, dcast(DT, int ~ ., value.var="dbl", fill="ignored"), data.table(int=1:3, .=c(4,5,6), key="int"))
3746+
37323747
}
37333748

37343749
# test for freading commands

man/dcast.data.table.Rd

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@
2222
\item{\dots}{Any other arguments that may be passed to the aggregating function.}
2323
\item{margins}{Not implemented yet. Should take variable names to compute margins on. A value of \code{TRUE} would compute all margins.}
2424
\item{subset}{Specified if casting should be done on a subset of the data. Ex: \code{subset = .(col1 <= 5)} or \code{subset = .(variable != "January")}.}
25-
\item{fill}{Value with which to fill missing cells. If \code{fun.aggregate} is present, takes the value by applying the function on a 0-length vector.}
25+
\item{fill}{Value with which to fill missing cells. If \code{fill=NULL} and missing cells are present, then \code{fun.aggregate} is used on a 0-length vector to obtain a fill value.}
2626
\item{drop}{\code{FALSE} will cast by including all missing combinations.
2727

2828
\code{c(FALSE, TRUE)} will only include all missing combinations of formula \code{LHS}; \code{c(TRUE, FALSE)} will only include all missing combinations of formula RHS. See Examples.}

src/data.table.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -289,7 +289,7 @@ SEXP setlistelt(SEXP, SEXP, SEXP);
289289
SEXP address(SEXP);
290290
SEXP expandAltRep(SEXP);
291291
SEXP fmelt(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
292-
SEXP fcast(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
292+
SEXP fcast(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
293293
SEXP issorted(SEXP, SEXP);
294294
SEXP gforce(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
295295
SEXP gsum(SEXP, SEXP);

src/fcast.c

Lines changed: 17 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
// raise(SIGINT);
55

66
// TO DO: margins
7-
SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fill, SEXP fill_d, SEXP is_agg) {
7+
SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fill, SEXP fill_d, SEXP is_agg, SEXP some_fillArg) {
88
int nrows=INTEGER(nrowArg)[0], ncols=INTEGER(ncolArg)[0];
99
int nlhs=length(lhs), nval=length(val), *idx = INTEGER(idxArg);
1010
SEXP target;
@@ -15,24 +15,28 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
1515
SET_VECTOR_ELT(ans, i, VECTOR_ELT(lhs, i));
1616
}
1717
// get val cols
18+
bool some_fill = LOGICAL(some_fillArg)[0];
1819
for (int i=0; i<nval; ++i) {
1920
const SEXP thiscol = VECTOR_ELT(val, i);
2021
SEXP thisfill = fill;
2122
const SEXPTYPE thistype = TYPEOF(thiscol);
2223
int nprotect = 0;
23-
if (isNull(fill)) {
24-
if (LOGICAL(is_agg)[0]) {
25-
thisfill = PROTECT(allocNAVector(thistype, 1)); nprotect++;
26-
} else thisfill = VECTOR_ELT(fill_d, i);
27-
}
28-
if (isVectorAtomic(thiscol)) { // defer error handling to below, but also skip on list
29-
thisfill = PROTECT(coerceAs(thisfill, thiscol, /*copyArg=*/ScalarLogical(false))); nprotect++;
24+
if(some_fill){
25+
if (isNull(fill)) {
26+
if (LOGICAL(is_agg)[0]) {
27+
thisfill = PROTECT(allocNAVector(thistype, 1)); nprotect++;
28+
} else thisfill = VECTOR_ELT(fill_d, i);
29+
}
30+
if (isVectorAtomic(thiscol)) { // defer error handling to below, but also skip on list
31+
thisfill = PROTECT(coerceAs(thisfill, thiscol, /*copyArg=*/ScalarLogical(false))); nprotect++;
32+
}
3033
}
3134
switch (thistype) {
3235
case INTSXP:
3336
case LGLSXP: {
3437
const int *ithiscol = INTEGER(thiscol);
35-
const int *ithisfill = INTEGER(thisfill);
38+
const int *ithisfill = 0;
39+
if (some_fill) ithisfill = INTEGER(thisfill);
3640
for (int j=0; j<ncols; ++j) {
3741
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
3842
int *itarget = INTEGER(target);
@@ -45,7 +49,8 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
4549
} break;
4650
case REALSXP: {
4751
const double *dthiscol = REAL(thiscol);
48-
const double *dthisfill = REAL(thisfill);
52+
const double *dthisfill = 0;
53+
if (some_fill) dthisfill = REAL(thisfill);
4954
for (int j=0; j<ncols; ++j) {
5055
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
5156
double *dtarget = REAL(target);
@@ -58,7 +63,8 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
5863
} break;
5964
case CPLXSXP: {
6065
const Rcomplex *zthiscol = COMPLEX(thiscol);
61-
const Rcomplex *zthisfill = COMPLEX(thisfill);
66+
const Rcomplex *zthisfill = 0;
67+
if (some_fill) zthisfill = COMPLEX(thisfill);
6268
for (int j=0; j<ncols; ++j) {
6369
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
6470
Rcomplex *ztarget = COMPLEX(target);

0 commit comments

Comments
 (0)