Commit 96e89fa
Allow double-integer64 joins when double is in (integer32 , integer64] range (#6626)
* Allow double-integer64 joins when double is in (integer32 , integer64] range
* rename R-side argument for readability?
* Totally drop isReallyReal, just use isRealReallyInt with flavors for 32/64
* Error: flip result when changing to isRealReallyInt*
* Further simplify -- first* helpers not needed if we just return bool
* logical inversion
* Subtle difference vs isReallyReal (type check)
* Same subtle difference in .prepareFastSubset
* .prepareFastSubset fix
* fix test output
* Add duplicate bug number to NEWS
* amend a new call site for isReallyReal
* fix botched merge
* add codecov test
* add non exported function
* isRealReallyInt -> fitsInInt

---------

Co-authored-by: Benjamin Schwendinger <[email protected]>
1 parent a36caac commit 96e89fa

12 files changed: +90 -64 lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -117,6 +117,8 @@ rowwiseDT(

 15. The auto-printing suppression in `knitr` documents is now done by implementing a method for `knit_print` instead of looking up the call stack, [#6589](https://github.com/Rdatatable/data.table/pull/6589). Thanks to @jangorecki for the report [#6509](https://github.com/Rdatatable/data.table/issues/6509) and @aitap for the fix.

+16. Joins of `integer64` and `double` columns succeed when the `double` column has lossless `integer64` representation, [#4167](https://github.com/Rdatatable/data.table/issues/4167) and [#6625](https://github.com/Rdatatable/data.table/issues/6625). Previously, this only worked when the double column had lossless _32-bit_ integer representation. Thanks @MichaelChirico for the reports and fix.
+
 ## NOTES

 1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).
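The range distinction in NEWS item 16 can be illustrated outside of R. What follows is a minimal C sketch, not data.table's implementation: `fits_in_int64_sketch` is a hypothetical name, and R's `NA` handling is deliberately left out. Under those assumptions, a double converts losslessly when it is finite, has no fractional part, and lies strictly between -2^63 and 2^63.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of data.table): returns true when every
   element of x can be stored in an int64_t without loss.
   Assumption: missing values are not modelled in this sketch. */
static bool fits_in_int64_sketch(const double *x, size_t n) {
  assert(x != NULL || n == 0);
  const double two63 = 9223372036854775808.0; /* 2^63, exactly representable as a double */
  for (size_t i = 0; i < n; i++) {
    if (!isfinite(x[i])) return false;                 /* NaN, +Inf, -Inf */
    /* -2^63 is excluded too, on the assumption that integer64 reserves it for NA */
    if (x[i] <= -two63 || x[i] >= two63) return false; /* outside the usable range */
    if (x[i] != (double)(int64_t)x[i]) return false;   /* fractional part present */
  }
  return true;
}
```

With this sketch, 2147483648.0 (one past the 32-bit maximum) is accepted, while 2147483648.01 is rejected, which mirrors the behaviour the NEWS entry describes.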

R/IDateTime.R

Lines changed: 3 additions & 3 deletions
@@ -99,9 +99,9 @@ round.IDate = function(x, digits=c("weeks", "months", "quarters", "years"), ...)
 # TODO: investigate Ops.IDate method a la Ops.difftime
 if (inherits(e1, "difftime") || inherits(e2, "difftime"))
 internal_error("difftime objects may not be added to IDate, but Ops dispatch should have intervened to prevent this") # nocov
-if (isReallyReal(e1) || isReallyReal(e2)) {
+# IDate doesn't support fractional days; revert to base Date
+if ((is.double(e1) && !fitsInInt32(e1)) || (is.double(e2) && !fitsInInt32(e2))) {
 return(`+.Date`(e1, e2))
-# IDate doesn't support fractional days; revert to base Date
 }
 if (inherits(e1, "Date") && inherits(e2, "Date"))
 stopf("binary + is not defined for \"IDate\" objects")
@@ -120,7 +120,7 @@ round.IDate = function(x, digits=c("weeks", "months", "quarters", "years"), ...)
 if (inherits(e2, "difftime"))
 internal_error("difftime objects may not be subtracted from IDate, but Ops dispatch should have intervened to prevent this") # nocov

-if ( isReallyReal(e2) ) {
+if ( is.double(e2) && !fitsInInt32(e2) ) {
 # IDate deliberately doesn't support fractional days so revert to base Date
 return(base::`-.Date`(as.Date(e1), e2))
 # can't call base::.Date directly (last line of base::`-.Date`) as tried in PR#3168 because

R/bmerge.R

Lines changed: 6 additions & 6 deletions
@@ -4,7 +4,7 @@ mergeType = function(x) {
 ans = typeof(x)
 if (ans=="integer") { if (is.factor(x)) ans = "factor" }
 else if (ans=="double") { if (inherits(x, "integer64")) ans = "integer64" }
-# do not call isReallyReal(x) yet because i) if both types are double we don't need to coerce even if one or both sides
+# do not call fitsInInt*(x) yet because i) if both types are double we don't need to coerce even if one or both sides
 # are int-as-double, and ii) to save calling it until we really need it
 ans
 }
@@ -103,23 +103,23 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
 if (x_merge_type=="integer64" || i_merge_type=="integer64") {
 nm = c(iname, xname)
 if (x_merge_type=="integer64") { w=i; wc=icol; wclass=i_merge_type; } else { w=x; wc=xcol; wclass=x_merge_type; nm=rev(nm) } # w is which to coerce
-if (wclass=="integer" || (wclass=="double" && !isReallyReal(w[[wc]]))) {
-if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which contains no fractions)" else "", nm[2L])
+if (wclass=="integer" || (wclass=="double" && fitsInInt64(w[[wc]]))) {
+if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which has integer64 representation, e.g. no fractions)" else "", nm[2L])
 set(w, j=wc, value=bit64::as.integer64(w[[wc]]))
-} else stopf("Incompatible join types: %s is type integer64 but %s is type double and contains fractions", nm[2L], nm[1L])
+} else stopf("Incompatible join types: %s is type integer64 but %s is type double and cannot be coerced to integer64 (e.g. has fractions)", nm[2L], nm[1L])
 } else {
 # just integer and double left
 ic_idx = which(icol == icols) # check if on is joined on multiple conditions, #6602
 if (i_merge_type=="double") {
 coerce_x = FALSE
-if (!isReallyReal(i[[icol]])) {
+if (fitsInInt32(i[[icol]])) {
 coerce_x = TRUE
 # common case of ad hoc user-typed integers missing L postfix joining to correct integer keys
 # we've always coerced to int and returned int, for convenience.
 if (length(ic_idx)>1L) {
 xc_idx = xcols[ic_idx]
 for (xb in xc_idx[which(vapply_1c(.shallow(x, xc_idx), mergeType) == "double")]) {
-if (isReallyReal(x[[xb]])) {
+if (!fitsInInt32(x[[xb]])) {
 coerce_x = FALSE
 break
 }
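The integer64 branch of `bmerge` above reduces to a small decision: an `integer` column is always coerced, a `double` column is coerced only when `fitsInInt64()` holds, and otherwise the join stops with an error. A rough C sketch of that decision follows. The enum and function names are invented for illustration and do not exist in data.table; the `fits_int64` flag stands in for the result of `fitsInInt64()`, and the sketch assumes earlier code has already narrowed the other column to integer or double.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration only. */
typedef enum { COL_INTEGER, COL_DOUBLE } col_class_sketch;
typedef enum { JOIN_COERCE_TO_INT64, JOIN_INCOMPATIBLE } join_action_sketch;

/* Decide what to do with the non-integer64 side of an integer64 join.
   fits_int64 is only consulted for double columns. */
static join_action_sketch int64_join_action_sketch(col_class_sketch other, bool fits_int64) {
  /* Assumption: other column types were filtered out before this point. */
  assert(other == COL_INTEGER || other == COL_DOUBLE);
  if (other == COL_INTEGER) return JOIN_COERCE_TO_INT64;  /* always lossless */
  return fits_int64 ? JOIN_COERCE_TO_INT64                /* double, but whole and in range */
                    : JOIN_INCOMPATIBLE;                  /* e.g. has fractions */
}
```

Before this commit the double case hinged on a 32-bit check, so a whole-number double above the 32-bit maximum took the error path even though it converts to integer64 without loss.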

R/data.table.R

Lines changed: 3 additions & 3 deletions
@@ -3241,9 +3241,9 @@ is_constantish = function(q, check_singleton=FALSE) {
 if (length(RHS) != nrow(x)) stopf("RHS of %s is length %d which is not 1 or nrow (%d). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %%in%% instead.", operator, length(RHS), nrow(x))
 return(NULL) # DT[colA == colB] regular element-wise vector scan
 }
-if ( mode(x[[col]]) != mode(RHS) || # mode() so that doubleLHS/integerRHS and integerLHS/doubleRHS!isReallyReal are optimized (both sides mode 'numeric')
-is.factor(x[[col]])+is.factor(RHS) == 1L || # but factor is also mode 'numeric' so treat that separately
-is.integer(x[[col]]) && isReallyReal(RHS) ) { # and if RHS contains fractions then don't optimize that as bmerge truncates the fractions to match to the target integer type
+if ( (mode(x[[col]]) != mode(RHS)) || # mode() so that doubleLHS/integerRHS and integerLHS/doubleRHS&fitsInInt32 are optimized (both sides mode 'numeric')
+(is.factor(x[[col]])+is.factor(RHS) == 1L) || # but factor is also mode 'numeric' so treat that separately
+(is.integer(x[[col]]) && is.double(RHS) && !fitsInInt32(RHS)) ) { # and if RHS contains fractions then don't optimize that as bmerge truncates the fractions to match to the target integer type
 # re-direct non-matching type cases to base R, as data.table's binary
 # search based join is strict in types. #957, #961 and #1361
 # the mode() checks also deals with NULL since mode(NULL)=="NULL" and causes this return, as one CRAN package (eplusr 0.9.1) relies on

R/wrappers.R

Lines changed: 2 additions & 2 deletions
@@ -17,7 +17,7 @@ colnamesInt = function(x, cols, check_dups=FALSE, skip_absent=FALSE) .Call(Ccoln

 testMsg = function(status=0L, nx=2L, nk=2L) .Call(CtestMsgR, as.integer(status)[1L], as.integer(nx)[1L], as.integer(nk)[1L])

-isRealReallyInt = function(x) .Call(CisRealReallyIntR, x)
-isReallyReal = function(x) .Call(CisReallyReal, x)
+fitsInInt32 = function(x) .Call(CfitsInInt32R, x)
+fitsInInt64 = function(x) .Call(CfitsInInt64R, x)

 coerceAs = function(x, as, copy=TRUE) .Call(CcoerceAs, x, as, copy)

inst/tests/tests.Rraw

Lines changed: 40 additions & 27 deletions
@@ -56,8 +56,8 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
 internal_error = data.table:::internal_error
 is_na = data.table:::is_na
 is.sorted = data.table:::is.sorted
-isReallyReal = data.table:::isReallyReal
-isRealReallyInt = data.table:::isRealReallyInt
+fitsInInt32 = data.table:::fitsInInt32
+fitsInInt64 = data.table:::fitsInInt64
 is_utc = data.table:::is_utc
 melt.data.table = data.table:::melt.data.table # for test 1953.4
 messagef = data.table:::messagef
@@ -7356,22 +7356,23 @@ if (test_bit64) {
 X = list(a = 4:1, b=runif(4))
 test(1513, setkey(as.data.table(X), a), setDT(X, key="a"))

-# Adding tests for `isReallyReal`
+# Adding tests for `fitsInInt32`
 x = as.numeric(sample(10))
-test(1514.1, isReallyReal(x), 0L)
+test(1514.1, fitsInInt32(x), TRUE)
 x = as.numeric(sample(c(1:5, NA)))
-test(1514.2, isReallyReal(x), 0L) # NAs in numeric can be coerced to integer NA without loss
+test(1514.2, fitsInInt32(x), TRUE) # NAs in numeric can be coerced to integer NA without loss
 x = c(1:2, NaN, NA)
-test(1514.3, isReallyReal(x), 3L)
+test(1514.3, fitsInInt32(x), FALSE)
 x = c(1:2, Inf, NA)
-test(1514.4, isReallyReal(x), 3L)
+test(1514.4, fitsInInt32(x), FALSE)
 x = c(1:2, -Inf, NA)
-test(1514.5, isReallyReal(x), 3L)
+test(1514.5, fitsInInt32(x), FALSE)
 x = runif(2)
-test(1514.6, isReallyReal(x), 1L)
+test(1514.6, fitsInInt32(x), FALSE)
 x = numeric()
-test(1514.7, isReallyReal(x), 0L)
-test(1514.8, isReallyReal(9L), 0L)
+test(1514.7, fitsInInt32(x), TRUE)
+test(1514.8, fitsInInt32(9L), FALSE) # b/c not double input
+test(1514.9, fitsInInt64(9L), FALSE)

 # #1091
 options(datatable.prettyprint.char = 5L)
@@ -15150,19 +15151,19 @@ if (test_bit64) {
 output = "Coercing integer column x.int to type integer64 to match type of i.int64")
 test(2044.67, dt1[dt2, ..cols, on="doubleInt==int64", nomatch=0L, verbose=TRUE],
 ans,
-output = "Coercing double column x.doubleInt (which contains no fractions) to type integer64 to match type of i.int64")
+output = "Coercing double column x.doubleInt (which has integer64 representation, e.g. no fractions) to type integer64 to match type of i.int64")
 test(2044.68, dt1[dt2, ..cols, on="realDouble==int64", nomatch=0L, verbose=TRUE],
-error="Incompatible join types: i.int64 is type integer64 but x.realDouble is type double and contains fractions")
+error="Incompatible join types: i.int64 is type integer64 but x.realDouble is type double and cannot be coerced to integer64")
 # int64 in x
 test(2044.69, dt1[dt2, ..cols, on="int64==int", nomatch=0L, verbose=TRUE],
 ans<-data.table(x.int=1:5, x.doubleInt=as.double(1:5), x.realDouble=c(0.5,1.0,1.5,2.0,2.5), x.int64=as.integer64(1:5),
 i.int=1:5, i.doubleInt=as.double(1:5), i.realDouble=c(0.5,1.0,1.5,2.0,2.5), i.int64=as.integer64(c(1:4, 3000000000))),
 output = "Coercing integer column i.int to type integer64 to match type of x.int64")
 test(2044.70, dt1[dt2, ..cols, on="int64==doubleInt", nomatch=0L, verbose=TRUE],
 ans,
-output = "Coercing double column i.doubleInt (which contains no fractions) to type integer64 to match type of x.int64")
+output = "Coercing double column i.doubleInt (which has integer64 representation, e.g. no fractions) to type integer64 to match type of x.int64")
 test(2044.71, dt1[dt2, ..cols, on="int64==realDouble", nomatch=0L, verbose=TRUE],
-error="Incompatible join types: x.int64 is type integer64 but i.realDouble is type double and contains fractions")
+error="Incompatible join types: x.int64 is type integer64 but i.realDouble is type double and cannot be coerced to integer64")
 }
 # coercion of all-NA
 dt1 = data.table(a=1, b=NA_character_)
@@ -17603,18 +17604,18 @@ test(2204, as.data.table(mtcars, keep.rownames='model', key='model'),

 # 2205 tested nanotime moved to other.Rraw 27, #5516

-# isRealReallyInt, #3966
-test(2206.01, isRealReallyInt(c(-2147483647.0, NA, 0.0, 2147483647.0)), TRUE)
-test(2206.02, isRealReallyInt(2147483648.0), FALSE) # >INT_MAX
-test(2206.03, isRealReallyInt(-2147483648.0), FALSE) # <=INT_MIN since INT_MIN==NA_integer_
-test(2206.04, isRealReallyInt(c(5,-5,2147483648)), FALSE) # test real last position
-test(2206.05, isRealReallyInt(NaN), FALSE)
-test(2206.06, isRealReallyInt(+Inf), FALSE)
-test(2206.07, isRealReallyInt(-Inf), FALSE)
-test(2206.08, isRealReallyInt(0.1), FALSE)
-test(2206.09, isRealReallyInt(numeric()), TRUE)
-test(2206.10, isRealReallyInt(9L), FALSE) # must be type double
-test(2206.11, isRealReallyInt(integer()), FALSE)
+# fitsInInt32, #3966
+test(2206.01, fitsInInt32(c(-2147483647.0, NA, 0.0, 2147483647.0)), TRUE)
+test(2206.02, fitsInInt32(2147483648.0), FALSE) # >INT_MAX
+test(2206.03, fitsInInt32(-2147483648.0), FALSE) # <=INT_MIN since INT_MIN==NA_integer_
+test(2206.04, fitsInInt32(c(5,-5,2147483648)), FALSE) # test real last position
+test(2206.05, fitsInInt32(NaN), FALSE)
+test(2206.06, fitsInInt32(+Inf), FALSE)
+test(2206.07, fitsInInt32(-Inf), FALSE)
+test(2206.08, fitsInInt32(0.1), FALSE)
+test(2206.09, fitsInInt32(numeric()), TRUE)
+test(2206.10, fitsInInt32(9L), FALSE) # must be type double
+test(2206.11, fitsInInt32(integer()), FALSE)

 # dcast supports complex value to cast, #4855
 DT = CJ(x=1:3, y=letters[1:2])
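Tests 2206.01 to 2206.09 pin down the 32-bit boundary cases: `NA` is accepted, `NaN` and the infinities are rejected, and -2147483648 is rejected because that value is R's `NA_integer_`. A rough C sketch of that logic follows. The names `is_r_na_sketch` and `fits_in_int32_sketch` are hypothetical, and the NA test rests on the assumption that R encodes `NA_real_` as a NaN whose low 32 bits equal 1954; data.table itself would rely on R's headers rather than a bit test like this. The type check exercised by 2206.10 and 2206.11 (integer input returns `FALSE`) is not modelled, since the sketch only accepts doubles.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: R's NA_real_ is a NaN carrying 1954 in its low 32 bits. */
static bool is_r_na_sketch(double v) {
  if (!isnan(v)) return false;
  uint64_t bits;
  memcpy(&bits, &v, sizeof bits);               /* inspect the raw bit pattern */
  return (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;
}

/* Hypothetical helper: true when every element is NA or a whole number
   strictly inside (-2^31, 2^31). */
static bool fits_in_int32_sketch(const double *x, size_t n) {
  assert(x != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (is_r_na_sketch(x[i])) continue;                   /* NA maps to NA_integer_ */
    if (!isfinite(x[i])) return false;                    /* NaN, +Inf, -Inf */
    if (x[i] <= -2147483648.0 || x[i] >= 2147483648.0)    /* INT_MIN is reserved for NA */
      return false;
    if (x[i] != (double)(int32_t)x[i]) return false;      /* fractional part present */
  }
  return true;
}
```

An empty input returns true because the loop body never runs, which matches test 2206.09.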
@@ -20664,3 +20665,15 @@ test(2299.09, format_list_item(data.table(a=numeric(), b=numeric())), output="<d
 test(2299.10, data.table(a=1), output="a\n1: *1")
 test(2299.11, data.table(a=list(data.frame(b=1))), output="a\n1: <data.frame[1x1]>")
 test(2299.12, data.table(a=list(data.table(b=1))), output="a\n1: <data.table[1x1]>")
+
+if (test_bit64) {
+# Join to integer64 doesn't require integer32 representation, just integer64, #6625
+i64_val = .Machine$integer.max + 1
+DT1 = data.table(id = as.integer64(i64_val))
+DT2 = data.table(id = i64_val)
+test(2300.1, DT1[DT2, on='id', verbose=TRUE], DT2, output="has integer64 representation")
+test(2300.2, DT2[DT1, on='id', verbose=TRUE], DT1, output="has integer64 representation")
+DT2[, id := id+.01]
+test(2300.3, DT1[DT2, on='id'], error="Incompatible join types")
+test(2300.4, DT2[DT1, on='id'], error="Incompatible join types")
+}

src/between.c

Lines changed: 2 additions & 2 deletions
@@ -30,8 +30,8 @@ SEXP between(SEXP x, SEXP lower, SEXP upper, SEXP incbounds, SEXP NAboundsArg, S
 const bool verbose = GetVerbose();

 if (isInteger(x)) {
-if ((isInteger(lower) || isRealReallyInt(lower)) &&
-(isInteger(upper) || isRealReallyInt(upper))) { // #3517 coerce to num to int when possible
+if ((isInteger(lower) || fitsInInt32(lower)) &&
+(isInteger(upper) || fitsInInt32(upper))) { // #3517 coerce to num to int when possible
 if (!isInteger(lower)) {
 lower = PROTECT(coerceVector(lower, INTSXP)); nprotect++;
 }

src/data.table.h

Lines changed: 4 additions & 3 deletions
@@ -243,9 +243,10 @@ SEXP coalesce(SEXP x, SEXP inplace);
 // utils.c
 bool within_int32_repres(double x);
 bool within_int64_repres(double x);
-bool isRealReallyInt(SEXP x);
-SEXP isRealReallyIntR(SEXP x);
-SEXP isReallyReal(SEXP x);
+bool fitsInInt32(SEXP x);
+SEXP fitsInInt32R(SEXP x);
+bool fitsInInt64(SEXP x);
+SEXP fitsInInt64R(SEXP x);
 bool allNA(SEXP x, bool errorForBadType);
 SEXP colnamesInt(SEXP x, SEXP cols, SEXP check_dups, SEXP skip_absent);
 bool INHERITS(SEXP x, SEXP char_);

src/forder.c

Lines changed: 3 additions & 3 deletions
@@ -595,12 +595,12 @@ SEXP forder(SEXP DT, SEXP by, SEXP retGrpArg, SEXP retStatsArg, SEXP sortGroupsA
 if (INHERITS(x, char_integer64)) {
 range_i64((int64_t *)REAL(x), nrow, &min, &max, &na_count);
 } else {
-if (verbose && INHERITS(x, char_Date) && INTEGER(isReallyReal(x))[0]==0) {
-Rprintf(_("\n*** Column %d passed to forder is a date stored as an 8 byte double but no fractions are present. Please consider a 4 byte integer date such as IDate to save space and time.\n"), col+1);
-// Note the (slightly expensive) isReallyReal will only run when verbose is true. Prefix '***' just to make it stand out in verbose output
+if (verbose && INHERITS(x, char_Date) && fitsInInt32(x)) {
+// Note the (slightly expensive) fitsInInt32 will only run when verbose is true. Prefix '***' just to make it stand out in verbose output
 // In future this could be upgraded to option warning. But I figured that's what we use verbose to do (to trace problems and look for efficiencies).
 // If an automatic coerce is desired (see discussion in #1738) then this is the point to do that in this file. Move the INTSXP case above to be
 // next, do the coerce of Date to integer now to a tmp, and then let this case fall through to INTSXP in the same way as CPLXSXP falls through to REALSXP.
+Rprintf(_("\n*** Column %d passed to forder is a date stored as an 8 byte double but no fractions are present. Please consider a 4 byte integer date such as IDate to save space and time.\n"), col+1);
 }
 range_d(REAL(x), nrow, &min, &max, &na_count, &infnan_count);
 if (min==0 && na_count<nrow) { min=3; max=4; } // column contains no finite numbers and is not-all NA; create dummies to yield positive min-2 later

src/frollR.c

Lines changed: 1 addition & 1 deletion
@@ -229,7 +229,7 @@ SEXP frollapplyR(SEXP fun, SEXP obj, SEXP k, SEXP fill, SEXP align, SEXP rho) {

 if (!isInteger(k)) {
 if (isReal(k)) {
-if (isRealReallyInt(k)) {
+if (fitsInInt32(k)) {
 SEXP ik = PROTECT(coerceVector(k, INTSXP)); protecti++;
 k = ik;
 } else {
