
Commit ded8afb

Merge branch 'master' into sritchie73-non-equi-key
2 parents b5bfc39 + afb99aa commit ded8afb

9 files changed: +177 -49 lines changed

NEWS.md

Lines changed: 6 additions & 0 deletions
@@ -19,14 +19,20 @@

 2. `fread()` can now read a remote compressed file in one step; `fread("https://domain.org/file.csv.bz2")`. The `file=` argument now supports `.gz` and `.bz2` too; i.e. `fread(file="file.csv.gz")` works now where only `fread("file.csv.gz")` worked in 1.11.8.

+2. `nomatch=NULL` now does the same as `nomatch=0L`; i.e. discards missing values silently (inner join). The default is still `nomatch=NA` (outer join) for statistical safety so that missing values are retained by default. You have to explicitly write `nomatch=NULL` to indicate to the reader of your code that you intend to discard missing values silently. After several years have elapsed, we will start to deprecate `0L`; please start using `NULL`. TO DO ... `nomatch=.(0)` fills with `0` instead of `NA`, [#857](https://github.com/Rdatatable/data.table/issues/857) and `nomatch="error"`.
+
 #### BUG FIXES

 1. Providing an `i` subset expression when attempting to delete a column correctly failed with helpful error, but when the column was missing too created a new column full of `NULL` values, [#3089](https://github.com/Rdatatable/data.table/issues/3089). Thanks to Michael Chirico for reporting.

+2. Column names that look like expressions (e.g. `"a<=colB"`) caused an error when used in `on=` even when wrapped with backticks, [#3092](https://github.com/Rdatatable/data.table/issues/3092). Additionally, `on=` now supports white spaces around operators; e.g. `on = "colA == colB"`. Thanks to @mt1022 for reporting and to @MarkusBonsch for fixing.
+
 #### NOTES

 1. When data.table first loads it now checks the DLL's MD5. This is to detect installation issues on Windows when you upgrade and i) the DLL is in use by another R session and ii) the CRAN source version > CRAN binary binary which happens just after a new release (R prompts users to install from source until the CRAN binary is available). This situation can lead to a state where the package's new R code calls old C code in the old DLL; [R#17478](https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17478), [#3056](https://github.com/Rdatatable/data.table/issues/3056). This broken state can persist until, hopefully, you experience a strange error caused by the mismatch. Otherwise, wrong results may occur silently. This situation applies to any R package with compiled code not just data.table, is Windows-only, and is long-standing. It has only recently been understood as it typically only occurs during the few days after each new release until binaries are available on CRAN. Thanks to Gabor Csardi for the suggestion to use `tools::checkMD5sums()`.

+2. When `on=` is provided but not `i=`, a helpful error is now produced rather than silently ignoring `on=`. Thanks to Dirk Eddelbuettel for the idea.
+

 ### Changes in v1.11.8
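The `nomatch=NULL` entry above describes the difference between keeping and discarding unmatched rows. The following is a minimal base-R sketch of that idea only; it does not use data.table and is not how data.table implements joins. The names `x`, `i_keys`, `outer_like` and `inner_like` are illustrative assumptions.

```r
# Base-R sketch: emulate what each nomatch value means for a join x[i].
x <- data.frame(a = 1:3, b = 2:4)
i_keys <- 2:4                    # key 4 has no match in x$a

# Like nomatch=NA (the default): unmatched rows of i are kept, filled with NA.
idx_na <- match(i_keys, x$a)                  # 2 3 NA
outer_like <- x[idx_na, ]

# Like nomatch=NULL (or the older 0L): unmatched rows of i are dropped,
# because an index of 0 selects nothing.
idx_0 <- match(i_keys, x$a, nomatch = 0L)     # 2 3 0
inner_like <- x[idx_0, ]

nrow(outer_like)   # 3 rows, the last one all NA
nrow(inner_like)   # 2 rows
```

Writing `nomatch=NULL` explicitly in a join makes the row-dropping behaviour visible to whoever reads the code, which is the motivation given in the NEWS entry.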

R/data.table.R

Lines changed: 109 additions & 38 deletions
@@ -247,17 +247,22 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
   if (length(rollends)==1L) rollends=rep.int(rollends,2L)
   # TO DO (document/faq/example). Removed for now ... if ((roll || rolltolast) && missing(mult)) mult="last" # for when there is exact match to mult. This does not control cases where the roll is mult, that is always the last one.
   missingnomatch = missing(nomatch)
-  if (!is.na(nomatch) && nomatch!=0L) stop("nomatch must either be NA or 0, or (ideally) NA_integer_ or 0L")
+  if (is.null(nomatch)) nomatch = 0L # allow nomatch=NULL API already now, part of: https://github.com/Rdatatable/data.table/issues/857
+  if (!is.na(nomatch) && nomatch!=0L) stop("nomatch= must be either NA or NULL (or 0 for backwards compatibility which is the same as NULL)")
   nomatch = as.integer(nomatch)
-  if (!is.logical(which) || length(which)>1L) stop("'which' must be a logical vector length 1. Either FALSE, TRUE or NA.")
-  if ((isTRUE(which)||is.na(which)) && !missing(j)) stop("'which' is ",which," (meaning return row numbers) but 'j' is also supplied. Either you need row numbers or the result of j, but only one type of result can be returned.")
+  if (!is.logical(which) || length(which)>1L) stop("which= must be a logical vector length 1. Either FALSE, TRUE or NA.")
+  if ((isTRUE(which)||is.na(which)) && !missing(j)) stop("which==",which," (meaning return row numbers) but j is also supplied. Either you need row numbers or the result of j, but only one type of result can be returned.")
   if (!is.na(nomatch) && is.na(which)) stop("which=NA with nomatch=0 would always return an empty vector. Please change or remove either which or nomatch.")
   .global$print=""
   if (missing(i) && missing(j)) {
     # ...[] == oops at console, forgot print(...)
     # or some kind of dynamic construction that has edge case of no contents inside [...]
+    if (nargs()>2L) # 2 is minimum: 1) method name, 2) x
+      stop("When i and j are both missing, no other argument should be used. Empty [] is useful after := to have the result printed.")
     return(x)
   }
+  if (!with && missing(j)) stop("j must be provided when with=FALSE")
+  if (missing(i) && !missing(on)) stop("i must be provided when on= is provided")
   if (!missing(keyby)) {
     if (!missing(by)) stop("Provide either 'by' or 'keyby' but not both")
     by=bysub=substitute(keyby)
@@ -275,7 +280,6 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
   notjoin = FALSE
   rightcols = leftcols = integer()
   optimizedSubset = FALSE ## flag: tells, whether a normal query was optimized into a join.
-  if (!with && missing(j)) stop("j must be provided when with=FALSE")
   ..syms = NULL
   av = NULL
   jsub = NULL
@@ -484,40 +488,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
   }
   if (!missing(on)) {
     # on = .() is now possible, #1257
-    parse_on <- function(onsub) {
-      ops = c("==", "<=", "<", ">=", ">", "!=")
-      pat = paste0("(", ops, ")", collapse="|")
-      if (is.call(onsub) && onsub[[1L]] == "eval") {
-        onsub = eval(onsub[[2L]], parent.frame(2L), parent.frame(2L))
-        if (is.call(onsub) && onsub[[1L]] == "eval") onsub = onsub[[2L]]
-      }
-      if (is.call(onsub) && as.character(onsub[[1L]]) %in% c("list", ".")) {
-        spat = paste0("[ ]+(", pat, ")[ ]+")
-        onsub = lapply(as.list(onsub)[-1L], function(x) gsub(spat, "\\1", deparse(x, width.cutoff=500L)))
-        onsub = as.call(c(quote(c), onsub))
-      }
-      on = eval(onsub, parent.frame(2L), parent.frame(2L))
-      if (!is.character(on))
-        stop("'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
-      this_op = regmatches(on, gregexpr(pat, on))
-      idx = (vapply(this_op, length, 0L) == 0L)
-      this_op[idx] = "=="
-      this_op = unlist(this_op, use.names=FALSE)
-      idx_op = match(this_op, ops, nomatch=0L)
-      if (any(idx_op %in% c(0L, 6L)))
-        stop("Invalid operators ", paste(this_op[idx_op==0L], collapse=","), ". Only allowed operators are ", paste(ops[1:5], collapse=""), ".")
-      if (is.null(names(on))) {
-        on[idx] = if (isnull_inames) paste(on[idx], paste0("V", seq_len(sum(idx))), sep="==") else paste(on[idx], on[idx], sep="==")
-      } else {
-        on[idx] = paste(names(on)[idx], on[idx], sep="==")
-      }
-      split = tstrsplit(on, paste0("[ ]*", pat, "[ ]*"))
-      on = setattr(split[[2L]], 'names', split[[1L]])
-      if (length(empty_idx <- which(names(on) == "")))
-        names(on)[empty_idx] = on[empty_idx]
-      list(on = on, ops = idx_op)
-    }
-    on_ops = parse_on(substitute(on))
+    on_ops = .parse_on(substitute(on), isnull_inames)
     on = on_ops[[1L]]
     ops = on_ops[[2L]]
     # TODO: collect all '==' ops first to speeden up Cnestedid
@@ -3110,3 +3081,103 @@ isReallyReal <- function(x) {
     )
   )
 }
+
+
+.parse_on <- function(onsub, isnull_inames) {
+  ## helper that takes the 'on' string(s) and extracts comparison operators and column names from it.
+  #' @param onsub the substituted on
+  #' @param isnull_inames bool; TRUE if i has no names.
+  #' @return List with two entries:
+  #' 'on' : character vector providing the column names for the join.
+  #' Names correspond to columns in x, entries correspond to columns in i
+  #' 'ops': integer vector. Gives the indices of the operators that connect the columns in x and i.
+  ops = c("==", "<=", "<", ">=", ">", "!=")
+  pat = paste0("(", ops, ")", collapse="|")
+  if (is.call(onsub) && onsub[[1L]] == "eval") {
+    onsub = eval(onsub[[2L]], parent.frame(2L), parent.frame(2L))
+    if (is.call(onsub) && onsub[[1L]] == "eval") { onsub = onsub[[2L]] }
+  }
+  if (is.call(onsub) && as.character(onsub[[1L]]) %in% c("list", ".")) {
+    spat = paste0("[ ]+(", pat, ")[ ]+")
+    onsub = lapply(as.list(onsub)[-1L], function(x) gsub(spat, "\\1", deparse(x, width.cutoff=500L)))
+    onsub = as.call(c(quote(c), onsub))
+  }
+  on = eval(onsub, parent.frame(2L), parent.frame(2L))
+  if (!is.character(on))
+    stop("'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
+  ## extract the operators and potential variable names from 'on'.
+  ## split at backticks to take care about variable names like `col1<=`.
+  pieces <- strsplit(on, "(?=[`])", perl = TRUE)
+  xCols <- character(length(on))
+  ## if 'on' is named, the names are the xCols for sure
+  if(!is.null(names(on))){
+    xCols <- names(on)
+  }
+  iCols <- character(length(on))
+  operators <- character(length(on))
+  ## loop over the elements and extract operators and column names.
+  for(i in seq_along(pieces)){
+    thisCols <- character(0)
+    thisOperators <- character(0)
+    j <- 1
+    while(j <= length(pieces[[i]])){
+      if(pieces[[i]][j] == "`"){
+        ## start of a variable name with backtick.
+        thisCols <- c(thisCols, pieces[[i]][j+1])
+        j <- j+3 # +1 is the column name, +2 is delimiting "`", +3 is next relevant entry.`
+      } else {
+        ## no backtick
+        ## search for operators
+        thisOperators <- c(thisOperators,
+                           unlist(regmatches(pieces[[i]][j], gregexpr(pat, pieces[[i]][j])),
+                                  use.names = FALSE))
+        ## search for column names
+        thisCols <- c(thisCols, trimws(strsplit(pieces[[i]][j], pat)[[1]]))
+        ## there can be empty string column names because of trimws, remove them
+        thisCols <- thisCols[thisCols != ""]
+        j <- j+1
+      }
+    }
+    if (length(thisOperators) == 0) {
+      ## if no operator is given, it must be ==
+      operators[i] <- "=="
+    } else if (length(thisOperators) == 1) {
+      operators[i] <- thisOperators
+    } else {
+      ## multiple operators found in one 'on' part. Something is wrong.
+      stop("Found more than one operator in one 'on' statement: ", on[i], ". Please specify a single operator.")
+    }
+    if (length(thisCols) == 2){
+      ## two column names found, first is xCol, second is iCol for sure
+      xCols[i] <- thisCols[1]
+      iCols[i] <- thisCols[2]
+    } else if (length(thisCols) == 1){
+      ## a single column name found. Can mean different things
+      if(xCols[i] != ""){
+        ## xCol is given by names(on). thisCols must be iCol
+        iCols[i] <- thisCols[1]
+      } else if (isnull_inames){
+        ## i has no names. It will be given the names V1, V2, ... automatically.
+        ## The single column name is the x column. It will match to the ith column in i.
+        xCols[i] <- thisCols[1]
+        iCols[i] <- paste0("V", i)
+      } else {
+        ## i has names and one single column name is given by on.
+        ## This means that xCol and iCol have the same name.
+        xCols[i] <- thisCols[1]
+        iCols[i] <- thisCols[1]
+      }
+    } else if (length(thisCols) == 0){
+      stop("'on' contains no column name: ", on[i], ". Each 'on' clause must contain one or two column names.")
+    } else {
+      stop("'on' contains more than 2 column names: ", on[i], ". Each 'on' clause must contain one or two column names.")
+    }
+  }
+  idx_op = match(operators, ops, nomatch=0L)
+  if (any(idx_op %in% c(0L, 6L)))
+    stop("Invalid operators ", paste(operators[idx_op %in% c(0L, 6L)], collapse=","), ". Only allowed operators are ", paste(ops[1:5], collapse=""), ".")
+  ## the final on will contain the xCol as name, the iCol as value
+  on <- iCols
+  names(on) <- xCols
+  return(list(on = on, ops = idx_op))
+}
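The central step in `.parse_on` is the split at backticks with a zero-width lookahead, which is what lets a quoted column name contain operator characters. The trace below is a reduced base-R sketch of that step for a single clause, reusing the `ops` and `pat` definitions from the function above. The variable names `on_clause`, `cols` and `found_ops` are illustrative and not part of the commit, and the sketch skips the handling of unquoted names and named `on` vectors.

```r
ops <- c("==", "<=", "<", ">=", ">", "!=")
pat <- paste0("(", ops, ")", collapse = "|")

# One on= clause whose two column names both contain ">=".
on_clause <- "`counts(a>=0)`==` weirdName>=`"

# Splitting with a lookahead leaves every backtick as its own piece,
# so a quoted name is always the piece right after an opening backtick.
pieces <- strsplit(on_clause, "(?=[`])", perl = TRUE)[[1]]

cols <- character(0)
found_ops <- character(0)
j <- 1L
while (j <= length(pieces)) {
  if (pieces[j] == "`") {
    cols <- c(cols, pieces[j + 1L])   # quoted name, taken verbatim
    j <- j + 3L                       # skip the name and the closing backtick
  } else {
    # operators are only searched for outside backticks
    found_ops <- c(found_ops,
                   unlist(regmatches(pieces[j], gregexpr(pat, pieces[j]))))
    j <- j + 1L
  }
}

cols        # the two quoted names, including the leading space in the second
found_ops   # the single operator that sits between them
```

Because the `>=` inside each quoted name is never passed to the operator regex, only the `==` between the names is counted, so the "more than one operator" error is not triggered for this clause.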

R/setops.R

Lines changed: 2 additions & 2 deletions
@@ -65,10 +65,10 @@ fintersect <- function(x, y, all=FALSE) {
     x = shallow(x)[, ".seqn" := rowidv(x)]
     y = shallow(y)[, ".seqn" := rowidv(y)]
     jn.on = c(".seqn",setdiff(names(x),".seqn"))
-    x[y, .SD, .SDcols=setdiff(names(x),".seqn"), nomatch=0L, on=jn.on]
+    x[y, .SD, .SDcols=setdiff(names(x),".seqn"), nomatch=NULL, on=jn.on]
   } else {
     z = funique(y) # fixes #3034. When .. prefix in i= is implemented (TODO), this can be x[funique(..y), on=, multi=]
-    x[z, nomatch=0L, on=names(x), mult="first"]
+    x[z, nomatch=NULL, on=names(x), mult="first"]
   }
 }

inst/tests/tests.Rraw

Lines changed: 51 additions & 0 deletions
@@ -12278,6 +12278,57 @@ DT = data.table(A=1:5)
 test(1947.1, DT[A<0, c('A','B'):=.(NULL, A)], error="When deleting columns, i should not be provided")
 test(1947.2, DT, data.table(A=1:5))

+## tests for backticks and spaces in column names of on=, #2931
+DT <- data.table(id = 1:3, `counts(a>=0)` = 1:3, sameName = 1:3)
+i <- data.table(idi = 1:3, ` weirdName>=` = 1:3, sameName = 1:3)
+## test white spaces around operator
+test(1948.1, DT[i, on = "id >= idi"], DT[i, on = "id>=idi"])
+test(1948.2, DT[i, on = "id>= idi"], DT[i, on = "id>=idi"])
+test(1948.3, DT[i, on = "id >=idi"], DT[i, on = "id>=idi"])
+## test column names containing operators
+test(1948.4, setnames(DT[i, on = "id>=` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
+     DT[i, on = "id>=idi"])
+test(1948.5, setnames(DT[i, on = "id>=` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
+     DT[i, on = "id>=idi"])
+test(1948.6, setnames(DT[i, on = "id >= ` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
+     DT[i, on = "id>=idi"])
+test(1948.7, setnames(DT[i, on = "`counts(a>=0)`==` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
+     DT[i, on = "id==idi"])
+## mixed example
+test(1948.8, DT[i, on = c( id = "idi", "sameName", "`counts(a>=0)`==` weirdName>=`")], DT[i, on = "id==idi", c("id", "counts(a>=0)", "sameName")])
+## testing 'eval' in on clause
+test(1948.9, DT[i, on = eval(eval("id<=idi"))], DT[i, on = "id<=idi"])
+## testing for errors
+test(1948.11, DT[i, on = ""], error = "'on' contains no column name: . Each 'on' clause must contain one or two column names.")
+test(1948.12, DT[i, on = "id>=idi>=1"], error = "Found more than one operator in one 'on' statement: id>=idi>=1. Please specify a single operator.")
+test(1948.13, DT[i, on = "`id``idi`<=id"], error = "'on' contains more than 2 column names: `id``idi`<=id. Each 'on' clause must contain one or two column names.")
+test(1948.14, DT[i, on = "id != idi"], error = "Invalid operators !=. Only allowed operators are ==<=<>=>.")
+test(1948.15, DT[i, on = 1L], error = "'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
+
+# helpful error when on= is provided but not i, rather than silently ignoring on=
+test(1949.1, DT[,,on=A], error="When i and j are both missing, no other argument should be used.")
+test(1949.2, DT[,1,on=A], error="i must be provided when on= is provided")
+test(1949.3, DT[1,,with=FALSE], error="j must be provided when with=FALSE")
+
+if (test_bit64) {
+  # explicit coverage of 2-column real case in uniqlist. Keeps coming up in codecov checks in PRs that don't touch uniqlist.c
+  DT = data.table(id=c("A","A","B","B","B"), v=as.integer64(c(1,2,3,3,4)))
+  test(1950, uniqlist(DT), INT(1,2,3,5))
+}
+
+# allow nomatch=NULL to work same as nomatch=0L, #857
+d1 = data.table(a=1:3, b=2:4)
+d2 = data.table(a=2:4, b=3:5)
+test(1951.1, d1[d2, on="a", nomatch=NULL], d1[d2, on="a", nomatch=0L])
+test(1951.2, d1[d2, on="b", nomatch=NULL], d1[d2, on="b", nomatch=0L])
+test(1951.3, d1[d2, on=c("a","b"), nomatch=NULL], d1[d2, on=c("a","b"), nomatch=0L])
+test(1951.4, d1[d2, nomatch=3], error="nomatch= must be either NA or NULL .or 0 for backwards compatibility")
+
+# coverage of which= checks
+test(1952.1, d1[a==2, which=3], error="which= must be a logical vector length 1. Either FALSE, TRUE or NA.")
+test(1952.2, d1[a==2, 2, which=TRUE], error="which==TRUE.*but j is also supplied")
+
+
 #
 # gap in test number for now to avoid merge conflicts. Matt will remove gap when PR merged.
 #
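Test 1951.4 writes `.or 0` where the actual message says `(or 0`. A plausible reading, assuming `error=` is matched as a regular expression (which the `.*` in test 1952.2 suggests), is that `.` stands in for the literal parenthesis so the pattern stays valid. The base-R sketch below illustrates that assumption with `grepl()`; it does not call data.table's `test()` helper, and `msg`, `pattern` and `res` are illustrative names.

```r
# The message raised by the new nomatch= check in this commit.
msg <- paste0("nomatch= must be either NA or NULL (or 0 for backwards ",
              "compatibility which is the same as NULL)")

# The pattern used in test 1951.4: "." matches any character, here the "(".
pattern <- "nomatch= must be either NA or NULL .or 0 for backwards compatibility"
grepl(pattern, msg)

# An unescaped "(" opens a group that is never closed, so the regex is invalid.
res <- suppressWarnings(
  tryCatch(grepl("NULL (or 0", msg), error = function(e) "regex error")
)
res

# Matching the same text literally works without any escaping.
grepl("NULL (or 0", msg, fixed = TRUE)
```

Escaping the parenthesis as `\\(` would be the stricter alternative to `.`, at the cost of a less readable test string.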

man/data.table.Rd

Lines changed: 4 additions & 4 deletions
@@ -117,7 +117,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac

 When \code{j} is a character vector of column names, a numeric vector of column positions to select or of the form \code{startcol:endcol}, and the value returned is always a \code{data.table}. \code{with=FALSE} is not necessary anymore to select columns dynamically. Note that \code{x[, cols]} is equivalent to \code{x[, ..cols]} and to \code{x[, cols, with=FALSE]} and to \code{x[, .SD, .SDcols=cols]}.}

-\item{nomatch}{ Same as \code{nomatch} in \code{\link{match}}. When a row in \code{i} has no match to \code{x}, \code{nomatch=NA} (default) means \code{NA} is returned. \code{0} means no rows will be returned for that row of \code{i}. Use \code{options(datatable.nomatch=0)} to change the default value (used when \code{nomatch} is not supplied).}
+\item{nomatch}{ Same as \code{nomatch} in \code{\link{match}}. When a row in \code{i} has no match to \code{x}, \code{nomatch=NA} (default) means \code{NA} is returned. \code{NULL} (or \code{0} for backward compatibility) means no rows will be returned for that row of \code{i}. Use \code{options(datatable.nomatch=NULL)} to change the default value (used when \code{nomatch} is not supplied).}

 \item{mult}{ When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}) and \emph{multiple} rows in \code{x} match to the row in \code{i}, \code{mult} controls which are returned: \code{"all"} (default), \code{"first"} or \code{"last"}.}

@@ -289,7 +289,7 @@ DT[x!="b" | y!=3] # not yet optimized, currently vector scan subset
 DT[.("b", 3), on=c("x", "y")] # join on columns x,y of DT; uses binary search (fast)
 DT[.("b", 3), on=.(x, y)] # same, but using on=.()
 DT[.("b", 1:2), on=c("x", "y")] # no match returns NA
-DT[.("b", 1:2), on=.(x, y), nomatch=0] # no match row is not returned
+DT[.("b", 1:2), on=.(x, y), nomatch=NULL] # no match row is not returned
 DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, nomatch row gets rolled by previous row
 DT[.("b", 1:2), on=.(x, y), roll=-Inf] # nocb, nomatch row gets rolled by next row
 DT["b", sum(v*y), on="x"] # on rows where DT$x=="b", calculate sum(v*y)
@@ -306,7 +306,7 @@ X

 DT[X, on="x"] # right join
 X[DT, on="x"] # left join
-DT[X, on="x", nomatch=0] # inner join
+DT[X, on="x", nomatch=NULL] # inner join
 DT[!X, on="x"] # not join
 DT[X, on=c(y="v")] # join using column "y" of DT with column "v" of X
 DT[X, on="y==v"] # same as above (v1.9.8+)
@@ -353,7 +353,7 @@ kDT[.("a")] # same, .() is an alias for list()
 kDT[list("a")] # same
 kDT[.("a", 3)] # join to 2 columns
 kDT[.("a", 3:6)] # join 4 rows (2 missing)
-kDT[.("a", 3:6), nomatch=0] # remove missing
+kDT[.("a", 3:6), nomatch=NULL] # remove missing
 kDT[.("a", 3:6), roll=TRUE] # locf rolling join
 kDT[.("a", 3:6), roll=Inf] # same as above
 kDT[.("a", 3:6), roll=-Inf] # nocb rolling join
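The revised `nomatch` documentation suggests `options(datatable.nomatch=NULL)`. One base-R detail is worth keeping in mind when reading that line: assigning `NULL` to an option removes it instead of storing a `NULL` value. The sketch below shows this with a made-up option name, `demo.nomatch`, so that no real setting is touched. How data.table reads `datatable.nomatch` is not shown in this diff, so whether the suggested call has the documented effect is left open here.

```r
# "demo.nomatch" is a hypothetical option name used only for illustration.
options(demo.nomatch = 0L)
getOption("demo.nomatch")              # the stored value

options(demo.nomatch = NULL)           # this removes the option entirely
is.null(getOption("demo.nomatch"))     # TRUE: indistinguishable from "never set"
"demo.nomatch" %in% names(options())   # FALSE

# A caller that wants NA as its fallback has to ask for it explicitly.
getOption("demo.nomatch", NA)
```

Passing `nomatch=NULL` directly in the call, as the updated examples above do, does not depend on how the option is looked up.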
