Commit 67670e9

cbindlist, mergelist (#4370)
* cbindlist
* add cbind by reference, timing
* R prototype of mergelist
* wording
* use lower overhead funs
* stick to int32 for now, correct R_alloc
* bmerge C refactor for codecov and one loop for speed
* address revealed codecov gaps
* refactor vecseq for codecov
* seqexp helper, some alloccol export on C
* bmerge codecov, types handled in R bmerge already
* better comment seqexp
* bmerge mult=error #655
* multiple new C utils
* swap if branches
* explain new C utils
* comments mostly
* reduce conflicts to PR #4386
* comment C code
* address multiple matches during update-on-join #3747
* Revert "address multiple matches during update-on-join #3747"
  This reverts commit b64c0c3.
* merge.dt has temporarily mult arg, for testing
* minor changes to cbindlist c
* dev mergelist, for single pair now
* add quiet option to cc()
* mergelist tests
* add check for names to perhaps.dt
* rm mult from merge.dt method
* rework, clean, polish multer, fix righ and full joins
* make full join symmetric
* mergepair inner function to loop on
* extra check for symmetric
* mergelist manual
* ensure no df-dt passed where list expected
* comments and manual
* handle 0 cols tables
* more tests
* more tests and debugging
* move more logic closer to bmerge, simplify mergepair
* more tests
* revert not used changes
* reduce not needed checks, cleanup
* copy arg behavior, manual, no tests yet
* cbindlist manual, export both
* cleanup processing bmerge to dtmatch
* test function match order for easier preview
* vecseq gets short-circuit
* batch test allow browser
* big cleanup
* remmove unneeded stuff, reduce diff
* more cleanup, minor manual fixes
* add proper test scripts
* comment out not used code for coverage
* more tests, some nocopy opts
* rename sql test script, should fix codecov
* simplify dtmatch inner branch
* more precise copy, now copy only T or F
* unused arg not yet in api, wording
* comments and refer issues
* codecov
* hasindex coverage
* codecov gap
* tests for join using key, cols argument
* fix missing import forderv
* more tests, improve missing on handling
* more tests for order of inner and full join for long keys
* new allow.cartesian option, #4383, #914
* reduce diff, improve codecov
* reduce diff, comments
* need more DT, not lists, mergelist 3+ tbls
* proper escape heavy check
* unit tests
* more tests, address overalloc failure
* mergelist and cbindlist retain index
* manual, examples
* fix manual
* minor clarify in manual
* retain keys, right outer join for snowflake schema joins
* duplicates in cbindlist
* recycling in cbindlist
* escape 0 input in copyCols
* empty input handling
* closing cbindlist
* vectorized _on_ and _join.many_ arg
* rename dtmatch to dtmerge
* vectorized args: how, mult push down input validation add support for cross join, semi join, anti join
* full join, reduce overhead for mult=error
* mult default value dynamic
* fix manual
* add "see details" to Rd
* mention shared on in arg description
* amend feedback from Michael
* semi and anti joins will not reorder x columns
* spelling, thx to @jan-glx
* check all new funs used and add comments
* bugfix, sort=T needed for now
* Update NEWS.md
* NEWS placement
* numbering
* ascArg->order
* attempt to restore from master
* Update to stopf() error style
* Need isFrame for now
* More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}
* really restore from master
* try to PROTECT() before duplicate()
* update error message in test
* appease the rchk gods
* extraneous space
* missing ';'
* use catf
* simplify perhapsDataTableR
* move sqlite.Rraw.manual into other.Rraw
* simplify for loop
* first pass at publishable NEWS
* ws
* failed merge
* failed merge pt ii
* shrink diff
* pass at style
* Ditch mergelist(copy=) for setmergelist
* Put cols=NULL default into the signature to avoid missing() quirks
* Explain 'NULL' in cols= in Rd
* First pass on grammar for \arguments
* finish style+grammar pass
* restore 'join.many' to signature
* use 'try' for known error in example
* tweak examples
* Add \references for star/snowflake schema terminology
* fix test error messages, remove extra '[]' from brackify errors
* rm unreachable error
* coverage
* first pass at local() style tests
* linted style
* semicolons, spacing
* rearrange tests using options to be in nested local() calls
* restore new 'l' for rearranged tests; re-capture test using 'l' in local()
* Jan's clarifying comment
* Another pass at style, annotation; remove some duplicate tests
* more refinement of test structure, comments
* finished mergelist.Rraw
* more whitespace in constructed SQL queries
* style, continued
* more formal styling with lintr
* update reference to other.Rraw tests
* return output invisibly for set* functions
* mention setmergelist in NEWS
* numbering

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <chiricom@google.com>
1 parent 8ea8f72 commit 67670e9

8 files changed: +1525 -7 lines changed
NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ export(nafill)
 export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
+export(mergelist, setmergelist)
 export(cbindlist, setcbindlist)
 export(substitute2)
 #export(DT) # mtcars |> DT(i,j,by) #4872 #5472

NEWS.md

Lines changed: 4 additions & 0 deletions
@@ -48,6 +48,10 @@
 
 11. New `frev(x)` as a faster analogue to `base::rev()` for atomic vectors/lists, [#5885](https://github.com/Rdatatable/data.table/issues/5885). Twice as fast as `base::rev()` on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.
 
+12. New `cbindlist()` and `setcbindlist()` for concatenating a `list` of data.tables column-wise, evocative of the analogous `do.call(rbind, l)` <-> `rbindlist(l)`, [#2576](https://github.com/Rdatatable/data.table/issues/2576). `setcbindlist()` does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
+13. New `mergelist()` and `setmergelist()` similarly work _a la_ `Reduce()` to recursively merge a `list` of data.tables, [#599](https://github.com/Rdatatable/data.table/issues/599). Different join modes (_left_, _inner_, _full_, _right_, _semi_, _anti_, and _cross_) are supported through the `how` argument; duplicate handling goes through the `mult` argument. `setmergelist()` carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
 ### BUG FIXES
 
 1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR.
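
Reviewer note: NEWS item 13 describes `mergelist()` as working _a la_ `Reduce()`. The base-R sketch below only illustrates that fold pattern on plain data.frames with `base::merge()`; it is not the implementation in this commit, and the table contents and the `id` column are made up for the illustration.

```r
# Illustrative only: a left join folded over a list with Reduce() and
# base::merge(). The tables and the 'id' column are hypothetical.
tables <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = c(1L, 2L, 4L), b = c(10, 20, 40)),
  data.frame(id = c(2L, 3L), flag = c(TRUE, FALSE))
)

# Each step joins the accumulated result (lhs) with the next table (rhs),
# keeping every row of lhs (all.x = TRUE is a left join).
joined <- Reduce(
  function(lhs, rhs) merge(lhs, rhs, by = "id", all.x = TRUE),
  tables
)

# With this commit installed, the signature in R/mergelist.R suggests the
# counterpart for a list of data.tables would be along the lines of
#   mergelist(l, on = "id", how = "left")
# (not run here).
print(joined)
```

The unmatched `id = 4` row from the second table is dropped, as expected for a left join, while the rows with no match on the right-hand side get `NA`.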

R/data.table.R

Lines changed: 2 additions & 1 deletion
@@ -221,7 +221,7 @@ replace_dot_alias = function(e) {
     }
     return(x)
   }
-  if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first", "last", "all", "error")) stopf("mult argument can only be 'first', 'last', 'all' or 'error'")
   missingroll = missing(roll)
   if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
   if (is.character(roll)) {
@@ -520,6 +520,7 @@ replace_dot_alias = function(e) {
       }
       i = .shallow(i, retain.key = TRUE)
       ans = bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, ops, verbose=verbose)
+      if (mult == "error") mult = "all" ## error should have been raised inside bmerge() call above already, if it wasn't continue as mult="all"
       xo = ans$xo ## to make it available for further use.
       # temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
       # 'setorder', as there's another 'setorder' in generating 'irows' below...
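
Reviewer note: the hunk above adds `"error"` as an accepted `mult` value and, once `bmerge()` has returned without raising, falls back to `mult = "all"`. The sketch below shows that idea in base R under the assumption that `mult = "error"` means "stop if any lookup key matches more than one row"; `match_counts()` and `assert_single_match()` are hypothetical helpers, not data.table functions, and say nothing about how `bmerge()` detects the condition in C.

```r
# Hypothetical helpers illustrating the assumed mult = "error" semantics.

# Count, for every key looked up, how many rows of the target it matches.
match_counts <- function(i_keys, x_keys) {
  vapply(i_keys, function(k) sum(x_keys == k), integer(1L))
}

# Raise an error as soon as any key has more than one match; otherwise the
# caller can proceed exactly as it would for mult = "all".
assert_single_match <- function(i_keys, x_keys) {
  n <- match_counts(i_keys, x_keys)
  if (any(n > 1L))
    stop("multiple matches found for key(s): ",
         paste(i_keys[n > 1L], collapse = ", "))
  invisible(TRUE)
}

x_keys <- c("a", "b", "b")
assert_single_match("a", x_keys)                              # passes silently
res <- try(assert_single_match(c("a", "b"), x_keys), silent = TRUE)
print(inherits(res, "try-error"))
```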

R/mergelist.R

Lines changed: 102 additions & 4 deletions
@@ -9,7 +9,7 @@ cbindlist_impl_ = function(l, copy) {
 }
 
 cbindlist = function(l) cbindlist_impl_(l, copy=TRUE)
-setcbindlist = function(l) cbindlist_impl_(l, copy=FALSE)
+setcbindlist = function(l) invisible(cbindlist_impl_(l, copy=FALSE))
 
 # when 'on' is missing then use keys, used only for inner and full join
 onkeys = function(x, y) {
@@ -157,9 +157,9 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
       stopf("'on' is missing and necessary key is not present")
     }
     if (any(bad.on <- !on %chin% names(lhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
     if (any(bad.on <- !on %chin% names(rhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
   } else if (is.null(on)) {
     on = character() ## cross join only
   }
@@ -203,7 +203,7 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
       copy_x = TRUE
     ## ensure no duplicated column names in merge results
     if (any(dup.i <- names(out.i) %chin% names(out.x)))
-      stopf("merge result has duplicated column names [%s], use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
+      stopf("merge result has duplicated column names %s, use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
   }
 
   ## stack i and x
@@ -257,6 +257,104 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
   setDT(out)
 }
 
+mergelist_impl_ = function(l, on, cols, how, mult, join.many, copy) {
+  verbose = getOption("datatable.verbose")
+  if (verbose)
+    p = proc.time()[[3L]]
+
+  if (!is.list(l) || is.data.frame(l))
+    stopf("'%s' must be a list", "l")
+  if (!all(vapply_1b(l, is.data.table)))
+    stopf("Every element of 'l' list must be data.table objects")
+  if (!all(idx <- lengths(l) > 0L))
+    stopf("Tables in 'l' must all have columns, but these entries have 0: %s", brackify(which(!idx)))
+  if (any(idx <- vapply_1i(l, function(x) anyDuplicated(names(x))) > 0L))
+    stopf("Column names in individual 'l' entries must be unique, but these have some duplicates: %s", brackify(which(idx)))
+
+  n = length(l)
+  if (n < 2L) {
+    out = if (n) l[[1L]] else as.data.table(l)
+    if (copy) out = copy(out)
+    if (verbose)
+      catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)
+    return(out)
+  }
+
+  if (!is.list(join.many))
+    join.many = rep(list(join.many), n - 1L)
+  if (length(join.many) != n - 1L || !all(vapply_1b(join.many, isTRUEorFALSE)))
+    stopf("'join.many' must be TRUE or FALSE, or a list of such whose length must be length(l)-1L")
+
+  if (missing(mult))
+    mult = NULL
+  if (!is.list(mult))
+    mult = rep(list(mult), n - 1L)
+  if (length(mult) != n - 1L || !all(vapply_1b(mult, function(x) is.null(x) || (is.character(x) && length(x) == 1L && !anyNA(x) && x %chin% c("error", "all", "first", "last")))))
+    stopf("'mult' must be one of [error, all, first, last] or NULL, or a list of such whose length must be length(l)-1L")
+
+  if (!is.list(how))
+    how = rep(list(how), n-1L)
+  if (length(how)!=n-1L || !all(vapply_1b(how, function(x) is.character(x) && length(x)==1L && !anyNA(x) && x %chin% c("left", "inner", "full", "right", "semi", "anti", "cross"))))
+    stopf("'how' must be one of [left, inner, full, right, semi, anti, cross], or a list of such whose length must be length(l)-1L")
+
+  if (is.null(cols)) {
+    cols = vector("list", n)
+  } else {
+    if (!is.list(cols))
+      stopf("'%s' must be a list", "cols")
+    if (length(cols) != n)
+      stopf("'cols' must be same length as 'l' (%d != %d)", length(cols), n)
+    skip = vapply_1b(cols, is.null)
+    if (!all(vapply_1b(cols[!skip], function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x))))
+      stopf("'cols' must be a list of non-zero length, non-NA, non-duplicated, character vectors, or eventually NULLs (all columns)")
+    if (any(mapply(function(x, icols) !all(icols %chin% names(x)), l[!skip], cols[!skip])))
+      stopf("'cols' specify columns not present in corresponding table")
+  }
+
+  if (missing(on) || is.null(on)) {
+    on = vector("list", n - 1L)
+  } else {
+    if (!is.list(on))
+      on = rep(list(on), n - 1L)
+    if (length(on) != n-1L || !all(vapply_1b(on, function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x)))) ## length checked in dtmerge
+      stopf("'on' must be non-NA, non-duplicated, character vector, or a list of such which length must be length(l)-1L")
+  }
+
+  l.mem = lapply(l, vapply, address, "")
+  out = l[[1L]]
+  out.cols = cols[[1L]]
+  for (join.i in seq_len(n - 1L)) {
+    rhs.i = join.i + 1L
+    out = mergepair(
+      lhs = out, rhs = l[[rhs.i]],
+      on = on[[join.i]],
+      how = how[[join.i]], mult = mult[[join.i]],
+      lhs.cols = out.cols, rhs.cols = cols[[rhs.i]],
+      copy = FALSE, ## avoid any copies inside, will copy once below
+      join.many = join.many[[join.i]],
+      verbose = verbose
+    )
+    out.cols = copy(names(out))
+  }
+  out.mem = vapply_1c(out, address)
+  if (copy)
+    .Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
+  if (verbose)
+    catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]] - p)
+  out
+}
+
+mergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  mergelist_impl_(l, on, cols, how, mult, join.many, copy=TRUE)
+}
+setmergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  invisible(mergelist_impl_(l, on, cols, how, mult, join.many, copy=FALSE))
+}
+
 # Previously, we had a custom C implementation here, which is ~2x faster,
 # but this is fast enough we don't bother maintaining a new routine.
 # Hopefully in the future rep() can recognize the ALTREP and use that, too.
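
Reviewer note: `mergelist_impl_()` normalises `how`, `mult`, `join.many` and `on` the same way: a scalar is recycled into a list with one entry per pairwise join (`length(l) - 1L`), and a list is accepted only if it already has that length and every entry is valid. The base-R sketch below restates that pattern in isolation; `recycle_per_join()` is a hypothetical name that does not appear in the commit, and `%in%` and `stop()` stand in for the internal `%chin%` and `stopf()`.

```r
# Hypothetical restatement of the per-join argument recycling used above.
recycle_per_join <- function(arg, n_tables, choices) {
  n_joins <- n_tables - 1L
  # A bare value applies to every pairwise join.
  if (!is.list(arg))
    arg <- rep(list(arg), n_joins)
  valid_entry <- function(x)
    is.character(x) && length(x) == 1L && !anyNA(x) && x %in% choices
  if (length(arg) != n_joins || !all(vapply(arg, valid_entry, logical(1L))))
    stop("argument must be one of [", paste(choices, collapse = ", "),
         "], or a list of such whose length must be length(l)-1L")
  arg
}

join_modes <- c("left", "inner", "full", "right", "semi", "anti", "cross")
# Three tables means two joins, so a scalar is repeated twice.
str(recycle_per_join("left", 3L, join_modes))
# A list lets each join use its own mode.
str(recycle_per_join(list("left", "inner"), 3L, join_modes))
```

Keeping the length check and the per-entry check in one condition mirrors the hunk, where a wrong length and an invalid entry produce the same error message.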
