Commit fa19d2b

[cbindlist/mergelist] mergepair workhorse (#6437)
* cbindlist add cbind by reference, timing R prototype of mergelist wording use lower overhead funs stick to int32 for now, correct R_alloc bmerge C refactor for codecov and one loop for speed address revealed codecov gaps refactor vecseq for codecov seqexp helper, some alloccol export on C bmerge codecov, types handled in R bmerge already better comment seqexp bmerge mult=error #655 multiple new C utils swap if branches explain new C utils comments mostly reduce conflicts to PR #4386 comment C code address multiple matches during update-on-join #3747 Revert "address multiple matches during update-on-join #3747" This reverts commit b64c0c3. merge.dt has temporarily mult arg, for testing minor changes to cbindlist c dev mergelist, for single pair now add quiet option to cc() mergelist tests add check for names to perhaps.dt rm mult from merge.dt method rework, clean, polish multer, fix righ and full joins make full join symmetric mergepair inner function to loop on extra check for symmetric mergelist manual ensure no df-dt passed where list expected comments and manual handle 0 cols tables more tests more tests and debugging move more logic closer to bmerge, simplify mergepair more tests revert not used changes reduce not needed checks, cleanup copy arg behavior, manual, no tests yet cbindlist manual, export both cleanup processing bmerge to dtmatch test function match order for easier preview vecseq gets short-circuit batch test allow browser big cleanup remmove unneeded stuff, reduce diff more cleanup, minor manual fixes add proper test scripts Merge branch 'master' into cbind-merge-list comment out not used code for coverage more tests, some nocopy opts rename sql test script, should fix codecov simplify dtmatch inner branch more precise copy, now copy only T or F unused arg not yet in api, wording comments and refer issues codecov hasindex coverage codecov gap tests for join using key, cols argument fix missing import forderv more tests, improve missing on handling more tests for order of inner and full join for long keys new allow.cartesian option, #4383, #914 reduce diff, improve codecov reduce diff, comments need more DT, not lists, mergelist 3+ tbls proper escape heavy check unit tests more tests, address overalloc failure mergelist and cbindlist retain index manual, examples fix manual minor clarify in manual retain keys, right outer join for snowflake schema joins duplicates in cbindlist recycling in cbindlist escape 0 input in copyCols empty input handling closing cbindlist vectorized _on_ and _join.many_ arg rename dtmatch to dtmerge vectorized args: how, mult push down input validation add support for cross join, semi join, anti join full join, reduce overhead for mult=error mult default value dynamic fix manual add "see details" to Rd mention shared on in arg description amend feedback from Michael semi and anti joins will not reorder x columns Merge branch 'master' into cbind-merge-list spelling, thx to @jan-glx check all new funs used and add comments bugfix, sort=T needed for now Merge branch 'master' into cbind-merge-list Update NEWS.md Merge branch 'master' into cbind-merge-list Merge branch 'master' into cbind-merge-list NEWS placement numbering ascArg->order Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list attempt to restore from master Update to stopf() error style Need isFrame for now More quality checks: any(!x)->!all(x); use vapply_1{b,c,i} really restore from master try to PROTECT() before duplicate() update error message in test appease the rchk gods extraneous space missing ';' use catf simplify perhapsDataTableR move sqlite.Rraw.manual into other.Rraw simplify for loop Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
* restore ws change
* restore ws
* default is in onLoad now
* local() testing
* lost 'frac'
* pass for style
* simplify making out.{x,i}
* also touch up 'out.r' handling, use better variable name
* switch logical aggregation order
* notcopied implementation
* more style tweaks
* new test: setequal table in different order
* answer always sorted?
* only reverses ordering for how='right'
* eschew unless= argument to notcopied for readability
---------
Co-authored-by: Jan Gorecki <[email protected]>
1 parent b01f0e0 commit fa19d2b

4 files changed: +377 -3 lines changed
R/data.table.R

Lines changed: 1 addition & 1 deletion
@@ -545,7 +545,7 @@ replace_dot_alias = function(e) {
       if (allLen1) {
         irows = f__
       } else {
-        join.many = isTRUE(getOption("datatable.join.many", TRUE)) # #914, default TRUE for backward compatibility
+        join.many = isTRUE(getOption("datatable.join.many")) # #914, default TRUE for backward compatibility
         anyDups = !notjoin &&
           (
             # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
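
The change above drops the inline fallback from `getOption()`, so the `TRUE` default now has to come from wherever the option is registered (the commit message says "default is in onLoad now"). A minimal base R sketch of the difference, using a hypothetical option name `demo.join.many` rather than the real `datatable.join.many`:

```r
# Hypothetical option name, used only for illustration.
options(demo.join.many = NULL)                          # make sure the option is unset

old_style <- isTRUE(getOption("demo.join.many", TRUE))  # inline fallback supplies TRUE
new_style <- isTRUE(getOption("demo.join.many"))        # unset option is NULL, so FALSE

options(demo.join.many = TRUE)                          # stands in for a load-time default
after_load <- isTRUE(getOption("demo.join.many"))       # TRUE once the default is registered
```

Without a registered default, the new form evaluates to `FALSE`, which is why the two changes in this commit belong together.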

R/mergelist.R

Lines changed: 114 additions & 0 deletions
@@ -143,6 +143,120 @@ dtmerge = function(x, i, on, how, mult, join.many, void=FALSE, verbose) {
   list(ans=ans, irows=irows, xrows=xrows)
 }
 
+# atomic join between two tables
+mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=names(rhs), copy=TRUE, join.many=TRUE, verbose=FALSE) {
+  semi_or_anti = how == "semi" || how == "anti"
+  inner_or_full = how == "inner" || how == "full"
+
+  if (how != "cross") {
+    if (is.null(on)) {
+      if (how == "left" || semi_or_anti) on = key(rhs)
+      else if (how == "right") on = key(lhs)
+      else if (inner_or_full) on = onkeys(key(lhs), key(rhs))
+      if (is.null(on))
+        stopf("'on' is missing and necessary key is not present")
+    }
+    if (any(bad.on <- !on %chin% names(lhs)))
+      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
+    if (any(bad.on <- !on %chin% names(rhs)))
+      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
+  } else if (is.null(on)) {
+    on = character() ## cross join only
+  }
+
+  ## join-to and join-from tables and columns (right outer join-->swap)
+  if (how != "right") {
+    join_from = lhs; from_cols = lhs.cols; join_to = rhs; to_cols = rhs.cols
+  } else {
+    join_from = rhs; from_cols = rhs.cols; join_to = lhs; to_cols = lhs.cols
+  }
+
+  ## ensure symmetric join for inner|full join, apply mult on both tables, bmerge do only 'x' table
+  copy_i = FALSE ## copy marker of out.i
+  if (inner_or_full && !is.null(mult) && (mult == "first" || mult == "last")) {
+    join_from = fdistinct(join_from, on=on, mult=mult, cols=from_cols, copy=FALSE) ## might not copy when already unique by 'on'
+    copy_i = nrow(join_from) != nrow(lhs) ## nrow(lhs) bc how='inner|full' so join_from=lhs
+  } else if (how == "inner" && (is.null(mult) || mult == "error")) { ## we do this branch only to raise error from bmerge, we cannot use forder to just find duplicates because those duplicates might not have matching rows in another table, full join checks mult='error' during two non-void bmerges
+    dtmerge(x=join_from, i=join_to, on=on, how=how, mult=mult, verbose=verbose, join.many=join.many, void=TRUE)
+  }
+
+  ## binary merge
+  ans = dtmerge(x=join_to, i=join_from, on=on, how=how, mult=mult, verbose=verbose, join.many=join.many)
+
+  ## make i side; avoid subsetting if possible
+  cols_i = someCols(join_from, from_cols, keep=on, retain.order=semi_or_anti)
+  if (is.null(ans$irows)) {
+    out.i = .shallow(join_from, cols=cols_i, retain.key=TRUE)
+  } else {
+    out.i = .Call(CsubsetDT, join_from, ans$irows, cols_i)
+    copy_i = TRUE
+  }
+
+  ## make x side
+  copy_x = TRUE
+  if (semi_or_anti) {
+    out.x = list()
+  } else {
+    if (is.null(ans$xrows)) ## as of now xrows cannot be NULL #4409 thus nocov below
+      internal_error("dtmerge()$xrows returned NULL, #4409 been resolved but related code has not been updated?") # nocov
+    out.x = .Call(CsubsetDT, join_to, ans$xrows, someCols(join_to, to_cols, drop=on))
+    copy_x = TRUE
+    ## ensure no duplicated column names in merge results
+    if (any(dup.i <- names(out.i) %chin% names(out.x)))
+      stopf("merge result has duplicated column names [%s], use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
+  }
+
+  ## stack i and x
+  if (how != "full") {
+    if (!copy_i && copy) out.i = copy(out.i)
+    #if (!copy_x && copy) out.x = copy(out.x) ## as of now copy_x always TRUE, search for #4409 here
+    out = .Call(Ccbindlist, list(out.i, out.x), FALSE)
+    if (how == "right") setcolorder(out, neworder=c(on, names(out.x))) ## arrange columns: i.on, x.cols, i.cols
+  } else { # how=="full"
+    ## we made left join side above, proceed to right join side, so swap tbls
+    join_from = rhs; from_cols = rhs.cols; join_to = lhs; to_cols = lhs.cols
+
+    copy_r = FALSE
+    if (!is.null(mult) && (mult == "first" || mult == "last")) {
+      join_from = fdistinct(join_from, on=on, mult=mult, cols=from_cols, copy=FALSE)
+      copy_r = nrow(join_from) != nrow(rhs) ## nrow(rhs) bc join_from=rhs
+    } ## mult=="error" check was made on one side already, below we do on the second side, test 101.43
+
+    ## binary merge anti join; only need to keep 'irows'
+    mult = if (!is.null(mult) && mult != "all") mult
+    supplement_rows = dtmerge(x=join_to, i=join_from, on=on, how="anti", mult=mult, verbose=verbose, join.many=join.many)$irows
+
+    ## make anti join side
+    cols_r = someCols(join_from, from_cols, keep=on)
+    if (is.null(supplement_rows)) {
+      out.r = .shallow(join_from, cols=cols_r, retain.key=TRUE) ## retain.key is used only in the edge case when !nrow(out.i)
+    } else {
+      out.r = .Call(CsubsetDT, join_from, supplement_rows, cols_r)
+      copy_r = TRUE
+    }
+
+    ## short circuit to avoid rbindlist to empty sets and retains keys
+    if (!nrow(out.r)) { ## possibly also !nrow(out.i)
+      if (!copy_i && copy) out.i = copy(out.i)
+      #if (!copy_x && copy) out.x = copy(out.x) ## as of now copy_x always TRUE, search for #4409 here
+      out = .Call(Ccbindlist, list(out.i, out.x), FALSE)
+    } else if (!nrow(out.i)) { ## but not !nrow(out.r)
+      if (!copy_r && copy) out.r = copy(out.r)
+      if (length(add <- setdiff(names(out.i), names(out.r)))) { ## add missing columns of proper types NA
+        neworder = copy(names(out.i)) #set(out.r, NULL, add, lapply(unclass(out.i)[add], `[`, 1L)) ## 291.04 overalloc exceed fail during set()
+        out.i = lapply(unclass(out.i)[add], `[`, seq_len(nrow(out.r))) ## could eventually remove this when cbindlist recycle 0 rows up, note that we need out.r not to be copied
+        out.r = .Call(Ccbindlist, list(out.r, out.i), FALSE)
+        setcolorder(out.r, neworder=neworder)
+      }
+      out = out.r
+    } else { ## all might have not been copied yet, rbindlist will copy
+      out.l = .Call(Ccbindlist, list(out.i, out.x), FALSE)
+      out = rbindlist(list(out.l, out.r), use.names=TRUE, fill=TRUE)
+    }
+  }
+  setDT(out)
+}
+
 # Previously, we had a custom C implementation here, which is ~2x faster,
 # but this is fast enough we don't bother maintaining a new routine.
 # Hopefully in the future rep() can recognize the ALTREP and use that, too.
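
The `how=="full"` branch of `mergepair` builds its answer in two parts: the left-join side, then the rows of the other table that found no match (an anti join), stacked underneath with missing columns filled as `NA`. The sketch below illustrates that strategy with base R data frames only. It does not call `mergepair`, `dtmerge`, or any data.table internals, and the table and column names are made up for the example.

```r
# Illustration with base R data frames; names and values are hypothetical.
lhs <- data.frame(id = c(1L, 2L, 3L), a = c("x", "y", "z"), stringsAsFactors = FALSE)
rhs <- data.frame(id = c(2L, 3L, 4L), b = c(20, 30, 40))

# Left-join side: every lhs row, with rhs columns where the key matches (NA otherwise).
left_side <- merge(lhs, rhs, by = "id", all.x = TRUE)

# Anti-join side: rhs rows whose key has no match in lhs.
supplement <- rhs[!rhs$id %in% lhs$id, , drop = FALSE]

# Add the lhs-only column as NA of the proper type, then align the column order.
supplement$a <- NA_character_
supplement <- supplement[, names(left_side)]

# Stack the two parts to obtain the full join.
full <- rbind(left_side, supplement)
rownames(full) <- NULL
```

When the anti-join side is empty there is nothing to stack, which is the case the short circuit in the diff handles without calling `rbindlist`.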

R/onLoad.R

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@
     "datatable.print.trunc.cols"="FALSE", # for print.data.table
     "datatable.show.indices"="FALSE", # for print.data.table
     "datatable.allow.cartesian"="FALSE", # datatable.<argument name>
+    "datatable.join.many"="TRUE", # mergelist, [.data.table #4383 #914
     "datatable.dfdispatchwarn"="TRUE", # not a function argument
     "datatable.warnredundantby"="TRUE", # not a function argument
     "datatable.alloccol"="1024L", # argument 'n' of alloc.col. Over-allocate 1024 spare column slots

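The defaults in this vector are written as strings of R code (`"TRUE"`, `"1024L"`), which suggests they are parsed and applied at load time only for options the user has not already set. A minimal sketch of that pattern follows. It is an assumption about the mechanism, not a copy of the package's load hook, and the option names are hypothetical.

```r
# Hypothetical option names; each default is a string of R code.
defaults <- c("demo.join.many" = "TRUE", "demo.alloccol" = "1024L")

options(demo.join.many = NULL)   # unset, so the default should apply
options(demo.alloccol = 2048L)   # set by the user beforehand, so it should be kept

# Apply a default only where no value has been set yet.
for (nm in setdiff(names(defaults), names(options()))) {
  value <- eval(parse(text = defaults[[nm]]))
  options(structure(list(value), names = nm))
}
```

Registering the default this way leaves a user's earlier `options()` call untouched while making `getOption()` return `TRUE` for everyone else.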