2222 \item {join.many }{ \code {logical }, defaulting to \code {getOption(" datatable.join.many" )}, which is \code {TRUE } by default ; when \code {FALSE } and \code {mult = " all" }, an error is thrown when any \emph {many - to - many } matches are detected between pairs of tables. This is essentially a stricter version of the \code {allow.cartesian } option in \code {\link {[.data.table }}. Note that the option \code {" datatable.join.many" } also controls the behavior of joins in \code {[.data.table }. }
2323}
2424\details {
25- Function should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
25+ Note : these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
2626
27- Merging is performed sequentially , for \code {l } of 3 tables , it will do something like \code {merge(merge(l [[1L ]], l [[2L ]]), l [[3L ]])}. Merging does not support \emph {non - equi joins }, column names to merge on must be common in both tables on each merge.
27+ Merging is performed sequentially from "left to right", so that for an \code{l} of 3 tables, it does something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. \emph{Non-equi joins} are not supported. Column names to merge on must be common to both tables in each merge.
2828
2929 Arguments \code{on}, \code{how}, \code{mult}, \code{join.many} can also be lists, each of length \code{length(l) - 1L}, supplying the value to use for each pair of tables being merged; see Examples.
3030
31- Terms \emph {join - to } and \emph {join - from } depends on \code {how } argument :
31+ The terms \emph {join - to } and \emph {join - from } indicate which in a pair of tables is the " baseline" or " authoritative" source -- this governs the ordering of rows and columns.
32+ Whether each refers to the " left" or " right" table of a pair depends on the \code {how } argument :
3233 \enumerate {
33- \item { \code {how = " left| semi| anti" }: \emph {join - to } is \emph {RHS }, \emph {join - from } is \emph {LHS }. }
34- \item { \code {how = " inner| full| cross" }: treats \emph {LHS } and \emph {RHS } tables equally , terms applies to both tables . }
35- \item { \code {how = " right" }: \emph {join - to } is \emph {LHS }, \emph {join - from } is \emph {RHS }. }
34+ \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-to} is \emph{RHS}, \emph{join-from} is \emph{LHS}. }
35+ \item{ \code{how \%in\% c("inner", "full", "cross")}: \emph{LHS} and \emph{RHS} tables are treated equally, so that the terms are interchangeable. }
36+ \item{ \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. }
3637 }
3738
38- Using \code {mult = " error" } will raise exception when multiple rows in \emph {join - to } table match to the row in \emph {join - from } table. It should not be used to just detect duplicates , as duplicates might not have matching row , and in such case exception will not be raised .
39+ Using \code{mult = "error"} will throw an error when multiple rows in the \emph{join-to} table match a single row in the \emph{join-from} table. It should not be used just to detect duplicates: duplicated rows that have no matching row raise no error, and thus would silently be missed.
3940
40- Default value for argument \code {mult } depends on \code {how } argument :
41+ When not specified , \code {mult } takes its default depending on the \code {how } argument :
4142 \enumerate {
42- \item { \code {how = " left| inner| full| right" } : sets \code {mult = " error" }. }
43- \item { \code {how = " semi| anti" } : sets \code {mult = " last" }, although works same as \code {mult = " first" }. }
44- \item { \code {how = " cross" }: sets \code {mult = " all" }. }
43+ \item{ When \code{how \%in\% c("left", "inner", "full", "right")}, \code{mult = "error"}. }
44+ \item{ When \code{how \%in\% c("semi", "anti")}, \code{mult = "last"}, although this is equivalent to \code{mult = "first"}. }
45+ \item{ When \code{how == "cross"}, \code{mult = "all"}. }
4546 }
4647
47- When \code {on } argument is missing , then columns to join on will be decided based on \ emph { key } depending on \code {how } argument :
48+ When the \code{on} argument is missing, it is determined from the \emph{key}, depending on the \code{how} argument:
4849 \enumerate {
49- \item { \code {how = " left| right| semi| anti" } : key columns of \emph {join - to } table. }
50- \item { \code {how = " inner| full" } : if only one table has key , then this key is used , if both tables have key , then \code {intersect(key(lhs ), key(rhs ))}, having its order aligned to shorter key. }
50+ \item{ When \code{how \%in\% c("left", "right", "semi", "anti")}, \code{on} becomes the key columns of the \emph{join-to} table. }
51+ \item{ When \code{how \%in\% c("inner", "full")}, if only one table has key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to shorter key. }
5152 }
5253
53- When joining tables that are not directly linked to single table , e.g. snowflake schema , \emph {right } outer join can be used to optimize the sequence of merges , see examples .
54+ When joining tables that are not directly linked to a single table , e.g. a snowflake schema , a \emph {right } outer join can be used to optimize the sequence of merges , see Examples .
5455}
5556\value {
5657 A new \code {data.table } based on the merged objects.
5758
5859 For \code {setmergelist }, if possible , a \code {\link {copy }} of the inputs is avoided.
5960}
6061\note {
61- Using \code {how = " inner| full" } together with \code {mult != " all" } is sub - efficient. Unlike during join in \code {[.data.table }, it will apply \code {mult } on both tables. It is to ensure that the join is symmetric so \emph {LHS } and \emph {RHS } tables can be swapped , regardless of \code {mult } argument . It is always possible to apply \code {mult }- like filter manually and join using \code {mult = " all" }.
62+ Using \code{how = "inner"} or \code{how = "full"} together with \code{mult != "all"} is less efficient. Unlike joins in \code{[.data.table}, it applies \code{mult} to both tables. This ensures that the join is symmetric, so that the \emph{LHS} and \emph{RHS} tables can be swapped regardless of \code{mult}. It is always possible to apply a \code{mult}-like filter manually and join using \code{mult = "all"}.
6263
63- Using \code {join.many = FALSE } is sub - efficient. Note that it only takes effect when \code {mult = " all" }. If input data are verified to not have duplicated matches , then this can safely use the default \code {TRUE }. Otherwise for \code {mult = " all" } merges it is recommended to use \code {join.many = FALSE }, unless of course \emph {many - to - many } join , that duplicates rows , is intended.
64+ Using \code{join.many = FALSE} is also less efficient. Note that it only takes effect when \code{mult = "all"}. If the input data are verified not to have duplicate matches, then the default \code{TRUE} can safely be used. Otherwise, for \code{mult = "all"} merges it is recommended to use \code{join.many = FALSE}, unless of course a \emph{many-to-many} join, which duplicates rows, is intended.
6465}
6566\seealso {
6667 \code {\link {[.data.table }}, \code {\link {merge.data.table }}
6768}
6869\examples {
6970l = list (
70- data.table(id1 = c(1 : 4 ,2 : 5 ), v1 = 1 : 8 ),
71- data.table(id1 = 2 : 3 , v2 = 1 : 2 ),
72- data.table(id1 = 3 : 5 , v3 = 1 : 3 )
71+ data.table(id1 = c(1 : 4 , 2 : 5 ), v1 = 1 : 8 ),
72+ data.table(id1 = 2 : 3 , v2 = 1 : 2 ),
73+ data.table(id1 = 3 : 5 , v3 = 1 : 3 )
7374)
7475mergelist(l, on = "id1")
7576
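## sketch only: roughly what the sequential "left to right" merging amounts to,
## written with merge() rather than mergelist() and assuming a left join;
## 'tabs' and 'seq_merged' are hypothetical names used for illustration, and
## row order may differ from what mergelist() returns
library(data.table) # only needed when run outside the package examples
tabs = list(
  data.table(id1 = c(1:4, 2:5), v1 = 1:8),
  data.table(id1 = 2:3, v2 = 1:2),
  data.table(id1 = 3:5, v3 = 1:3)
)
## left-join each table in turn onto the accumulated result
seq_merged = Reduce(function(x, y) merge(x, y, by = "id1", all.x = TRUE), tabs)
seq_merged
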
7677# # using keys
7778l = list (
78- data.table(id1 = c(1:4,2:5), v1 = 1:8),
79- data.table(id1 = 3:5, id2 = 1:3, v2 = 1:3, key = "id1"),
80- data.table(id2 = 1:4, v3 = 4:1, key = "id2")
79+ data.table(id1 = c(1:4, 2:5), v1 = 1:8),
80+ data.table(id1 = 3:5, id2 = 1:3, v2 = 1:3, key = "id1"),
81+ data.table(id2 = 1:4, v3 = 4:1, key = "id2")
8182)
8283mergelist(l )
8384
8485# # select columns
8586l = list (
86- data.table(id1 = c(1:4,2:5), v1 = 1:8, v2 = 8:1),
87- data.table(id1 = 3:5, v3 = 1:3, v4 = 3:1, v5 = 1L, key = "id1")
87+ data.table(id1 = c(1:4, 2:5), v1 = 1:8, v2 = 8:1),
88+ data.table(id1 = 3:5, v3 = 1:3, v4 = 3:1, v5 = 1L, key = "id1")
8889)
89- mergelist(l, cols = list(NULL, c("v3","v5")))
90+ mergelist(l, cols = list(NULL, c("v3", "v5")))
9091
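## sketch only: applying a mult-like filter by hand before joining, as the
## Note section suggests; using unique() for this is an assumption of this
## sketch, and 'lhs', 'rhs', 'rhs_first', 'rhs_last' are hypothetical names
library(data.table) # only needed when run outside the package examples
lhs = data.table(id1 = c(1:4, 2:5), v1 = 1:8)
rhs = data.table(id1 = c(2:3, 2L), v2 = 1:3) ## id1 = 2 is duplicated
rhs_first = unique(rhs, by = "id1")                  ## like mult = "first"
rhs_last  = unique(rhs, by = "id1", fromLast = TRUE) ## like mult = "last"
## each filtered table now matches at most once per id1
merge(lhs, rhs_first, by = "id1", all.x = TRUE)
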
9192# # different arguments for each merge pair
9293l = list (
9394 data.table(id1 = 1 : 4 , id2 = 4 : 1 ),
94- data.table(id1 = c(1 : 3 ,1 : 2 ), v2 = c(1L ,1L ,1 : 2 ,2L )),
95+ data.table(id1 = c(1 : 3 , 1 : 2 ), v2 = c(1L , 1L , 1 : 2 , 2L )),
9596 data.table(id2 = 4 : 5 )
9697)
9798mergelist(l ,
@@ -101,8 +102,8 @@ mergelist(l,
101102
102103# # detecting duplicates matches
103104l = list (
104- data.table(id1 = c(1 : 4 ,2 : 5 ), v1 = 1 : 8 ), # # dups in LHS are fine
105- data.table(id1 = c(2 : 3 ,2L ), v2 = 1 : 3 ), # # dups in RHS
105+ data.table(id1 = c(1 : 4 , 2 : 5 ), v1 = 1 : 8 ), # # dups in LHS are fine
106+ data.table(id1 = c(2 : 3 , 2L ), v2 = 1 : 3 ), # # dups in RHS
106107 data.table(id1 = 3 : 5 , v3 = 1 : 3 )
107108)
108109# mergelist(l, on="id1") # ERROR: mult='error' and multiple matches during merge
@@ -112,30 +113,31 @@ lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
112113
113114# ## populate fact: US population by state and date
114115
115- gt = state.x77 [," Population" ]
116- gt = data.table(state_id = seq_along(state.name ), p = gt [state.name ]/ sum(gt ), k = 1L )
117- tt = as.IDate(paste0(as.integer(time(uspop ))," -01-01" ))
116+ gt = state.x77 [, " Population" ]
117+ gt = data.table(state_id = seq_along(state.name ), p = gt [state.name ] / sum(gt ), k = 1L )
118+ tt = as.IDate(paste0(as.integer(time(uspop )), " -01-01" ))
118119tt = as.data.table(stats :: approx(tt , c(uspop ), tt [1L ]: tt [length(tt )]))
119120tt = tt [, .(date = as.IDate(x ), date_id = seq_along(x ), pop = y , k = 1L )]
120121fact = tt [gt , on = " k" , allow.cartesian = TRUE ,
121- .(state_id = i.state_id , date_id = x.date_id , population = x.pop * i.p )]
122- setkeyv(fact , c(" state_id" ," date_id" ))
122+ .(state_id = i.state_id , date_id = x.date_id , population = x.pop * i.p )]
123+ setkeyv(fact , c(" state_id" , " date_id" ))
123124
124125# ## populate dimensions: time and geography
125126
126- time = data.table(key = " date_id" ,
127- date_id = seq_along(tt $ date ), date = tt $ date ,
128- month_id = month(tt $ date ), month = month.name [month(tt $ date )],
129- year_id = year(tt $ date )- 1789L , year = as.character(year(tt $ date )),
130- week_id = week(tt $ date ), week = as.character(week(tt $ date )),
131- weekday_id = wday(tt $ date )- 1L , weekday = weekdays(tt $ date )
132- )[weekday_id == 0L , weekday_id : = 7L ][]
133- geog = data.table(key = " state_id" ,
134- state_id = seq_along(state.name ), state_abb = state.abb , state_name = state.name ,
135- division_id = as.integer(state.division ),
136- division_name = as.character(state.division ),
137- region_id = as.integer(state.region ),
138- region_name = as.character(state.region )
127+ time = data.table(key = " date_id" ,
128+ date_id = seq_along(tt $ date ), date = tt $ date ,
129+ month_id = month(tt $ date ), month = month.name [month(tt $ date )],
130+ year_id = year(tt $ date )- 1789L , year = as.character(year(tt $ date )),
131+ week_id = week(tt $ date ), week = as.character(week(tt $ date )),
132+ weekday_id = wday(tt $ date )- 1L , weekday = weekdays(tt $ date )
133+ )
134+ time[weekday_id == 0L, weekday_id := 7L][]
135+ geog = data.table(key = " state_id" ,
136+ state_id = seq_along(state.name ), state_abb = state.abb , state_name = state.name ,
137+ division_id = as.integer(state.division ),
138+ division_name = as.character(state.division ),
139+ region_id = as.integer(state.region ),
140+ region_name = as.character(state.region )
139141)
140142rm(gt , tt )
141143
@@ -155,17 +157,17 @@ make.lvl = function(x, cols) {
155157 setindexv(lvl , as.list(cols ))
156158}
157159time = list (
158- date = make.lvl(time , c( " date_id " , " date " , " year_id " , " month_id " , " week_id " ,
159- " weekday_id" )),
160- weekday = make.lvl(time , c(" weekday_id" ," weekday" )),
161- week = make.lvl(time , c(" week_id" ," week" )),
162- month = make.lvl(time , c(" month_id" ," month" )),
163- year = make.lvl(time , c(" year_id" ," year" ))
160+ date = make.lvl(
161+ time , c( " date_id " , " date " , " year_id " , " month_id " , " week_id " , " weekday_id" )),
162+ weekday = make.lvl(time , c(" weekday_id" , " weekday" )),
163+ week = make.lvl(time , c(" week_id" , " week" )),
164+ month = make.lvl(time , c(" month_id" , " month" )),
165+ year = make.lvl(time , c(" year_id" , " year" ))
164166)
165167geog = list (
166- state = make.lvl(geog , c(" state_id" ," state_abb" ," state_name" ," division_id" )),
167- division = make.lvl(geog , c(" division_id" ," division_name" ," region_id" )),
168- region = make.lvl(geog , c(" region_id" ," region_name" ))
168+ state = make.lvl(geog , c(" state_id" , " state_abb" , " state_name" , " division_id" )),
169+ division = make.lvl(geog , c(" division_id" , " division_name" , " region_id" )),
170+ region = make.lvl(geog , c(" region_id" , " region_name" ))
169171)
170172
171173# ## denormalize 'snowflake schema'