Commit 80f23d8

finish style+grammar pass
1 parent 24f4549 commit 80f23d8

1 file changed

+58
-56
lines changed

man/mergelist.Rd

Lines changed: 58 additions & 56 deletions
@@ -22,76 +22,77 @@
2222
\item{join.many}{ \code{logical}, defaulting to \code{getOption("datatable.join.many")}, which is \code{TRUE} by default; when \code{FALSE} and \code{mult="all"}, an error is thrown when any \emph{many-to-many} matches are detected between pairs of tables. This is essentially a stricter version of the \code{allow.cartesian} option in \code{\link{[.data.table}}. Note that the option \code{"datatable.join.many"} also controls the behavior of joins in \code{[.data.table}. }
2323
}
2424
\details{
25-
Function should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
25+
Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
2626

27-
Merging is performed sequentially, for \code{l} of 3 tables, it will do something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. Merging does not support \emph{non-equi joins}, column names to merge on must be common in both tables on each merge.
27+
Merging is performed sequentially from "left to right", so that for \code{l} of 3 tables, it will do something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. \emph{Non-equi joins} are not supported. Column names to merge on must be common in both tables on each merge.
2828

2929
Arguments \code{on}, \code{how}, \code{mult}, \code{join.many} could be lists as well, each of length \code{length(l)-1L}, to provide argument to be used for each single tables pair to merge, see examples.
3030

31-
Terms \emph{join-to} and \emph{join-from} depends on \code{how} argument:
31+
The terms \emph{join-to} and \emph{join-from} indicate which in a pair of tables is the "baseline" or "authoritative" source -- this governs the ordering of rows and columns.
32+
Whether each refers to the "left" or "right" table of a pair depends on the \code{how} argument:
3233
\enumerate{
33-
\item{ \code{how="left|semi|anti"}: \emph{join-to} is \emph{RHS}, \emph{join-from} is \emph{LHS}. }
34-
\item{ \code{how="inner|full|cross"}: treats \emph{LHS} and \emph{RHS} tables equally, terms applies to both tables. }
35-
\item{ \code{how="right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. }
34+
\item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-to} is \emph{RHS}, \emph{join-from} is \emph{LHS}. }
35+
\item{ \code{how \%in\% c("inner", "full", "cross")}: \emph{LHS} and \emph{RHS} tables are treated equally, so that the terms are interchangeable. }
36+
\item{ \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. }
3637
}
3738

38-
Using \code{mult="error"} will raise exception when multiple rows in \emph{join-to} table match to the row in \emph{join-from} table. It should not be used to just detect duplicates, as duplicates might not have matching row, and in such case exception will not be raised.
39+
Using \code{mult="error"} will throw an error when multiple rows in the \emph{join-to} table match a row in the \emph{join-from} table. It should not be used just to detect duplicates, which might not have a matching row, and thus would silently be missed.
3940

40-
Default value for argument \code{mult} depends on \code{how} argument:
41+
When not specified, \code{mult} takes its default depending on the \code{how} argument:
4142
\enumerate{
42-
\item{ \code{how="left|inner|full|right"}: sets \code{mult="error"}. }
43-
\item{ \code{how="semi|anti"}: sets \code{mult="last"}, although works same as \code{mult="first"}. }
44-
\item{ \code{how="cross"}: sets \code{mult="all"}. }
43+
\item{ When \code{how \%in\% c("left", "inner", "full", "right")}, \code{mult="error"}. }
44+
\item{ When \code{how \%in\% c("semi", "anti")}, \code{mult="last"}, although this is equivalent to \code{mult="first"}. }
45+
\item{ When \code{how == "cross"}, \code{mult="all"}. }
4546
}
4647

47-
When \code{on} argument is missing, then columns to join on will be decided based on \emph{key} depending on \code{how} argument:
48+
When the \code{on} argument is missing, it will be determined based on the \code{how} argument:
4849
\enumerate{
49-
\item{ \code{how="left|right|semi|anti"}: key columns of \emph{join-to} table. }
50-
\item{ \code{how="inner|full"}: if only one table has key, then this key is used, if both tables have key, then \code{intersect(key(lhs), key(rhs))}, having its order aligned to shorter key. }
50+
\item{ When \code{how \%in\% c("left", "right", "semi", "anti")}, \code{on} becomes the key columns of the \emph{join-to} table. }
51+
\item{ When \code{how \%in\% c("inner", "full")}, if only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
5152
}
5253

53-
When joining tables that are not directly linked to single table, e.g. snowflake schema, \emph{right} outer join can be used to optimize the sequence of merges, see examples.
54+
When joining tables that are not directly linked to a single table, e.g. a snowflake schema, a \emph{right} outer join can be used to optimize the sequence of merges, see Examples.
5455
}
5556
\value{
5657
A new \code{data.table} based on the merged objects.
5758

5859
For \code{setmergelist}, if possible, a \code{\link{copy}} of the inputs is avoided.
5960
}
6061
\note{
61-
Using \code{how="inner|full"} together with \code{mult!="all"} is sub-efficient. Unlike during join in \code{[.data.table}, it will apply \code{mult} on both tables. It is to ensure that the join is symmetric so \emph{LHS} and \emph{RHS} tables can be swapped, regardless of \code{mult} argument. It is always possible to apply \code{mult}-like filter manually and join using \code{mult="all"}.
62+
Using \code{how="inner"} or \code{how="full"} together with \code{mult!="all"} is sub-efficient. Unlike during joins in \code{[.data.table}, it will apply \code{mult} on both tables. This ensures that the join is symmetric so that the \emph{LHS} and \emph{RHS} tables can be swapped, regardless of \code{mult}. It is always possible to apply a \code{mult}-like filter manually and join using \code{mult="all"}.
6263

63-
Using \code{join.many=FALSE} is sub-efficient. Note that it only takes effect when \code{mult="all"}. If input data are verified to not have duplicated matches, then this can safely use the default \code{TRUE}. Otherwise for \code{mult="all"} merges it is recommended to use \code{join.many=FALSE}, unless of course \emph{many-to-many} join, that duplicates rows, is intended.
64+
Using \code{join.many=FALSE} is also sub-efficient. Note that it only takes effect when \code{mult="all"}. If input data are verified not to have duplicate matches, then this can safely use the default \code{TRUE}. Otherwise, for \code{mult="all"} merges it is recommended to use \code{join.many=FALSE}, unless of course \emph{many-to-many} joins, duplicating rows, are intended.
6465
}
6566
\seealso{
6667
\code{\link{[.data.table}}, \code{\link{merge.data.table}}
6768
}
6869
\examples{
6970
l = list(
70-
data.table(id1 = c(1:4,2:5), v1 = 1:8),
71-
data.table(id1 = 2:3, v2 = 1:2),
72-
data.table(id1 = 3:5, v3 = 1:3)
71+
data.table(id1=c(1:4, 2:5), v1=1:8),
72+
data.table(id1=2:3, v2=1:2),
73+
data.table(id1=3:5, v3=1:3)
7374
)
7475
mergelist(l, on="id1")
7576

7677
## using keys
7778
l = list(
78-
data.table(id1 = c(1:4,2:5), v1 = 1:8),
79-
data.table(id1 = 3:5, id2 = 1:3, v2 = 1:3, key="id1"),
80-
data.table(id2 = 1:4, v3 = 4:1, key="id2")
79+
data.table(id1=c(1:4, 2:5), v1=1:8),
80+
data.table(id1=3:5, id2=1:3, v2=1:3, key="id1"),
81+
data.table(id2=1:4, v3=4:1, key="id2")
8182
)
8283
mergelist(l)
8384

8485
## select columns
8586
l = list(
86-
data.table(id1 = c(1:4,2:5), v1 = 1:8, v2 = 8:1),
87-
data.table(id1 = 3:5, v3 = 1:3, v4 = 3:1, v5 = 1L, key="id1")
87+
data.table(id1=c(1:4, 2:5), v1=1:8, v2=8:1),
88+
data.table(id1=3:5, v3=1:3, v4=3:1, v5=1L, key="id1")
8889
)
89-
mergelist(l, cols = list(NULL, c("v3","v5")))
90+
mergelist(l, cols=list(NULL, c("v3", "v5")))
9091

9192
## different arguments for each merge pair
9293
l = list(
9394
data.table(id1=1:4, id2=4:1),
94-
data.table(id1=c(1:3,1:2), v2=c(1L,1L,1:2,2L)),
95+
data.table(id1=c(1:3, 1:2), v2=c(1L, 1L, 1:2, 2L)),
9596
data.table(id2=4:5)
9697
)
9798
mergelist(l,
@@ -101,8 +102,8 @@ mergelist(l,
101102

102103
## detecting duplicates matches
103104
l = list(
104-
data.table(id1=c(1:4,2:5), v1=1:8), ## dups in LHS are fine
105-
data.table(id1=c(2:3,2L), v2=1:3), ## dups in RHS
105+
data.table(id1=c(1:4, 2:5), v1=1:8), ## dups in LHS are fine
106+
data.table(id1=c(2:3, 2L), v2=1:3), ## dups in RHS
106107
data.table(id1=3:5, v3=1:3)
107108
)
108109
#mergelist(l, on="id1") # ERROR: mult='error' and multiple matches during merge
@@ -112,30 +113,31 @@ lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
112113

113114
### populate fact: US population by state and date
114115

115-
gt = state.x77[,"Population"]
116-
gt = data.table(state_id=seq_along(state.name), p=gt[state.name]/sum(gt), k=1L)
117-
tt = as.IDate(paste0(as.integer(time(uspop)),"-01-01"))
116+
gt = state.x77[, "Population"]
117+
gt = data.table(state_id=seq_along(state.name), p=gt[state.name] / sum(gt), k=1L)
118+
tt = as.IDate(paste0(as.integer(time(uspop)), "-01-01"))
118119
tt = as.data.table(stats::approx(tt, c(uspop), tt[1L]:tt[length(tt)]))
119120
tt = tt[, .(date=as.IDate(x), date_id=seq_along(x), pop=y, k=1L)]
120121
fact = tt[gt, on="k", allow.cartesian=TRUE,
121-
.(state_id=i.state_id, date_id=x.date_id, population = x.pop * i.p)]
122-
setkeyv(fact, c("state_id","date_id"))
122+
.(state_id=i.state_id, date_id=x.date_id, population=x.pop * i.p)]
123+
setkeyv(fact, c("state_id", "date_id"))
123124

124125
### populate dimensions: time and geography
125126

126-
time = data.table(key = "date_id",
127-
date_id = seq_along(tt$date), date = tt$date,
128-
month_id = month(tt$date), month = month.name[month(tt$date)],
129-
year_id = year(tt$date)-1789L, year = as.character(year(tt$date)),
130-
week_id = week(tt$date), week = as.character(week(tt$date)),
131-
weekday_id = wday(tt$date)-1L, weekday = weekdays(tt$date)
132-
)[weekday_id==0L, weekday_id:=7L][]
133-
geog = data.table(key = "state_id",
134-
state_id = seq_along(state.name), state_abb=state.abb, state_name=state.name,
135-
division_id = as.integer(state.division),
136-
division_name = as.character(state.division),
137-
region_id = as.integer(state.region),
138-
region_name = as.character(state.region)
127+
time = data.table(key="date_id",
128+
date_id=seq_along(tt$date), date=tt$date,
129+
month_id=month(tt$date), month=month.name[month(tt$date)],
130+
year_id=year(tt$date)-1789L, year=as.character(year(tt$date)),
131+
week_id=week(tt$date), week=as.character(week(tt$date)),
132+
weekday_id=wday(tt$date)-1L, weekday=weekdays(tt$date)
133+
)
134+
time[weekday_id == 0L, weekday_id := 7L][]
135+
geog = data.table(key="state_id",
136+
state_id=seq_along(state.name), state_abb=state.abb, state_name=state.name,
137+
division_id=as.integer(state.division),
138+
division_name=as.character(state.division),
139+
region_id=as.integer(state.region),
140+
region_name=as.character(state.region)
139141
)
140142
rm(gt, tt)
141143

@@ -155,17 +157,17 @@ make.lvl = function(x, cols) {
155157
setindexv(lvl, as.list(cols))
156158
}
157159
time = list(
158-
date = make.lvl(time, c("date_id","date","year_id","month_id","week_id",
159-
"weekday_id")),
160-
weekday = make.lvl(time, c("weekday_id","weekday")),
161-
week = make.lvl(time, c("week_id","week")),
162-
month = make.lvl(time, c("month_id","month")),
163-
year = make.lvl(time, c("year_id","year"))
160+
date = make.lvl(
161+
time, c("date_id", "date", "year_id", "month_id", "week_id", "weekday_id")),
162+
weekday = make.lvl(time, c("weekday_id", "weekday")),
163+
week = make.lvl(time, c("week_id", "week")),
164+
month = make.lvl(time, c("month_id", "month")),
165+
year = make.lvl(time, c("year_id", "year"))
164166
)
165167
geog = list(
166-
state = make.lvl(geog, c("state_id","state_abb","state_name","division_id")),
167-
division = make.lvl(geog, c("division_id","division_name","region_id")),
168-
region = make.lvl(geog, c("region_id","region_name"))
168+
state = make.lvl(geog, c("state_id", "state_abb", "state_name", "division_id")),
169+
division = make.lvl(geog, c("division_id", "division_name", "region_id")),
170+
region = make.lvl(geog, c("region_id", "region_name"))
169171
)
170172

171173
### denormalize 'snowflake schema'
