Commit 4a416fc

added examples
1 parent 5c964b3 commit 4a416fc

2 files changed

+76
-12
lines changed

vignettes/datatable-joins.Rmd

Lines changed: 34 additions & 0 deletions
@@ -194,6 +194,40 @@ Products[ProductReceived,
194194
on = .(id = product_id)]
195195
```
196196

197+
#### 3.1.2. Using Named Lists for Explicit Joins
198+
In `data.table`, the columns to join on can be passed to `on` either as a character vector of column names or as a named list built with `.()` (or `list()`). Named lists state explicitly which column of each table is matched, which reduces ambiguity, especially when joining on multiple columns.
199+
200+
- Using a character vector of column names:
201+
```{r}
202+
dt1 <- data.table(id = c(1, 2, 3), value = c("A", "B", "C"))
203+
dt2 <- data.table(id = c(2, 3, 4), info = c("X", "Y", "Z"))
204+
dt1[dt2, on = "id"]
205+
```
206+
- Using a named list for explicit joins:
207+
```{r}
208+
dt1[dt2, on = .(id = id)]
209+
```
210+
- Here, `.()` is shorthand for `list()`, and explicitly naming the column (`id = id`) makes the join easier to understand.
211+
212+
##### Named Lists for Multiple Column Joins
213+
When joining on multiple columns, named lists prevent mismatches and make the query more readable:
214+
```{r}
215+
dt1 <- data.table(id = c(1, 2, 3), key1 = c("A", "B", "C"), value = c(10, 20, 30))
216+
dt2 <- data.table(id = c(2, 3, 4), key1 = c("B", "C", "D"), info = c("X", "Y", "Z"))
217+
218+
# Character vector approach (less explicit)
219+
dt1[dt2, on = c("id", "key1")]
220+
221+
# Named list approach (explicit and clear)
222+
dt1[dt2, on = .(id = id, key1 = key1)]
223+
```
224+
This ensures that column names are explicitly matched, which is especially useful when working with complex datasets.
225+
226+
- When should you use named lists? Consider them when:
227+
There is potential ambiguity in column names.
228+
You are joining on multiple columns.
229+
You want to make your joins self-documenting and more readable.
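The named form matters most when the join columns have different names in the two tables, as in the `on = .(id = product_id)` call earlier in this vignette. The following is a minimal sketch of that case; the tables `products` and `received` and their contents are hypothetical and chosen only for illustration.

```{r}
library(data.table)

# Hypothetical toy tables: the key is called `id` in one table
# and `product_id` in the other.
products <- data.table(id = c(1L, 2L, 3L),
                       name = c("banana", "carrot", "popcorn"))
received <- data.table(product_id = c(2L, 3L, 4L),
                       count = c(100L, 150L, 50L))

# Left side of `=` is the column in `products` (x),
# right side is the column in `received` (i).
res <- products[received, on = .(id = product_id)]
res
```

Every row of `received` is kept, so the row with `product_id` 4, which has no counterpart in `products`, gets `NA` in `name`.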
230+
197231
#### 3.1.2. Alternatives to define the `on` argument
198232

199233
In all the prior examples we have passed the column names we want to match to the `on` argument, but `data.table` also has alternatives to that syntax.

vignettes/datatable-secondary-indices-and-auto-indexing.Rmd

Lines changed: 42 additions & 12 deletions
@@ -191,33 +191,63 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
191191

192192
* Since the time to compute the secondary index is quite small, we don't have to use `setindex()`, unless, once again, the task involves repeated subsetting on the same column.
193193

194-
### b) Select in `j`
194+
### b) Using named list elements in `i`
195+
When subsetting using the `on` argument, values in `i` are typically passed as unnamed elements. However, naming elements explicitly in `i` improves readability, especially when dealing with multiple keys.
196+
197+
- Example: Standard subsetting using unnamed elements
198+
```{r}
199+
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
200+
```
201+
While this syntax is concise, it may not be immediately clear which value corresponds to which key in `on`.
202+
203+
- Subsetting using named elements in `i`
204+
```{r}
205+
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
206+
```
207+
Here, naming the elements explicitly `(origin = "LGA", dest = "TPA")` makes it clear which variable each value corresponds to. This improves code maintainability, especially in complex queries.
208+
209+
- Using named lists with multiple values
210+
When multiple values are passed, named elements further enhance clarity:
211+
```{r}
212+
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
213+
```
214+
- Named elements
215+
```{r}
216+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
217+
```
218+
- Impact of named elements on key order
219+
It's important to note that names in `i` are matched against the columns listed in `on` only when `on` is specified. If `on` is not used, data.table matches the values in `i` to the key columns by position, regardless of the names used.
220+
221+
- When to use named list elements in `i`:
222+
Use them when working with multiple keys in `on`, as it improves readability.
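The note on names and ordering can be checked on a small table. This is only a sketch: `dt` is a hypothetical stand-in for `flights`, and the behaviour shown assumes that, with `on` specified, named elements of `i` are looked up by name rather than by position.

```{r}
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(origin    = c("LGA", "LGA", "JFK"),
                 dest      = c("TPA", "XNA", "TPA"),
                 arr_delay = c(5, 12, -3))

# Unnamed elements: matched to the columns in `on` by position
a <- dt[.("LGA", "TPA"), on = c("origin", "dest")]

# Named elements in a different order: matched to `on` by name
b <- dt[.(dest = "TPA", origin = "LGA"), on = c("origin", "dest")]

a$arr_delay
b$arr_delay
```

Both calls select the single row with `origin == "LGA"` and `dest == "TPA"`.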
223+
224+
### c) Select in `j`
195225

196226
All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.
197227

198228
#### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
199229

200230
```{r}
201-
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
231+
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")]
202232
```
203233

204-
### c) Chaining
234+
### d) Chaining
205235

206236
#### -- On the result obtained above, use chaining to order the column in decreasing order.
207237

208238
```{r}
209-
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
239+
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
210240
```
211241

212-
### d) Compute or *do* in `j`
242+
### e) Compute or *do* in `j`
213243

214244
#### -- Find the maximum arrival delay corresponding to `origin = "LGA"` and `dest = "TPA"`.
215245

216246
```{r}
217-
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
247+
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
218248
```
219249

220-
### e) *sub-assign* by reference using `:=` in `j`
250+
### f) *sub-assign* by reference using `:=` in `j`
221251

222252
We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:
223253

@@ -240,7 +270,7 @@ flights[, sort(unique(hour))]
240270

241271
* This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of `hour`, we had to `setkey()` on it, which inevitably reorders the entire data.table. With `on`, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.
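The order-preserving update can be seen on a small table. In this sketch `dt` is a hypothetical stand-in for `flights`, holding only an `hour` column.

```{r}
library(data.table)

# Hypothetical toy table with a few `hour` values, some equal to 24
dt <- data.table(hour = c(0L, 24L, 5L, 24L))

# Replace 24 with 0 by reference, matching rows through `on`
dt[.(24L), hour := 0L, on = "hour"]

# Row order is unchanged; only the matching rows were updated
dt$hour
```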
242272

243-
### f) Aggregation using `by`
273+
### g) Aggregation using `by`
244274

245275
#### -- Get the maximum departure delay for each `month` corresponding to `origin = "JFK"`. Order the result by `month`
246276

@@ -251,7 +281,7 @@ head(ans)
251281

252282
* We would have had to set the `key` back to `origin, dest` again, if we did not use `on` which internally builds secondary indices on the fly.
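A minimal sketch of this aggregation pattern on a toy table follows. The table `dt` and its values are hypothetical; only `on`, `keyby`, and `max()` are taken from the text above.

```{r}
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(origin    = c("JFK", "JFK", "LGA", "JFK"),
                 month     = c(2L, 1L, 1L, 1L),
                 dep_delay = c(10, 4, 99, 7))

# Subset on origin via `on`, then aggregate per month, ordered by month
ans <- dt["JFK", max(dep_delay), keyby = month, on = "origin"]
ans
```

No key is set on `dt` at any point, and the original row order of `dt` is left as it was.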
253283

254-
### g) The *mult* argument
284+
### h) The *mult* argument
255285

256286
The other arguments including `mult` work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can choose, instead, to return only the "first" or "last" matching rows.
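The effect of `mult` is easiest to see on a table small enough to inspect by eye. The table `dt` below is hypothetical and used only for illustration.

```{r}
library(data.table)

# Hypothetical toy table: two rows for "BOS", three for "DAY"
dt <- data.table(dest   = c("BOS", "BOS", "DAY", "DAY", "DAY"),
                 flight = 1:5)

# Keep only the first, or only the last, matching row per value in `i`
first_rows <- dt[c("BOS", "DAY"), on = "dest", mult = "first"]
last_rows  <- dt[c("BOS", "DAY"), on = "dest", mult = "last"]

first_rows$flight
last_rows$flight
```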
257287

@@ -264,17 +294,17 @@ flights[c("BOS", "DAY"), on = "dest", mult = "first"]
264294
#### -- Subset only the last matching row where `origin` matches *"LGA", "JFK", "EWR"* and `dest` matches *"XNA"*
265295

266296
```{r}
267-
flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
297+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), on = c("origin", "dest"), mult = "last"]
268298
```
269299

270-
### h) The *nomatch* argument
300+
### i) The *nomatch* argument
271301

272302
We can choose if queries that do not match should return `NA` or be skipped altogether using the `nomatch` argument.
273303

274304
#### -- From the previous example, subset all rows only if there's a match
275305

276306
```{r}
277-
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
307+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
278308
```
279309

280310
* There are no flights connecting "JFK" and "XNA". Therefore, that row is skipped in the result.
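The difference between the default and `nomatch = NULL` can be reproduced on a toy table. This is a sketch only: `dt` is a hypothetical stand-in for `flights`, built so that "JFK" has no row with `dest == "XNA"`.

```{r}
library(data.table)

# Hypothetical toy table with no JFK -> XNA row
dt <- data.table(origin = c("LGA", "LGA", "EWR"),
                 dest   = c("XNA", "XNA", "XNA"),
                 flight = 1:3)

# Default: the unmatched query row is returned with NA in non-join columns
with_na <- dt[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"),
              on = c("origin", "dest"), mult = "last"]

# nomatch = NULL: the unmatched query row is dropped
matched <- dt[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"),
              on = c("origin", "dest"), mult = "last", nomatch = NULL]

nrow(with_na)
nrow(matched)
```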