vignettes/datatable-joins.Rmd
34 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -194,6 +194,40 @@ Products[ProductReceived,
194
194
on = .(id = product_id)]
195
195
```
196
196
197
+
#### 3.1.2. Using Named Lists for Explicit Joins
198
+
In `data.table`, joins can be performed using unnamed lists (`list()`) or named lists. Named lists make explicit which column each value is matched against, which reduces ambiguity, especially when joining on multiple columns.
vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
42 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -191,33 +191,63 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
191
191
192
192
* Since the time to compute the secondary index is quite small, we don't have to use `setindex()`, unless, once again, the task involves repeated subsetting on the same column.
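As a rough illustration of the point above, the sketch below contrasts a one-off subset with `on` against storing the index with `setindex()` for repeated use. The table `dt` and its values are hypothetical stand-ins for `flights`, not data from the vignette.

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(
  origin    = c("JFK", "LGA", "EWR", "JFK"),
  dest      = c("LAX", "TPA", "XNA", "TPA"),
  arr_delay = c(5L, 12L, -3L, 40L)
)

# One-off subset: `on` alone is enough
one_off <- dt[.("JFK"), on = "origin"]

# Repeated subsetting on the same column: store the index once and reuse it
setindex(dt, origin)
stored_index <- indices(dt)
repeated <- dt[.("JFK"), on = "origin"]
```

Both subsets return the same rows; `setindex()` only changes whether the index is kept around for later calls.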
193
193
194
-
### b) Select in `j`
194
+
### b) Using named list elements in `i`
195
+
When subsetting using the `on` argument, values in `i` are typically passed as unnamed elements. However, naming the elements in `i` explicitly improves readability, especially when dealing with multiple keys.
196
+
197
+
- Example: Standard subsetting using unnamed elements
198
+
```{r}
199
+
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
200
+
```
201
+
While this syntax is concise, it may not be immediately clear which value corresponds to which key in `on`.
202
+
203
+
- Subsetting using named elements in `i`
204
+
```{r}
205
+
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
206
+
```
207
+
Here, naming the elements explicitly `(origin = "LGA", dest = "TPA")` makes it clear which variable each value corresponds to. This improves code maintainability, especially in complex queries.
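To check that the two forms are interchangeable, here is a minimal sketch on a small hypothetical table (the name `dt` and its values are assumptions chosen for illustration, not rows from `flights`).

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(
  origin    = c("LGA", "JFK", "LGA", "EWR"),
  dest      = c("TPA", "TPA", "TPA", "XNA"),
  arr_delay = c(12L, 7L, 30L, -3L)
)

# Unnamed elements: values are paired with `on` by position
unnamed <- dt[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]

# Named elements: values are paired with `on` by name
named <- dt[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
```

Both calls select the same two LGA to TPA rows, so `unnamed` and `named` hold the same maximum.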
208
+
209
+
- Using named lists with multiple values
210
+
When several values are passed for a key, named elements further enhance clarity. First, the unnamed form:
211
+
```{r}
212
+
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
213
+
```
214
+
- Named elements
215
+
```{r}
216
+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
217
+
```
218
+
- Impact of named elements on key order
219
+
Names in `i` are matched to columns by name only when `on` is specified. If `on` is not used, data.table matches the values in `i` to the key columns by position, regardless of the names used.
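The difference can be sketched on a small hypothetical table (`dt`, `keyed`, and their values are assumptions for illustration). With `on`, the elements of `i` are looked up by name, so their order does not matter; on a keyed table without `on`, the same call is matched by position.

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(
  origin    = c("LGA", "JFK", "LGA", "EWR"),
  dest      = c("TPA", "TPA", "TPA", "XNA"),
  arr_delay = c(12L, 7L, 30L, -3L)
)

# With `on`: names in `i` decide the pairing, even in reversed order
by_name <- dt[.(dest = "TPA", origin = "LGA"), on = c("origin", "dest")]

# Without `on`: a keyed table pairs values with key columns by position,
# so "TPA" is compared against `origin` and "LGA" against `dest`
keyed <- copy(dt)
setkey(keyed, origin, dest)
by_position <- keyed[.(dest = "TPA", origin = "LGA"), nomatch = NULL]
```

Here `by_name` returns the two LGA to TPA rows, while `by_position` is empty because no row has `origin == "TPA"`.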
220
+
221
+
- When to use named list elements in `i`
222
+
Use named elements when working with multiple keys in `on`, as they make clear which value belongs to which key.
223
+
224
+
### c) Select in `j`
195
225
196
226
All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.
197
227
198
228
#### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
199
229
200
230
```{r}
201
-
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
231
+
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")]
202
232
```
203
233
204
-
### c) Chaining
234
+
### d) Chaining
205
235
206
236
#### -- On the result obtained above, use chaining to order the column in decreasing order.
207
237
208
238
```{r}
209
-
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
239
+
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
210
240
```
211
241
212
-
### d) Compute or *do* in `j`
242
+
### e) Compute or *do* in `j`
213
243
214
244
#### -- Find the maximum arrival delay corresponding to `origin = "LGA"` and `dest = "TPA"`.
215
245
216
246
```{r}
217
-
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
247
+
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
218
248
```
219
249
220
-
### e) *sub-assign* by reference using `:=` in `j`
250
+
### f) *sub-assign* by reference using `:=` in `j`
221
251
222
252
We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:
223
253
@@ -240,7 +270,7 @@ flights[, sort(unique(hour))]
240
270
241
271
* This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of `hour`, we had to `setkey()` on it, which inevitably reorders the entire data.table. With `on`, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.
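A minimal sketch of this behaviour on a hypothetical table (`dt`, `id`, and the values are assumptions for illustration): the rows where `hour` is 24 are updated by reference, and the row order is left untouched because no key is set.

```r
library(data.table)

# Hypothetical toy table; `id` records the original row order
dt <- data.table(
  id   = 1:5,
  hour = c(24L, 3L, 24L, 0L, 7L)
)

# Sub-assign by reference: replace 24 with 0 without calling setkey()
dt[.(24L), hour := 0L, on = "hour"]

order_after <- dt$id   # row order as it stands after the update
key_after   <- key(dt) # NULL when no key has been set
```

Comparing `order_after` with `1:5` confirms that the update did not reorder the table.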
242
272
243
-
### f) Aggregation using `by`
273
+
### g) Aggregation using `by`
244
274
245
275
#### -- Get the maximum departure delay for each `month` corresponding to `origin = "JFK"`. Order the result by `month`
246
276
@@ -251,7 +281,7 @@ head(ans)
251
281
252
282
* We would have had to set the `key` back to `origin, dest` again, if we did not use `on` which internally builds secondary indices on the fly.
253
283
254
-
### g) The *mult* argument
284
+
### h) The *mult* argument
255
285
256
286
The other arguments, including `mult`, work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can instead choose to return only the "first" or the "last" matching row.
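The three settings can be compared side by side in a short sketch; the table `dt` and its values are hypothetical and chosen only so that several rows match.

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(
  origin    = c("LGA", "JFK", "LGA", "EWR", "EWR"),
  dest      = c("XNA", "TPA", "XNA", "XNA", "XNA"),
  arr_delay = c(1L, 2L, 3L, 4L, 5L)
)

all_rows  <- dt[.("XNA"), on = "dest"]                  # default: mult = "all"
first_row <- dt[.("XNA"), on = "dest", mult = "first"]  # first matching row only
last_row  <- dt[.("XNA"), on = "dest", mult = "last"]   # last matching row only
```

In this toy table four rows have `dest == "XNA"`, so `all_rows` has four rows while `first_row` and `last_row` have one each.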
257
287
@@ -264,17 +294,17 @@ flights[c("BOS", "DAY"), on = "dest", mult = "first"]
264
294
#### -- Subset only the last matching row where `origin` matches *"LGA", "JFK", "EWR"* and `dest` matches *"XNA"*
265
295
266
296
```{r}
267
-
flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
297
+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), on = c("origin", "dest"), mult = "last"]
268
298
```
269
299
270
-
### h) The *nomatch* argument
300
+
### i) The *nomatch* argument
271
301
272
302
We can choose if queries that do not match should return `NA` or be skipped altogether using the `nomatch` argument.
273
303
274
304
#### -- From the previous example, subset all rows only if there's a match
275
305
276
306
```{r}
277
-
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
307
+
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
278
308
```
279
309
280
310
* There are no flights connecting "JFK" and "XNA". Therefore, that row is skipped in the result.
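The effect of `nomatch` can be reproduced on a small hypothetical table (`dt` and its values are assumptions for illustration) that, like the case described above, has no row connecting "JFK" and "XNA".

```r
library(data.table)

# Hypothetical toy table: no JFK to XNA row
dt <- data.table(
  origin    = c("LGA", "JFK", "LGA", "EWR"),
  dest      = c("XNA", "TPA", "XNA", "XNA"),
  arr_delay = c(1L, 2L, 3L, 4L)
)

# Default (nomatch = NA): the unmatched query row is kept and filled with NA
with_na <- dt[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"),
              on = c("origin", "dest"), mult = "last"]

# nomatch = NULL: the unmatched query row is dropped
matches_only <- dt[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"),
                   on = c("origin", "dest"), mult = "last", nomatch = NULL]
```

`with_na` has one row per queried origin, with `NA` in `arr_delay` for "JFK"; `matches_only` keeps only the "LGA" and "EWR" rows.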