vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
Lines changed: 15 additions & 40 deletions
Original file line number
Diff line number
Diff line change
@@ -191,64 +191,39 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
191
191
192
192
* Since the time to compute the secondary index is quite small, we don't have to use `setindex()`, unless, once again, the task involves repeated subsetting on the same column.
193
193
194
-
### b) Using named list elements in `i`
195
-
When subsetting using the `on` argument, values in `i` are typically passed as unnamed elements. However, naming elements explicitly in `i` improves readability, especially when dealing with multiple keys.
196
-
197
-
- Example: Standard subsetting using unnamed elements
198
-
```{r}
199
-
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
200
-
```
201
-
While this syntax is concise, it may not be immediately clear which value corresponds to which key in `on`.
202
-
203
-
- Subsetting using named elements in `i`
194
+
* For clarity/readability, it might help to name the inputs in `i`, e.g.,
204
195
```{r}
205
-
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
196
+
flights[.(origin = "JFK", dest = "LAX"), on = c("origin", "dest")]
206
197
```
207
-
Naming elements explicitly `(origin = "LGA", dest = "TPA")` clarifies variable correspondence.
198
+
This makes it clear which values correspond to which key.
208
199
209
-
- Using named lists with multiple values
210
-
When multiple values are passed, named elements further enhance clarity:
211
-
```{r unnamed_elemts}
212
-
flights[.("LGA", "JFK", "EWR"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
213
-
```
214
-
215
-
```{r named_elements}
216
-
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
217
-
```
218
-
- Impact of named elements on key order
219
-
```{r}
220
-
flights[.(dest = "TPA", origin = "LGA"), on = .(origin, dest)]
221
-
```
222
-
- When to use named list elements in `i`.
223
-
when working with multiple keys in `on`, as it improves readability.
224
-
225
-
### c) Select in `j`
200
+
### b) Select in `j`
226
201
227
202
All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.
228
203
229
204
#### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
230
205
231
206
```{r}
232
-
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")]
207
+
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
233
208
```
234
209
235
-
### d) Chaining
210
+
### c) Chaining
236
211
237
212
#### -- On the result obtained above, use chaining to order the column in decreasing order.
238
213
239
214
```{r}
240
-
flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
215
+
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
241
216
```
242
217
243
-
### e) Compute or *do* in `j`
218
+
### d) Compute or *do* in `j`
244
219
245
220
#### -- Find the maximum arrival delay corresponding to `origin = "LGA"` and `dest = "TPA"`.
246
221
247
222
```{r}
248
-
flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
223
+
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
249
224
```
250
225
251
-
### f) *sub-assign* by reference using `:=` in `j`
226
+
### e) *sub-assign* by reference using `:=` in `j`
252
227
253
228
We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:
254
229
@@ -271,7 +246,7 @@ flights[, sort(unique(hour))]
271
246
272
247
* This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of `hour`, we had to `setkey()` on it, which inevitably reorders the entire data.table. With `on`, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.
273
248
274
-
### g) Aggregation using `by`
249
+
### f) Aggregation using `by`
275
250
276
251
#### -- Get the maximum departure delay for each `month` corresponding to `origin = "JFK"`. Order the result by `month`
277
252
@@ -282,7 +257,7 @@ head(ans)
282
257
283
258
* We would have had to set the `key` back to `origin, dest` again, if we did not use `on` which internally builds secondary indices on the fly.
284
259
285
-
### h) The *mult* argument
260
+
### g) The *mult* argument
286
261
287
262
The other arguments including `mult` work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can instead choose to return only the "first" or "last" matching rows.
288
263
@@ -295,17 +270,17 @@ flights[c("BOS", "DAY"), on = "dest", mult = "first"]
295
270
#### -- Subset only the last matching row where `origin` matches *"LGA", "JFK", "EWR"* and `dest` matches *"XNA"*
296
271
297
272
```{r}
298
-
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), on = c("origin", "dest"), mult = "last"]
273
+
flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
299
274
```
300
275
301
-
### i) The *nomatch* argument
276
+
### h) The *nomatch* argument
302
277
303
278
We can choose if queries that do not match should return `NA` or be skipped altogether using the `nomatch` argument.
304
279
305
280
#### -- From the previous example, subset all rows only if there's a match
306
281
307
282
```{r}
308
-
flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
283
+
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
309
284
```
310
285
311
286
* There are no flights connecting "JFK" and "XNA". Therefore, that row is skipped in the result.
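
The named and unnamed forms of `i` that this diff switches between can be compared on a small table. The sketch below is illustrative only: `flights_toy` and its values are hypothetical stand-ins for `flights`, not data from the vignette. It assumes that unnamed elements of `i` are matched to the `on` columns by position, and named elements by name.

```r
library(data.table)

# Hypothetical toy stand-in for `flights`; all values are invented.
flights_toy <- data.table(
  origin    = c("LGA", "LGA", "LGA", "JFK", "EWR", "EWR"),
  dest      = c("TPA", "TPA", "XNA", "LAX", "XNA", "XNA"),
  arr_delay = c(5L, 40L, 9L, -3L, 12L, 7L)
)

# Unnamed elements in i: matched to the `on` columns by position.
unnamed <- flights_toy[.("LGA", "TPA"), on = c("origin", "dest")]

# Named elements in i: matched to the `on` columns by name,
# so the order in which they are written does not matter.
named    <- flights_toy[.(origin = "LGA", dest = "TPA"), on = c("origin", "dest")]
reversed <- flights_toy[.(dest = "TPA", origin = "LGA"), on = .(origin, dest)]

# Compute in j on the matching rows only.
max_delay <- flights_toy[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]

# mult = "last" keeps the last matching row per query row.
# With the default nomatch, the unmatched JFK/XNA query returns NA.
last_na <- flights_toy[.(c("LGA", "JFK", "EWR"), "XNA"),
                       on = c("origin", "dest"), mult = "last"]

# nomatch = NULL drops queries that have no match.
last_only <- flights_toy[.(c("LGA", "JFK", "EWR"), "XNA"),
                         on = c("origin", "dest"), mult = "last", nomatch = NULL]
```

All three subsetting forms select the same two rows here, which is why the PR can drop the names from the later examples without changing their results. The named form mainly helps a reader see which value belongs to which column.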