Commit a8a03e9

committed
seggested improvements
1 parent b76dc29 commit a8a03e9

1 file changed

+15
-40
lines changed

vignettes/datatable-secondary-indices-and-auto-indexing.Rmd

Lines changed: 15 additions & 40 deletions
@@ -191,64 +191,39 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
 
 * Since the time to compute the secondary index is quite small, we don't have to use `setindex()`, unless, once again, the task involves repeated subsetting on the same column.
 
-### b) Using named list elements in `i`
-When subsetting using the `on` argument, values in `i` are typically passed as unnamed elements. However, naming elements explicitly in `i` improves readability, especially when dealing with multiple keys.
-
-- Example: Standard subsetting using unnamed elements
-```{r}
-flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
-```
-While this syntax is concise, it may not be immediately clear which value corresponds to which key in `on`.
-
-- Subsetting using named elements in `i`
+* For clarity/readability, it might help to name the inputs in `i`, e.g.,
 ```{r}
-flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
+flights[.(origin = "JFK", dest = "LAX"), on = c("origin", "dest")]
 ```
-Naming elements explicitly `(origin = "LGA", dest = "TPA")` clarifies variable correspondence.
+This makes it clear which values correspond to which key.
 
-- Using named lists with multiple values
-When multiple values are passed, named elements further enhance clarity:
-```{r unnamed_elemts}
-flights[.("LGA", "JFK", "EWR"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
-```
-
-```{r named_elements}
-flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
-```
-- Impact of named elements on key order
-```{r}
-flights[.(dest = "TPA", origin = "LGA"), on = .(origin, dest)]
-```
-- When to use named list elements in `i`.
-when working with multiple keys in `on`, as it improves readability.
-
-### c) Select in `j`
+### b) Select in `j`
 
 All the operations we will discuss below are no different to the ones we already saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. Except we'll be using the `on` argument instead of setting keys.
 
 #### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
 
 ```{r}
-flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")]
+flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
 ```
 
-### d) Chaining
+### c) Chaining
 
 #### -- On the result obtained above, use chaining to order the column in decreasing order.
 
 ```{r}
-flights[.(origin = "LGA", dest = "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
+flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
 ```
 
-### e) Compute or *do* in `j`
+### d) Compute or *do* in `j`
 
 #### -- Find the maximum arrival delay corresponding to `origin = "LGA"` and `dest = "TPA"`.
 
 ```{r}
-flights[.(origin = "LGA", dest = "TPA"), max(arr_delay), on = c("origin", "dest")]
+flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
 ```
 
-### f) *sub-assign* by reference using `:=` in `j`
+### e) *sub-assign* by reference using `:=` in `j`
 
 We have seen this example already in the vignettes [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) and [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html). Let's take a look at all the `hours` available in the `flights` *data.table*:
 
@@ -271,7 +246,7 @@ flights[, sort(unique(hour))]
 
 * This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of `hour`, we had to `setkey()` on it, which inevitably reorders the entire data.table. With `on`, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.
 
-### g) Aggregation using `by`
+### f) Aggregation using `by`
 
 #### -- Get the maximum departure delay for each `month` corresponding to `origin = "JFK"`. Order the result by `month`
 
@@ -282,7 +257,7 @@ head(ans)
 
 * We would have had to set the `key` back to `origin, dest` again, if we did not use `on` which internally builds secondary indices on the fly.
 
-### h) The *mult* argument
+### g) The *mult* argument
 
 The other arguments including `mult` work exactly the same way as we saw in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette. The default value for `mult` is "all". We can choose, instead only the "first" or "last" matching rows should be returned.
 
@@ -295,17 +270,17 @@ flights[c("BOS", "DAY"), on = "dest", mult = "first"]
 #### -- Subset only the last matching row where `origin` matches *"LGA", "JFK", "EWR"* and `dest` matches *"XNA"*
 
 ```{r}
-flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), on = c("origin", "dest"), mult = "last"]
+flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
 ```
 
-### i) The *nomatch* argument
+### h) The *nomatch* argument
 
 We can choose if queries that do not match should return `NA` or be skipped altogether using the `nomatch` argument.
 
 #### -- From the previous example, subset all rows only if there's a match
 
 ```{r}
-flights[.(origin = c("LGA", "JFK", "EWR"), dest = "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
+flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
 ```
 
 * There are no flights connecting "JFK" and "XNA". Therefore, that row is skipped in the result.
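
The commit swaps named inputs in `i` for unnamed ones in most examples and keeps a single named example for readability. As a rough way to check that the two spellings select the same rows, here is a minimal sketch on a toy table. It is not part of the commit: `flights_toy` and its values are made up for illustration and stand in for the vignette's `flights` data, and the names in `i` are given in the same order as the `on` columns.

```r
library(data.table)

# Hypothetical stand-in for the vignette's `flights` table (values are invented).
flights_toy <- data.table(
  origin    = c("JFK", "JFK", "LGA", "LGA", "EWR"),
  dest      = c("LAX", "LAX", "TPA", "XNA", "XNA"),
  arr_delay = c(10L, -5L, 30L, 2L, 7L)
)

# Unnamed values in `i` are paired with the `on` columns by position.
unnamed <- flights_toy[.("JFK", "LAX"), on = c("origin", "dest")]

# Naming the values spells out which key each one belongs to.
named <- flights_toy[.(origin = "JFK", dest = "LAX"), on = c("origin", "dest")]

# Both spellings should select the same rows.
stopifnot(isTRUE(all.equal(unnamed, named)))

# mult = "last" keeps one row per query; nomatch = NULL drops queries
# with no match (here there is no JFK -> XNA row in the toy data).
last_match <- flights_toy[.(c("LGA", "JFK", "EWR"), "XNA"),
                          on = c("origin", "dest"),
                          mult = "last", nomatch = NULL]
print(last_match)
```

The toy table keeps the check independent of the `flights` CSV the vignette loads; the same calls should apply to the real data.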
