Commit d88274f

merge changes on .Rmd from master, add automated links
1 parent d6dee12 commit d88274f

22 files changed

+2017
-79
lines changed

vignettes/datatable-faq.Rmd

Lines changed: 4 additions & 4 deletions
@@ -26,6 +26,10 @@ h2 {
2626
}
2727
</style>
2828

29+
```{r echo=FALSE, file='_translation_links.R'}
30+
```
31+
`r .write.translation.links("Translations of this document are available in: %s")`
32+
2933
```{r, echo = FALSE, message = FALSE}
3034
library(data.table)
3135
knitr::opts_chunk$set(
@@ -37,10 +41,6 @@ knitr::opts_chunk$set(
3741
.old.th = setDTthreads(1)
3842
```
3943

40-
```{r echo=FALSE, file='_translation_links.R'}
41-
```
42-
`r .write.translation.links("Translations of this document are available in: %s")`
43-
4444
The first section, Beginner FAQs, is intended to be read in order, from start to finish. It's just written in a FAQ style to be digested more easily. It isn't really the most frequently asked questions. A better measure for that is looking on Stack Overflow.
4545

4646
This FAQ is required reading and considered core documentation. Please do not ask questions on Stack Overflow or raise issues on GitHub until you have read it. We can all tell when you ask that you haven't read it. So if you do ask and haven't read it, don't use your real name.

vignettes/datatable-intro.Rmd

Lines changed: 29 additions & 9 deletions
@@ -9,6 +9,10 @@ vignette: >
99
\usepackage[utf8]{inputenc}
1010
---
1111

12+
```{r echo=FALSE, file='_translation_links.R'}
13+
```
14+
`r .write.translation.links("Translations of this document are available in: %s")`
15+
1216
```{r, echo = FALSE, message = FALSE}
1317
require(data.table)
1418
knitr::opts_chunk$set(
@@ -21,10 +25,6 @@ knitr::opts_chunk$set(
2125
.old.th = setDTthreads(1)
2226
```
2327

24-
```{r echo=FALSE, file='_translation_links.R'}
25-
```
26-
`r .write.translation.links("Translations of this document are available in: %s")`
27-
2828
This vignette introduces the `data.table` syntax, its general form, how to *subset* rows, *select and compute* on columns, and perform aggregations *by group*. Familiarity with the `data.frame` data structure from base R is useful, but not essential to follow this vignette.
2929

3030
***
@@ -316,7 +316,7 @@ ans
316316

317317
We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, that would first have to subset the entire `data.table` for the *row indices* matched in `i` *and then* count the rows of the result using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.
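
As a rough illustration of this comparison, the sketch below uses a small made-up table (`flights_toy` is a hypothetical stand-in, not the real `flights` data) to check that both forms return the same count. Only the `.N` form avoids building the intermediate subset.

```r
library(data.table)

# Toy stand-in for the `flights` data (values are invented for illustration).
flights_toy <- data.table(
  origin = c("JFK", "JFK", "LGA", "JFK", "EWR"),
  month  = c(6L, 6L, 6L, 7L, 6L)
)

# Count the matching rows directly in j using .N.
n_direct <- flights_toy[origin == "JFK" & month == 6L, .N]

# Materialise the subset first, then count its rows.
n_subset <- nrow(flights_toy[origin == "JFK" & month == 6L])

# Both approaches agree on the count.
identical(n_direct, n_subset)
```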
318318

319-
### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
319+
### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer-j}
320320

321321
If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).
322322

@@ -422,7 +422,7 @@ ans
422422
423423
We'll use this convenient form wherever applicable hereafter.
424424
425-
#### -- How can we calculate the number of trips for each origin airport for carrier code `"AA"`? {#origin-.N}
425+
#### -- How can we calculate the number of trips for each origin airport for carrier code `"AA"`? {#origin-N}
426426
427427
The unique carrier code `"AA"` corresponds to *American Airlines Inc.*
428428
@@ -435,7 +435,7 @@ ans
435435

436436
* Using those *row indices*, we obtain the number of rows while grouped by `origin`. Once again no columns are actually materialised here, because the `j-expression` does not require any columns to be actually subsetted and is therefore fast and memory efficient.
437437

438-
#### -- How can we get the total number of trips for each `origin, dest` pair for carrier code `"AA"`? {#origin-dest-.N}
438+
#### -- How can we get the total number of trips for each `origin, dest` pair for carrier code `"AA"`? {#origin-dest-N}
439439

440440
```{r}
441441
ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -483,7 +483,7 @@ We'll learn more about `keys` in the [`vignette("datatable-keys-fast-subset", pa
483483

484484
### c) Chaining
485485

486-
Let's reconsider the task of [getting the total number of trips for each `origin, dest` pair for carrier *"AA"*](#origin-dest-.N).
486+
Let's reconsider the task of [getting the total number of trips for each `origin, dest` pair for carrier *"AA"*](#origin-dest-N).
487487

488488
```{r}
489489
ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -583,7 +583,7 @@ We are almost there. There is one little thing left to address. In our `flights`
583583

584584
Using the argument `.SDcols`. It accepts either column names or column indices. For example, `.SDcols = c("arr_delay", "dep_delay")` ensures that `.SD` contains only these two columns for each group.
585585

586-
Similar to [part g)](#refer_j), you can also specify the columns to remove instead of columns to keep using `-` or `!`. Additionally, you can select consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or `-(colA:colB)`.
586+
Similar to [part h)](#refer-j), you can also specify the columns to remove instead of columns to keep using `-` or `!`. Additionally, you can select consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or `-(colA:colB)`.
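
The selection forms described here can be sketched on a toy table. The table `DT` and its column names below are hypothetical and chosen only for illustration; `names(.SD)` is used simply to show which columns end up in `.SD`.

```r
library(data.table)

# Hypothetical toy table; the column names are illustrative only.
DT <- data.table(id = c("a", "a", "b"), x = 1:3, y = 4:6, z = 7:9)

keep_named <- DT[, names(.SD), .SDcols = c("x", "y")]    # keep columns by name
drop_named <- DT[, names(.SD), .SDcols = !c("id", "z")]  # drop columns by name
keep_range <- DT[, names(.SD), .SDcols = x:y]            # keep consecutive columns
drop_range <- DT[, names(.SD), .SDcols = !(x:y)]         # drop consecutive columns

keep_named
drop_range
```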
587587

588588
Now let us try to use `.SD` along with `.SDcols` to get the `mean()` of `arr_delay` and `dep_delay` columns grouped by `origin`, `dest` and `month`.
589589

@@ -643,6 +643,26 @@ DT[, print(list(c(a,b))), by = ID] # (2)
643643

644644
In (1), for each group, a vector is returned, with length = 6,4,2 here. However, (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore, (1) results in a length of ` 6+4+2 = `r 6+4+2``, whereas (2) returns `1+1+1=`r 1+1+1``.
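
A minimal sketch of the same contrast, assuming a toy table with group sizes 3, 2 and 1 (this `DT` is an assumption made here, since its definition falls outside the lines shown). Returning a plain vector from `j` yields one row per element, while wrapping it in `list()` yields one list-column cell per group.

```r
library(data.table)

# Assumed toy table: groups "b", "a", "c" have 3, 2 and 1 rows respectively.
DT <- data.table(ID = c("b", "b", "b", "a", "a", "c"), a = 1:6, b = 7:12)

# (1) A vector per group: the group results are stacked row-wise.
flat <- DT[, .(val = c(a, b)), by = ID]

# (2) A list of length 1 per group: each group occupies a single row.
nested <- DT[, .(val = list(c(a, b))), by = ID]

nrow(flat)
nrow(nested)
lengths(nested$val)
```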
645645

646+
The flexibility of `j` allows us to store any list object as an element of a `data.table`. For example, when statistical models are fit to groups, these models can be stored in a `data.table`. The code is concise and easy to understand.
647+
648+
```{r}
649+
## Do long-distance flights make up for departure delays more than short-distance flights?
650+
## Does the amount made up vary by month?
651+
flights[, `:=`(makeup = dep_delay - arr_delay)]
652+
653+
makeup.models <- flights[, .(fit = list(lm(makeup ~ distance))), by = .(month)]
654+
makeup.models[, .(coefdist = coef(fit[[1]])[2], rsq = summary(fit[[1]])$r.squared), by = .(month)]
655+
```
656+
With `data.frame`s, we need more complicated code to obtain the same result.
657+
```{r}
658+
setDF(flights)
659+
flights.split <- split(flights, f = flights$month)
660+
makeup.models.list <- lapply(flights.split, function(df) c(month = df$month[1], fit = list(lm(makeup ~ distance, data = df))))
661+
makeup.models.df <- do.call(rbind, makeup.models.list)
662+
sapply(makeup.models.df[, "fit"], function(model) c(coefdist = coef(model)[2], rsq = summary(model)$r.squared)) |> t() |> data.frame()
663+
setDT(flights)
664+
```
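
For readers without the `flights` data at hand, the list-column pattern can be sketched on simulated data. Everything below (the table `toy`, its columns `g`, `x`, `y`, and the simulated relationship) is an assumption made for illustration, not part of the commit.

```r
library(data.table)

set.seed(1)

# Simulated data: two groups, with y roughly equal to 2 * x plus noise.
toy <- data.table(g = rep(c("a", "b"), each = 10), x = rep(1:10, 2))
toy[, y := 2 * x + rnorm(.N)]

# Fit one linear model per group and store each fit in a list column.
models <- toy[, .(fit = list(lm(y ~ x))), by = g]

# Extract a summary statistic from each stored model.
slopes <- models[, .(slope = coef(fit[[1]])[["x"]]), by = g]
slopes
```

Wrapping the `lm()` result in `list()` is what keeps each model in a single cell; without it, `j` would try to spread the model's components across rows.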
665+
646666
## Summary
647667

648668
The general form of `data.table` syntax is:
