Commit ec9a2af

added NA section for summarize + across
1 parent 15ec161 commit ec9a2af

1 file changed

+46
-19
lines changed

wrangling.Rmd

Lines changed: 46 additions & 19 deletions
@@ -1195,7 +1195,7 @@ We show an example of this below.
11951195

11961196
First we create a seemingly innocuous NA
11971197
in the first row of the `region_lang` data frame,
1198-
in the most_at_home column:
1198+
in the `most_at_home` column:
11991199

12001200
```{r}
12011201
region_lang_na <- region_lang
@@ -1262,12 +1262,14 @@ group_by(region_lang, region)
12621262

12631263
### Calculating summary statistics on many columns
12641264

1265+
#### `summarize` + `across` for calculating summary statistics on many columns
1266+
12651267
Sometimes we need to summarize statistics across many columns.
12661268
In such a case, using `summarize` alone means that we have to
12671269
type out the name of each column we want to summarize.
12681270
To do this more efficiently, we can pair `summarize` with `across`
12691271
and use the same syntax we use with the `select` function to
1270-
specify which columns we would like to perform the statistical summarries on,
1272+
specify which columns we would like to perform the statistical summaries on,
12711273
as well as which function to use to calculate these.
12721274
Here we demonstrate finding the maximum value of each of the numeric
12731275
columns of the `region_lang` data set.
@@ -1277,6 +1279,29 @@ region_lang |>
12771279
summarize(across(mother_tongue:lang_known, max))
12781280
```
12791281

1282+
> **Note on calculating summary statistics with `summarize` + `across`**
1283+
> **when there are NA's**:
1284+
>
1285+
> As when we use base R statistical summary functions
1286+
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize` alone,
1287+
> `summarize` + `across` paired
1288+
> with base R statistical summary functions
1289+
> also returns `NA` when applied to columns that
1290+
> contain `NA`s in the data frame.
1291+
>
1292+
> To avoid this, we again need to add the argument `na.rm = TRUE`,
1293+
> but we pass it a little differently here:
1294+
> we add a `,` and then `na.rm = TRUE`
1295+
> after the function we want `summarize` + `across` to apply,
1296+
> as illustrated below:
1297+
>
1298+
> ``` {r}
1299+
> region_lang_na |>
1300+
> summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
1301+
> ```
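
A note for readers on newer dplyr: as of dplyr 1.1.0, passing extra arguments such as `na.rm = TRUE` through `across()` is deprecated in favour of an anonymous function. The sketch below shows that form on a small made-up data frame. The name `toy_na` and its values are hypothetical stand-ins for `region_lang_na`, not data from the book.

```r
library(dplyr)

# Hypothetical stand-in for `region_lang_na`: three numeric columns,
# with one NA in `most_at_home`.
toy_na <- tibble(
  region = c("Toronto", "Toronto", "Vancouver"),
  mother_tongue = c(50, 120, 80),
  most_at_home = c(NA, 15, 30),
  lang_known = c(200, 90, 310)
)

# na.rm = TRUE is supplied inside the anonymous function,
# not as an extra argument to across().
toy_max <- toy_na |>
  summarize(across(mother_tongue:lang_known, \(x) max(x, na.rm = TRUE)))

toy_max
```

The `\(x)` shorthand needs R 4.1 or later, the same version that introduced the `|>` pipe used throughout this chapter.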
1302+
1303+
#### `map` for calculating summary statistics on many columns
1304+
12801305
An alternative to `summarize` and `across`
12811306
for applying a function to many columns is the `map` family of functions.
12821307
Let's again find the maximum value of each column of the
@@ -1339,21 +1364,23 @@ region_lang |>
13391364
Which `map` function you choose depends on what you want to do with the
13401365
output; you don't always have to pick `map_dfc`!
13411366

1342-
Similarly to when we use base R statistical summary functions
1343-
(e.g., `max`, `mix`, `mean`, `sum`, etc) with `summarize`,
1344-
`map` functions paired with base R statistical summary functions
1345-
also return NA's when we apply them to columns that
1346-
contain NAs in the data frame.
1347-
1348-
To avoid this, again we need to add the argument `na.rm = TRUE`.
1349-
When we use this with `map` we do this by adding a `,` and then `na.rm = TRUE`,
1350-
after specifying the function we want map to apply, as illustrated below:
1351-
1352-
``` {r}
1353-
region_lang |>
1354-
select(mother_tongue:lang_known) |>
1355-
map_dfc(max, na.rm = TRUE)
1356-
```
1367+
> **Note on calculating summary statistics with `map` when there are NA's**:
1368+
>
1369+
> Similarly to when we use base R statistical summary functions
1370+
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
1371+
> `map` functions paired with base R statistical summary functions
1372+
> also return NA's when we apply them to columns that
1373+
> contain NAs in the data frame.
1374+
>
1375+
> To avoid this, again we need to add the argument `na.rm = TRUE`.
1376+
> When we use this with `map` we do this by adding a `,` and then `na.rm = TRUE`,
1377+
> after specifying the function we want `map` to apply, as illustrated below:
1378+
>
1379+
> ``` {r}
1380+
> region_lang_na |>
1381+
> select(mother_tongue:lang_known) |>
1382+
> map_dfc(max, na.rm = TRUE)
1383+
> ```
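
To illustrate the earlier point that `map_dfc` is not the only choice, here is a minimal sketch of the same `na.rm = TRUE` pattern with `map_dbl`, which returns a named numeric vector instead of a data frame. The data frame `toy_na` and its values are hypothetical, chosen only to keep the example self-contained.

```r
library(purrr)
library(tibble)

# Hypothetical numeric columns, with one NA in `most_at_home`.
toy_na <- tibble(
  mother_tongue = c(50, 120, 80),
  most_at_home = c(NA, 15, 30),
  lang_known = c(200, 90, 310)
)

# Extra arguments after the function are passed on to it,
# so each column is summarized with max(column, na.rm = TRUE).
toy_max <- toy_na |>
  map_dbl(max, na.rm = TRUE)

toy_max
```

A named vector like this is convenient when the result feeds into further arithmetic, while `map_dfc` may be the better fit when the result should stay a data frame.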
13571384
13581385
The `map` family functions are generally quite useful for solving many problems
13591386
involving repeatedly applying functions in R.
@@ -1480,9 +1507,9 @@ Table: (#tab:summary-functions-table) Summary of wrangling functions
14801507
`pivot_longer`/`pivot_wider` and `separate`, but also covers missing values
14811508
and additional wrangling functions (like `unite`). The [data
14821509
transformation](https://r4ds.had.co.nz/transform.html) chapter covers
1483-
`select`, `filter`, `arrange`, `mutate`, and `summarize`. And the [`map_*`
1510+
`select`, `filter`, `arrange`, `mutate`, and `summarize`. And the [`map`
14841511
functions](https://r4ds.had.co.nz/iteration.html#the-map-functions) chapter
1485-
provides more about the `map_*` functions.
1512+
provides more about the `map` functions.
14861513
- You will occasionally encounter a case where you need to iterate over items
14871514
in a data frame, but none of the above functions are flexible enough to do
14881515
what you want. In that case, you may consider using [a for
