Commit d03b1d0

edited wording in rowwise section
1 parent 3a651da commit d03b1d0

1 file changed: +33 -66 lines changed

wrangling.Rmd

Lines changed: 33 additions & 66 deletions
@@ -790,10 +790,10 @@ lang_messy_longer <- pivot_longer(lang_messy,
790790
cols = Toronto:Edmonton,
791791
names_to = "region",
792792
values_to = "value")
793-
tidy_lang <- separate(lang_messy_longer, col = value,
793+
tidy_lang_chr <- separate(lang_messy_longer, col = value,
794794
into = c("most_at_home", "most_at_work"),
795795
sep = "/")
796-
official_langs_chr <- filter(tidy_lang, category == "Official languages")
796+
official_langs_chr <- filter(tidy_lang_chr, category == "Official languages")
797797
798798
official_langs_chr
799799
```
@@ -1129,7 +1129,7 @@ and pipe that into more functions after that.
11291129

11301130
As a part of many data analyses, we need to calculate a summary value for the
11311131
data (a *summary statistic*). Examples of summary statistics we might want to calculate are the
1132-
number of observations, the average/mean value for a column, the minimum value etc.
1132+
number of observations, the average/mean value for a column, the minimum value, etc.
11331133
A useful `dplyr` function for calculating summary statistics is
11341134
`summarize`, where the first argument is the data frame and the subsequent arguments
11351135
are the summaries we want to perform. Below we show how to use the `summarize` function to
@@ -1150,45 +1150,13 @@ primary language at home is spoken by
11501150
`r format(lang_summary$most_most_at_home[1], scientific = FALSE, big.mark = ",")`
11511151
people.
11521152

1153-
<!-- Suppose we wanted to find the maximum value for all the numeric columns in the `tidy_lang` data set. -->
1154-
1155-
<!-- We could apply `summarize` in the same way that we did above to find the maximum values: -->
1156-
1157-
<!-- ```{r} -->
1158-
1159-
<!-- lang_summary_max <- summarize(tidy_lang, -->
1160-
1161-
<!-- most_most_at_home = max(most_at_home), -->
1162-
1163-
<!-- most_most_at_work = max(most_at_work)) -->
1164-
1165-
<!-- lang_summary_max -->
1166-
1167-
<!-- ``` -->
1168-
1169-
<!-- The approach above is a valid way to do this, but if we had many numeric columns in our data set then this method would take a lot of time since we would have to explicitly write out the name of each column! A faster and less error-prone way to apply function(s) to columns that satisfy a certain condition is to use the `summarize_if` function. The first argument is the data set we want to summarize (`tidy_lang`). The second argument is the required condition, here if a particular column is numeric then the function will be applied. The third argument is the function we want to summarize with, here `max`. Therefore we write: -->
1170-
1171-
<!-- ```{r 02-summarize-if} -->
1172-
1173-
<!-- summarize_if(tidy_lang, -->
1174-
1175-
<!-- is.numeric, -->
1176-
1177-
<!-- max) -->
1178-
1179-
<!-- ``` -->
1180-
1181-
<!-- Notice that we get the same output as we did above! From the table, we see that the most commonly spoken -->
1182-
1183-
<!-- primary language at home is spoken by X people and the most commonly spoken language at work is spoken by X people. -->
1184-
11851153
### Calculating group summary statistics:
11861154

11871155
A common pairing with `summarize` is `group_by`. Pairing these functions
11881156
together can let you summarize values for subgroups within a data set. For
1189-
example, here, we can use `group_by` to group the regions and then calculate the
1190-
minimum and maximum number of Canadians reporting the language as the primary
1191-
language at home for each of the groups.
1157+
example, here, we can use `group_by` to group the regions of the `tidy_lang` dataframe
1158+
and then calculate the minimum and maximum number of Canadians
1159+
reporting the language as the primary language at home for each of the groups.
11921160

11931161
The `group_by` function takes at least two arguments. The first is the data
11941162
frame that will be grouped, and the second and onwards are columns to use in the
@@ -1205,7 +1173,7 @@ lang_summary_by_region
12051173
```
12061174

12071175
Notice that `group_by` on its own doesn't change the way the data looks. In the output below
1208-
the data set looks the same, and it doesn't *appear* to be grouped by `region`.
1176+
the grouped data set looks the same, and it doesn't *appear* to be grouped by `region`.
12091177
Instead, `group_by` simply changes how other functions work with the data, as we saw with `summarize` above.
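
The behaviour described above can be sketched on a small example. The `toy_lang` tibble and its values are invented for illustration and are not the census data used in the chapter; the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch (not the book's data set)
toy_lang <- tibble(
  region = c("Toronto", "Toronto", "Vancouver", "Vancouver"),
  language = c("English", "French", "English", "French"),
  most_at_home = c(100, 20, 80, 10)
)

# Grouping alone keeps every row and column; it only records the grouping
grouped <- group_by(toy_lang, region)
nrow(grouped) == nrow(toy_lang)
group_vars(grouped)

# summarize then returns one row per group instead of one row overall
by_region <- summarize(grouped,
                       min_most_at_home = min(most_at_home),
                       max_most_at_home = max(most_at_home))
by_region
```

The grouped tibble has the same four rows as the original, while the summary has one row per region.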
12101178

12111179
```{r}
@@ -1387,42 +1355,40 @@ iteration. Additionally, their use is not limited to columns of a data frame;
13871355
`map_*` functions can be used to apply functions to elements of a vector or
13881356
list, and even to lists of data frames, or nested data frames.
13891357

1390-
## Iterating over rows in a data frame with `rowwise()`
1358+
## Apply functions across columns within one row with `rowwise`
13911359

1392-
1393-
What if you want to apply a function across rows instead of columns?
1360+
What if you want to apply a function across columns but within one row?
13941361
For instance, suppose we want to know the maximum value between `mother_tongue`,
1395-
`most_at_home`, `most_at_work` and `lang_known` for each language in Vancouver.
1362+
`most_at_home`, `most_at_work` and `lang_known` for each language in the `region_lang` data set.
13961363
In other words, we want to apply the `max` function row-wise. We will use the aptly
1397-
named function `rowwise` to accomplish this task. First, we `filter` the data for
1398-
only the languages in Vancouver. We also `select` specific columns simply
1399-
so we can see all the columns in the data frame output
1400-
but note that this step is not strictly necessary.
1364+
named function `rowwise` in combination with `mutate` to accomplish this task.
1365+
>**Note:** Before we apply `rowwise` we will `select` only the count columns
1366+
so we can see all the columns in the dataframe's output easily in the book.
14011367

1402-
```{r vancouver_filter}
1403-
vancouver_lang <- region_lang |>
1404-
filter(region == "Vancouver") |>
1405-
select(region, language:lang_known)
1406-
vancouver_lang
1407-
```
1408-
Similar to `group_by`, `rowwise` doesn't do anything when it is called by itself,
1409-
however, we can apply `rowwise` in combination with other functions to change how
1410-
these other functions operate on the data. We will use `rowwise` and `mutate`
1411-
to find the maximum count for each language in the data set.
14121368
```{r}
1413-
vancouver_lang |>
1369+
region_lang |>
1370+
select(mother_tongue:lang_known) |>
14141371
rowwise() |>
14151372
mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known)))
14161373
```
1374+
1375+
Similar to `group_by`, `rowwise` doesn't do anything when it is called by itself;
1376+
however, we can apply `rowwise` in combination with other functions to change how
1377+
these other functions operate on the data.
14171378
Notice if we used `mutate` without `rowwise`, we would have computed the maximum
1418-
value across *all* rows rather than the maximum value for *each* row. Therefore in the output below
1419-
`r format(vancouver_lang |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) |> slice(1) |> pull(maximum), scientific = FALSE, big.mark = ",")` is reported as the maximum value in every single row since it is
1379+
value across *all* rows rather than the maximum value for *each* row.
1380+
Therefore in the output below the same maximum value is reported
1381+
in every single row since it is
14201382
the maximum value among *all* the rows, so this code is not doing what we want.
14211383

14221384
```{r}
1423-
vancouver_lang |>
1385+
region_lang |>
1386+
select(mother_tongue:lang_known) |>
14241387
mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known)))
14251388
```
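
The difference between the two calls can be checked on a small example. The `toy_counts` tibble below is invented for illustration and only borrows the column names from `region_lang`; this is a sketch assuming `dplyr` is installed and R is version 4.1 or later (for the `|>` pipe).

```r
library(dplyr)

# Hypothetical counts, invented for this sketch (not the region_lang data)
toy_counts <- tibble(
  mother_tongue = c(5, 50, 7),
  most_at_home  = c(9, 10, 3),
  most_at_work  = c(2, 30, 8),
  lang_known    = c(4, 20, 6)
)

# With rowwise: max is computed separately within each row
with_rowwise <- toy_counts |>
  rowwise() |>
  mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) |>
  ungroup()

# Without rowwise: max sees all values in all four columns at once
without_rowwise <- toy_counts |>
  mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known)))

with_rowwise$maximum     # 9 50 8
without_rowwise$maximum  # 50 50 50
```

The trailing `ungroup()` is optional here; it is included so that later operations on `with_rowwise` are not applied row by row.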
1389+
1390+
## Summary
1391+
14261392
Cleaning and wrangling data can be a very time-consuming process; however,
14271393
it is a critical step in any data analysis. We have explored many different
14281394
functions for cleaning and wrangling data into a tidy format.
@@ -1435,16 +1401,17 @@ Table: (#tab:summary-functions-table) Summary of wrangling functions
14351401

14361402
| Function | Description |
14371403
| --- | ----------- |
1404+
| `across` | allows you to apply function(s) to multiple columns |
1405+
| `filter` | subsets rows of a data frame |
1406+
| `group_by` | allows you to apply function(s) to groups of rows |
1407+
| `mutate` | adds or modifies columns in a data frame |
1408+
| `map` | general iteration function |
14381409
| `pivot_longer` | generally makes the data frame longer and narrower |
1410+
| `rowwise` | applies functions across columns within one row |
14391411
| `pivot_wider` | generally makes a data frame wider and decreases the number of rows |
14401412
| `separate` | splits up a character column into multiple columns |
14411413
| `select` | subsets columns of a data frame |
1442-
| `filter` | subsets rows of a data frame |
1443-
| `mutate` | adds or modifies columns in a data frame |
14441414
| `summarize` | calculates summaries of inputs |
1445-
| `group_by` | allows you to apply function(s) to groups of rows |
1446-
| `across` | allows you to apply function(s) to multiple columns |
1447-
| `rowwise` | allows you to apply function(s) across rows of a data frame |
14481415

14491416
## Additional resources
14501417
