Commit b8677f7

Merge pull request #387 from UBC-DSCI/wrangling_edit
wrangling edit pass
2 parents ea1c317 + f84040e commit b8677f7

File tree: wrangling.Rmd (1 file changed, +50 -46 lines)


wrangling.Rmd

Lines changed: 50 additions & 46 deletions
@@ -19,14 +19,14 @@ application, providing more practice working through a whole case study.

 ## Chapter learning objectives

-By the end of the chapter, readers will be able to:
-
-- define the term "tidy data"
-- discuss the advantages of storing data in a tidy data format
-- define what vectors, lists, and data frames are in R, and describe how they relate to
-each other
-- describe the common types of data in R and their uses
-- recall and use the following functions for their
+By the end of the chapter, readers will be able to do the following:
+
+- Define the term "tidy data".
+- Discuss the advantages of storing data in a tidy data format.
+- Define what vectors, lists, and data frames are in R, and describe how they relate to
+each other.
+- Describe the common types of data in R and their uses.
+- Recall and use the following functions for their
 intended data wrangling tasks:
 - `across`
 - `c`
@@ -41,7 +41,7 @@ By the end of the chapter, readers will be able to:
 - `rowwise`
 - `separate`
 - `summarize`
-- recall and use the following operators for their
+- Recall and use the following operators for their
 intended data wrangling tasks:
 - `==`
 - `%in%`
@@ -75,8 +75,8 @@ that is designed to store observations, variables, and their values.
 Most commonly, each column in a data frame corresponds to a variable,
 and each row corresponds to an observation. For example, Figure
 \@ref(fig:02-obs) displays a data set of city populations. Here, the variables
-are "region, year, population;" each of these are properties that can be
-collected or measured. The first observation is "Toronto, 2016, 2235145;"
+are "region, year, population"; each of these are properties that can be
+collected or measured. The first observation is "Toronto, 2016, 2235145";
 these are the values that the three variables take for the first entity in the
 data set. There are 13 entities in the data set in total, corresponding to the
 13 rows in Figure \@ref(fig:02-obs).
@@ -420,7 +420,7 @@ before the maximum can be computed.
 In comparison, if the data were tidy,
 all we would have to do is compute the maximum value for the commuter column.
 To reshape this untidy data set to a tidy (and in this case, wider) format,
-we need to create a column called "population", "commuters", and "incorporated."
+we need to create columns called "population", "commuters", and "incorporated."
 This is illustrated in the right table of Figure \@ref(fig:long-to-wide).

 ``` {r long-to-wide, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Going from long to wide data.", fig.retina = 2, out.width = "100%"}
@@ -486,8 +486,8 @@ The data above is now tidy! We can go through the three criteria again to check
 that this data is a tidy data set.

 1. All the statistical variables are their own columns in the data frame (i.e.,
-`most_at_home`, and `most_at_work`) have been separated into their own
-columns in the data frame.
+`most_at_home`, and `most_at_work` have been separated into their own
+columns in the data frame).
 2. Each observation, (i.e., each language in a region) is in a single row.
 3. Each value is a single cell (i.e., its row, column position in the data
 frame is not shared with another value).
@@ -567,7 +567,7 @@ analyze. But we aren't done yet! Notice in the table above that the word
 `<chr>` appears beneath each of the column names. The word under the column name
 indicates the data type of each column. Here all of our variables are
 "character" data types. Recall, character data types are letter(s) or digits(s)
-surrounded by quotes. In the previous example in section \@ref(pivot-wider), the
+surrounded by quotes. In the previous example in Section \@ref(pivot-wider), the
 `most_at_home` and `most_at_work` variables were `<dbl>` (double)&mdash;you can
 verify this by looking at the tables in the previous sections&mdash;which is a type
 of numeric data. This change is due to the delimiter (`/`) when we read in this
@@ -773,6 +773,8 @@ five_cities <- filter(region_data,
 five_cities
 ```

+\newpage
+
 > **Note:** What's the difference between `==` and `%in%`? Suppose we have two
 > vectors, `vectorA` and `vectorB`. If you type `vectorA == vectorB` into R it
 > will compare the vectors element by element. R checks if the first element of
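For readers who want to see this behavior concretely, here is a minimal R sketch of the distinction the note describes, using two made-up vectors (not data from the chapter):

```r
# Two made-up character vectors for illustration.
vectorA <- c("Toronto", "Montréal", "Vancouver")
vectorB <- c("Vancouver", "Toronto", "Montréal")

# `==` compares element by element (first with first, second with second, ...),
# so the same cities listed in a different order do not match.
vectorA == vectorB
#> [1] FALSE FALSE FALSE

# `%in%` asks, for each element of vectorA, whether it appears anywhere in vectorB.
vectorA %in% vectorB
#> [1] TRUE TRUE TRUE
```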
@@ -795,20 +797,20 @@ census_popn <- 35151728
 most_french <- 2669195
 ```

-We saw in section \@ref(filter-and) that
+We saw in Section \@ref(filter-and) that
 `r format(most_french, scientific = FALSE, big.mark = ",")` people reported
 speaking French in Montréal as their primary language at home.
 If we are interested in finding the official languages in regions
 with higher numbers of people who speak it as their primary language at home
-compared to French in Montréal then we can use `filter` to obtain rows
+compared to French in Montréal, then we can use `filter` to obtain rows
 where the value of `most_at_home` is greater than
 `r format(most_french, scientific = FALSE, big.mark = ",")`.

 ``` {r}
 filter(official_langs, most_at_home > 2669195)
 ```

-`filter` returns a data frame with only one row indicating that when
+`filter` returns a data frame with only one row, indicating that when
 considering the official languages,
 only English in Toronto is reported by more people
 as their primary language at home
@@ -818,7 +820,7 @@ than French in Montréal according to the 2016 Canadian census.

 ### Using `mutate` to modify columns
 In Section \@ref(separate),
-when we first read in the `"region_lang_top5_cities_messy.csv"` data
+when we first read in the `"region_lang_top5_cities_messy.csv"` data,
 all of the variables were "character" data types. \index{mutate}
 During the tidying process,
 we used the `convert` argument from the `separate` function
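As a hedged illustration of the kind of conversion this passage is about, the sketch below uses `mutate` with `as.numeric` on a small hand-made tibble; the column names and values are assumptions for the example, not the chapter's data set:

```r
library(tidyverse)

# A tiny hand-made data frame whose count column was read in as character.
langs <- tibble(category     = c("Official languages", "Aboriginal languages"),
                most_at_home = c("3836770", "13260"))

# mutate overwrites the column with a numeric version of itself.
langs <- mutate(langs, most_at_home = as.numeric(most_at_home))
langs
```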
@@ -905,7 +907,7 @@ number, we need context. In particular, how many people were in Toronto when
 this data was collected? From the 2016 Canadian census profile, the population
 of Toronto was reported to be
 `r format(toronto_popn, scientific = FALSE, big.mark = ",")` people.
-The number of people who report that English as their primary language at home
+The number of people who report that English is their primary language at home
 is much more meaningful when we report it in this context.
 We can even go a step further and transform this count to a relative frequency
 or proportion.
@@ -923,9 +925,9 @@ for our five cities of focus in this chapter.
 To accomplish this, we will need to do two tasks
 beforehand:

-1. create a vector containing the population values for our cities
-2. filter the `official_langs` data frame
-so that we only keep the rows where the language is English
+1. Create a vector containing the population values for our cities.
+2. Filter the `official_langs` data frame
+so that we only keep the rows where the language is English.

 To create a vector containing the population values for our cities
 (Toronto, Montréal, Vancouver, Calgary, Edmonton),
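A rough sketch of those two preparation steps is below. The `official_langs` rows, the column names, and the population figures are stand-ins invented for the example (only two cities are shown), so treat it as an outline rather than the chapter's actual code:

```r
library(tidyverse)

# A small stand-in for the chapter's official_langs data frame.
official_langs <- tribble(
  ~region,    ~language,  ~most_at_home,
  "Toronto",  "English",        3800000,
  "Toronto",  "French",           30000,
  "Montréal", "English",         620000,
  "Montréal", "French",         2669195
)

# Step 1: a vector of (placeholder) city populations; the order must match
# the order of the cities in the filtered data frame below.
city_pops <- c(5900000, 4100000)

# Step 2: keep only the rows where the language is English.
english_langs <- filter(official_langs, language == "English")
english_langs
```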
@@ -963,7 +965,7 @@ same order as the cities were listed in the `english_langs` data frame.
 This is because R will perform the division computation we did by dividing
 each element of the `most_at_home` column by each element of the
 `city_pops` vector, matching them up by position.
-Failing to do this would have resulted in the incorrect math to be performed.
+Failing to do this would have resulted in the incorrect math being performed.

 > **Note:** In more advanced data wrangling,
 > one might solve this problem in a less error-prone way though using
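The positional matching described above can be seen with two plain vectors; this tiny sketch uses made-up numbers:

```r
# Made-up counts and populations, in the same (matching) city order.
counts <- c(3800000, 620000)
pops   <- c(5900000, 4100000)

# R divides element by element: the first count by the first population,
# the second count by the second population, and so on.
counts / pops   # roughly 0.64 and 0.15
```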
@@ -1010,7 +1012,7 @@ frame. The basic ways of doing this can become quickly unreadable if there are
 many steps. For example, suppose we need to perform three operations on a data
 frame called `data`: \index{pipe}\index{aaapipesymb@\vert{}>|see{pipe}}

-1) add a new column `new_col` that is double another `old_col`
+1) add a new column `new_col` that is double another `old_col`,
 2) filter for rows where another column, `other_col`, is more than 5, and
 3) select only the new column `new_col` for those rows.

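A sketch of what those three steps can look like when chained with the pipe is shown below; the `data` tibble and its values are made up for illustration:

```r
library(tidyverse)

# A made-up data frame with the column names used in the description.
data <- tibble(old_col = 1:4, other_col = c(2, 6, 7, 3))

output <- data |>
  mutate(new_col = 2 * old_col) |>  # 1) add new_col, double of old_col
  filter(other_col > 5) |>          # 2) keep rows where other_col is more than 5
  select(new_col)                   # 3) keep only the new column
output
```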
@@ -1060,7 +1062,7 @@ output <- data |>

 > **Note:** You might also have noticed that we split the function calls across
 > lines after the pipe, similar to when we did this earlier in the chapter
-> for long function calls. Again this is allowed and recommended, especially when
+> for long function calls. Again, this is allowed and recommended, especially when
 > the piped function calls create a long line of code. Doing this makes
 > your code more readable. When you do this, it is important to end each line
 > with the pipe operator `|>` to tell R that your code is continuing onto the
@@ -1074,9 +1076,9 @@ output <- data |>
 > (which in turn imports the `magrittr` R package).
 > There are some other differences between `%>%` and `|>` related to
 > more advanced R uses, such as sharing and distributing code as R packages,
-> however these are beyond the scope of this textbook.
+> however, these are beyond the scope of this textbook.
 > We have this note in the book to make the reader aware that `%>%` exists
-> as it still commonly used in data analysis code and in many data science
+> as it is still commonly used in data analysis code and in many data science
 > books and other resources.
 > In most cases these two pipes are interchangeable and either can be used.

@@ -1112,7 +1114,7 @@ van_data_selected

 Although this is valid code, there is a more readable approach we could take by
 using the pipe, `|>`. With the pipe, we do not need to create an intermediate
-object to store the output from `filter`. Instead we can directly send the
+object to store the output from `filter`. Instead, we can directly send the
 output of `filter` to the input of `select`:

 ``` {r}
@@ -1131,12 +1133,12 @@ as the first argument for the function that comes after it.
 Therefore you do not specify the first argument in that function call.
 In the code above,
 the first line is just the `tidy_lang` data frame with a pipe.
-The pipe passes the left hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
+The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
 so in the `filter` function you only see the second argument (and beyond).
 Then again after `filter` there is a pipe, which passes the result of the `filter` step
 to the first argument of the `select` function.
 As you can see, both of these approaches&mdash;with and without pipes&mdash;give us the same output, but the second
-approach is more clear and readable.
+approach is clearer and more readable.

 ### Using `|>` with more than two functions

@@ -1152,8 +1154,7 @@ from smallest to largest.
 As we saw in Chapter \@ref(intro),
 we can use the `tidyverse` `arrange` function \index{arrange}
 to order the rows in the data frame by the values of one or more columns.
-Here we pass the column name `most_at_home` to arrange
-to order the data frame rows by the values in that column, in ascending order.
+Here we pass the column name `most_at_home` to arrange the data frame rows by the values in that column, in ascending order.

 ``` {r}
 large_region_lang <- filter(tidy_lang, most_at_home > 10000) |>
@@ -1237,13 +1238,13 @@ lang_summary <- summarize(region_lang,
 max_most_at_home = max(most_at_home))
 ```

-From this we see that there are some languages in the data set the no one speaks
+From this we see that there are some languages in the data set that no one speaks
 as their primary language at home. We also see that the most commonly spoken
 primary language at home is spoken by
 `r format(lang_summary$max_most_at_home[1], scientific = FALSE, big.mark = ",")`
 people.

-#### Calculating summary statistics when there are `NA`s {-}
+### Calculating summary statistics when there are `NA`s

 In data frames in R, the value `NA` is often used to denote missing data.
 Many of the base R statistical summary functions
@@ -1272,7 +1273,7 @@ region_lang_na[["most_at_home"]][1] <- NA
 region_lang_na
 ```

-Now if we apply our summarize function as above,
+Now if we apply our `summarize` function as above,
 we see that no longer get the minimum and maximum returned,
 but just an `NA` instead!

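A brief sketch of this `NA` behavior, and of the usual `na.rm = TRUE` remedy that the chapter goes on to use, is below; the one-column tibble is made up for the example:

```r
library(tidyverse)

# A made-up data frame with one missing value in the column of interest.
toy <- tibble(most_at_home = c(NA, 50, 265, 520))

# Without na.rm, min and max propagate the NA.
summarize(toy, min_val = min(most_at_home), max_val = max(most_at_home))

# With na.rm = TRUE, the NA is dropped before the summary is computed.
summarize(toy,
          min_val = min(most_at_home, na.rm = TRUE),
          max_val = max(most_at_home, na.rm = TRUE))
```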
@@ -1299,7 +1300,7 @@ For example, we can use `group_by` to group the regions of the `tidy_lang` data
 reporting the language as the primary language at home
 for each of the regions in the data set.

-(ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The grey, blue and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.
+(ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.

 ```{r summarize-groupby, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:summarize-groupby)", fig.retina = 2, out.width = "100%"}
 image_read("img/summarize/summarize.002.jpeg") |>
@@ -1320,7 +1321,7 @@ group_by(region_lang, region) |>
 ```

 Notice that `group_by` on its own doesn't change the way the data looks.
-In the output below the grouped data set looks the same,
+In the output below, the grouped data set looks the same,
 and it doesn't *appear* to be grouped by `region`.
 Instead, `group_by` simply changes how other functions work with the data,
 as we saw with `summarize` above.
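For a concrete feel for this pattern, here is a minimal `group_by` + `summarize` sketch on a made-up data frame; the column names mirror the chapter's but the rows and numbers are invented:

```r
library(tidyverse)

# A few made-up rows; the real region_lang data frame has many more.
region_lang_toy <- tribble(
  ~region,    ~language,  ~most_at_home,
  "Toronto",  "English",        3800000,
  "Toronto",  "French",           30000,
  "Montréal", "English",         620000,
  "Montréal", "French",         2669195
)

# Grouping alone does not change how the data looks; it changes what
# summarize does: one output row per region instead of one overall.
region_lang_toy |>
  group_by(region) |>
  summarize(max_most_at_home = max(most_at_home))
```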
@@ -1366,7 +1367,7 @@ region_lang |>
 summarize(across(mother_tongue:lang_known, max))
 ```

-> **Note:** Similarly to when we use base R statistical summary functions
+> **Note:** Similar to when we use base R statistical summary functions
 > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
 > the use of the `summarize` + `across` functions paired
 > with base R statistical summary functions
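A hedged sketch of `across` with an `NA`-containing column is below; the tibble is invented, and the anonymous-function shorthand `\(x)` assumes R 4.1 or later:

```r
library(tidyverse)

# A made-up data frame with an NA, standing in for the count columns.
toy <- tibble(mother_tongue = c(NA, 50, 265),
              most_at_home  = c(15, 45, 520))

# across applies the same summary to every selected column; the anonymous
# function lets us pass na.rm = TRUE to max for each column.
toy |>
  summarize(across(everything(), \(x) max(x, na.rm = TRUE)))
```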
@@ -1392,7 +1393,8 @@ Let's again find the maximum value of each column of the
 `region_lang` data frame, but using `map` with the `max` function this time.
 `map` takes two arguments:
 an object (a vector, data frame or list) that you want to apply the function to,
-and the function that you would like to apply to each column.
+and the function that you would like to apply to each column.
+
 Note that `map` does not have an argument
 to specify *which* columns to apply the function to.
 Therefore, we will do this before calling `map` using the `select` function.
@@ -1422,7 +1424,7 @@ So what do we do? Should we convert this to a data frame? We could, but a
 simpler alternative is to just use a different `map` function. There
 are quite a few to choose from, they all work similarly, but
 their name reflects the type of output you want from the mapping operation.
-Table \@ref(tab:map-table) lists the commonly-used `map` functions as well
+Table \@ref(tab:map-table) lists the commonly used `map` functions as well
 as their output type. \index{map!map\_\* functions}

 Table: (#tab:map-table) The `map` functions in R.
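To make the "output type" idea concrete, here is a small sketch comparing three of the variants on a made-up all-numeric tibble (assuming `purrr` is loaded via the `tidyverse`):

```r
library(tidyverse)

# A made-up all-numeric data frame.
counts <- tibble(mother_tongue = c(50, 265), most_at_home = c(15, 520))

map(counts, max)       # returns a list, one element per column
map_dbl(counts, max)   # returns a named numeric (double) vector
map_dfr(counts, max)   # returns a one-row data frame (tibble)
```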
@@ -1446,14 +1448,16 @@ region_lang |>
 map_dfr(max)
 ```

-> **Note:** Similarly to when we use base R statistical summary functions
+\newpage
+
+> **Note:** Similar to when we use base R statistical summary functions
 > (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
 > `map` functions paired with base R statistical summary functions
 > also return `NA` values when we apply them to columns that
 > contain `NA` values. \index{missing data}
 >
 > To avoid this, again we need to add the argument `na.rm = TRUE`.
-> When we use this with `map` we do this by adding a `,`
+> When we use this with `map`, we do this by adding a `,`
 > and then `na.rm = TRUE` after specifying the function, as illustrated below:
 >
 > ``` {r}
@@ -1545,7 +1549,7 @@ region_lang |>
 Now we apply `rowwise` before `mutate`, to tell R that we would like
 our mutate function to be applied across, and within, a row,
 as opposed to being applied on a column
-(which is the default behaviour of `mutate`):
+(which is the default behavior of `mutate`):

 ```{r}
 region_lang |>
@@ -1629,7 +1633,7 @@ found in Chapter \@ref(move-to-your-own-machine).
 To learn more about these functions and meet a few more useful
 functions, we recommend you check out [this
 chapter](http://stat545.com/block010_dplyr-end-single-table.html#where-were-we)
-of the Data wrangling, exploration, and analysis with R book.
+of the data wrangling, exploration, and analysis with R book.
 - The [`dplyr` page on the tidyverse website](https://dplyr.tidyverse.org/) is
 another resource to learn more about the functions in this
 chapter, the full set of arguments you can use, and other related functions.
@@ -1638,7 +1642,7 @@ found in Chapter \@ref(move-to-your-own-machine).
 - Check out the [tidyselect
 page](https://tidyselect.r-lib.org/reference/select_helpers.html) for a
 comprehensive list of `select` helpers.
-- [R for Data Science](https://r4ds.had.co.nz/) has a few chapters related to
+- [*R for Data Science*](https://r4ds.had.co.nz/) has a few chapters related to
 data wrangling that go into more depth than this book. For example, the
 [tidy data](https://r4ds.had.co.nz/tidy-data.html) chapter covers tidy data,
 `pivot_longer`/`pivot_wider` and `separate`, but also covers missing values
