Commit b8677f7

Merge pull request #387 from UBC-DSCI/wrangling_edit
wrangling edit pass
2 parents ea1c317 + f84040e commit b8677f7

File tree: wrangling.Rmd (1 file changed, +50 -46 lines)


wrangling.Rmd

Lines changed: 50 additions & 46 deletions
@@ -19,14 +19,14 @@ application, providing more practice working through a whole case study.

 ## Chapter learning objectives

-By the end of the chapter, readers will be able to:
-
-- define the term "tidy data"
-- discuss the advantages of storing data in a tidy data format
-- define what vectors, lists, and data frames are in R, and describe how they relate to
-each other
-- describe the common types of data in R and their uses
-- recall and use the following functions for their
+By the end of the chapter, readers will be able to do the following:
+
+- Define the term "tidy data".
+- Discuss the advantages of storing data in a tidy data format.
+- Define what vectors, lists, and data frames are in R, and describe how they relate to
+each other.
+- Describe the common types of data in R and their uses.
+- Recall and use the following functions for their
 intended data wrangling tasks:
 - `across`
 - `c`
@@ -41,7 +41,7 @@ By the end of the chapter, readers will be able to:
 - `rowwise`
 - `separate`
 - `summarize`
-- recall and use the following operators for their
+- Recall and use the following operators for their
 intended data wrangling tasks:
 - `==`
 - `%in%`
@@ -75,8 +75,8 @@ that is designed to store observations, variables, and their values.
 Most commonly, each column in a data frame corresponds to a variable,
 and each row corresponds to an observation. For example, Figure
 \@ref(fig:02-obs) displays a data set of city populations. Here, the variables
-are "region, year, population;" each of these are properties that can be
-collected or measured. The first observation is "Toronto, 2016, 2235145;"
+are "region, year, population"; each of these are properties that can be
+collected or measured. The first observation is "Toronto, 2016, 2235145";
 these are the values that the three variables take for the first entity in the
 data set. There are 13 entities in the data set in total, corresponding to the
 13 rows in Figure \@ref(fig:02-obs).
@@ -420,7 +420,7 @@ before the maximum can be computed.
 In comparison, if the data were tidy,
 all we would have to do is compute the maximum value for the commuter column.
 To reshape this untidy data set to a tidy (and in this case, wider) format,
-we need to create a column called "population", "commuters", and "incorporated."
+we need to create columns called "population", "commuters", and "incorporated."
 This is illustrated in the right table of Figure \@ref(fig:long-to-wide).

 ``` {r long-to-wide, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Going from long to wide data.", fig.retina = 2, out.width = "100%"}
@@ -486,8 +486,8 @@ The data above is now tidy! We can go through the three criteria again to check
 that this data is a tidy data set.

 1. All the statistical variables are their own columns in the data frame (i.e.,
-`most_at_home`, and `most_at_work`) have been separated into their own
-columns in the data frame.
+`most_at_home`, and `most_at_work` have been separated into their own
+columns in the data frame).
 2. Each observation, (i.e., each language in a region) is in a single row.
 3. Each value is a single cell (i.e., its row, column position in the data
 frame is not shared with another value).
@@ -567,7 +567,7 @@ analyze. But we aren't done yet! Notice in the table above that the word
 `<chr>` appears beneath each of the column names. The word under the column name
 indicates the data type of each column. Here all of our variables are
 "character" data types. Recall, character data types are letter(s) or digits(s)
-surrounded by quotes. In the previous example in section \@ref(pivot-wider), the
+surrounded by quotes. In the previous example in Section \@ref(pivot-wider), the
 `most_at_home` and `most_at_work` variables were `<dbl>` (double)&mdash;you can
 verify this by looking at the tables in the previous sections&mdash;which is a type
 of numeric data. This change is due to the delimiter (`/`) when we read in this
@@ -773,6 +773,8 @@ five_cities <- filter(region_data,
 five_cities
 ```

+\newpage
+
 > **Note:** What's the difference between `==` and `%in%`? Suppose we have two
 > vectors, `vectorA` and `vectorB`. If you type `vectorA == vectorB` into R it
 > will compare the vectors element by element. R checks if the first element of
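For readers who want to see this behavior concretely, here is a minimal R sketch of the distinction the note describes, using two made-up vectors (not data from the chapter):

```r
# Two made-up character vectors for illustration.
vectorA <- c("Toronto", "Montréal", "Vancouver")
vectorB <- c("Vancouver", "Toronto", "Montréal")

# `==` compares element by element (first with first, second with second, ...),
# so the same cities listed in a different order do not match.
vectorA == vectorB
#> [1] FALSE FALSE FALSE

# `%in%` asks, for each element of vectorA, whether it appears anywhere in vectorB.
vectorA %in% vectorB
#> [1] TRUE TRUE TRUE
```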
@@ -795,20 +797,20 @@ census_popn <- 35151728
 most_french <- 2669195
 ```

-We saw in section \@ref(filter-and) that
+We saw in Section \@ref(filter-and) that
 `r format(most_french, scientific = FALSE, big.mark = ",")` people reported
 speaking French in Montréal as their primary language at home.
 If we are interested in finding the official languages in regions
 with higher numbers of people who speak it as their primary language at home
-compared to French in Montréal then we can use `filter` to obtain rows
+compared to French in Montréal, then we can use `filter` to obtain rows
 where the value of `most_at_home` is greater than
 `r format(most_french, scientific = FALSE, big.mark = ",")`.

 ``` {r}
 filter(official_langs, most_at_home > 2669195)
 ```

-`filter` returns a data frame with only one row indicating that when
+`filter` returns a data frame with only one row, indicating that when
 considering the official languages,
 only English in Toronto is reported by more people
 as their primary language at home
@@ -818,7 +820,7 @@ than French in Montréal according to the 2016 Canadian census.

 ### Using `mutate` to modify columns
 In Section \@ref(separate),
-when we first read in the `"region_lang_top5_cities_messy.csv"` data
+when we first read in the `"region_lang_top5_cities_messy.csv"` data,
 all of the variables were "character" data types. \index{mutate}
 During the tidying process,
 we used the `convert` argument from the `separate` function
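As a hedged illustration of the kind of conversion this passage is about, the sketch below uses `mutate` with `as.numeric` on a small hand-made tibble; the column names and values are assumptions for the example, not the chapter's data set:

```r
library(tidyverse)

# A tiny hand-made data frame whose count column was read in as character.
langs <- tibble(category     = c("Official languages", "Aboriginal languages"),
                most_at_home = c("3836770", "13260"))

# mutate overwrites the column with a numeric version of itself.
langs <- mutate(langs, most_at_home = as.numeric(most_at_home))
langs
```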
@@ -905,7 +907,7 @@ number, we need context. In particular, how many people were in Toronto when
 this data was collected? From the 2016 Canadian census profile, the population
 of Toronto was reported to be
 `r format(toronto_popn, scientific = FALSE, big.mark = ",")` people.
-The number of people who report that English as their primary language at home
+The number of people who report that English is their primary language at home
 is much more meaningful when we report it in this context.
 We can even go a step further and transform this count to a relative frequency
 or proportion.
@@ -923,9 +925,9 @@ for our five cities of focus in this chapter.
 To accomplish this, we will need to do two tasks
 beforehand:

-1. create a vector containing the population values for our cities
-2. filter the `official_langs` data frame
-so that we only keep the rows where the language is English
+1. Create a vector containing the population values for our cities.
+2. Filter the `official_langs` data frame
+so that we only keep the rows where the language is English.

 To create a vector containing the population values for our cities
 (Toronto, Montréal, Vancouver, Calgary, Edmonton),
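A rough sketch of those two preparation steps is below. The `official_langs` rows, the column names, and the population figures are stand-ins invented for the example (only two cities are shown), so treat it as an outline rather than the chapter's actual code:

```r
library(tidyverse)

# A small stand-in for the chapter's official_langs data frame.
official_langs <- tribble(
  ~region,    ~language,  ~most_at_home,
  "Toronto",  "English",        3800000,
  "Toronto",  "French",           30000,
  "Montréal", "English",         620000,
  "Montréal", "French",         2669195
)

# Step 1: a vector of (placeholder) city populations; the order must match
# the order of the cities in the filtered data frame below.
city_pops <- c(5900000, 4100000)

# Step 2: keep only the rows where the language is English.
english_langs <- filter(official_langs, language == "English")
english_langs
```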
@@ -963,7 +965,7 @@ same order as the cities were listed in the `english_langs` data frame.
 This is because R will perform the division computation we did by dividing
 each element of the `most_at_home` column by each element of the
 `city_pops` vector, matching them up by position.
-Failing to do this would have resulted in the incorrect math to be performed.
+Failing to do this would have resulted in the incorrect math being performed.

 > **Note:** In more advanced data wrangling,
 > one might solve this problem in a less error-prone way though using
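The positional matching described above can be seen with two plain vectors; this tiny sketch uses made-up numbers:

```r
# Made-up counts and populations, in the same (matching) city order.
counts <- c(3800000, 620000)
pops   <- c(5900000, 4100000)

# R divides element by element: the first count by the first population,
# the second count by the second population, and so on.
counts / pops   # roughly 0.64 and 0.15
```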
@@ -1010,7 +1012,7 @@ frame. The basic ways of doing this can become quickly unreadable if there are
 many steps. For example, suppose we need to perform three operations on a data
 frame called `data`: \index{pipe}\index{aaapipesymb@\vert{}>|see{pipe}}

-1) add a new column `new_col` that is double another `old_col`
+1) add a new column `new_col` that is double another `old_col`,
 2) filter for rows where another column, `other_col`, is more than 5, and
 3) select only the new column `new_col` for those rows.

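A sketch of what those three steps can look like when chained with the pipe is shown below; the `data` tibble and its values are made up for illustration:

```r
library(tidyverse)

# A made-up data frame with the column names used in the description.
data <- tibble(old_col = 1:4, other_col = c(2, 6, 7, 3))

output <- data |>
  mutate(new_col = 2 * old_col) |>  # 1) add new_col, double of old_col
  filter(other_col > 5) |>          # 2) keep rows where other_col is more than 5
  select(new_col)                   # 3) keep only the new column
output
```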
@@ -1060,7 +1062,7 @@ output <- data |>

 > **Note:** You might also have noticed that we split the function calls across
 > lines after the pipe, similar to when we did this earlier in the chapter
-> for long function calls. Again this is allowed and recommended, especially when
+> for long function calls. Again, this is allowed and recommended, especially when
 > the piped function calls create a long line of code. Doing this makes
 > your code more readable. When you do this, it is important to end each line
 > with the pipe operator `|>` to tell R that your code is continuing onto the
@@ -1074,9 +1076,9 @@ output <- data |>
 > (which in turn imports the `magrittr` R package).
 > There are some other differences between `%>%` and `|>` related to
 > more advanced R uses, such as sharing and distributing code as R packages,
-> however these are beyond the scope of this textbook.
+> however, these are beyond the scope of this textbook.
 > We have this note in the book to make the reader aware that `%>%` exists
-> as it still commonly used in data analysis code and in many data science
+> as it is still commonly used in data analysis code and in many data science
 > books and other resources.
 > In most cases these two pipes are interchangeable and either can be used.

@@ -1112,7 +1114,7 @@ van_data_selected

 Although this is valid code, there is a more readable approach we could take by
 using the pipe, `|>`. With the pipe, we do not need to create an intermediate
-object to store the output from `filter`. Instead we can directly send the
+object to store the output from `filter`. Instead, we can directly send the
 output of `filter` to the input of `select`:

 ``` {r}
@@ -1131,12 +1133,12 @@ as the first argument for the function that comes after it.
 Therefore you do not specify the first argument in that function call.
 In the code above,
 the first line is just the `tidy_lang` data frame with a pipe.
-The pipe passes the left hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
+The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
 so in the `filter` function you only see the second argument (and beyond).
 Then again after `filter` there is a pipe, which passes the result of the `filter` step
 to the first argument of the `select` function.
 As you can see, both of these approaches&mdash;with and without pipes&mdash;give us the same output, but the second
-approach is more clear and readable.
+approach is clearer and more readable.

 ### Using `|>` with more than two functions

@@ -1152,8 +1154,7 @@ from smallest to largest.
 As we saw in Chapter \@ref(intro),
 we can use the `tidyverse` `arrange` function \index{arrange}
 to order the rows in the data frame by the values of one or more columns.
-Here we pass the column name `most_at_home` to arrange
-to order the data frame rows by the values in that column, in ascending order.
+Here we pass the column name `most_at_home` to arrange the data frame rows by the values in that column, in ascending order.

 ``` {r}
 large_region_lang <- filter(tidy_lang, most_at_home > 10000) |>
@@ -1237,13 +1238,13 @@ lang_summary <- summarize(region_lang,
 max_most_at_home = max(most_at_home))
 ```

-From this we see that there are some languages in the data set the no one speaks
+From this we see that there are some languages in the data set that no one speaks
 as their primary language at home. We also see that the most commonly spoken
 primary language at home is spoken by
 `r format(lang_summary$max_most_at_home[1], scientific = FALSE, big.mark = ",")`
 people.

-#### Calculating summary statistics when there are `NA`s {-}
+### Calculating summary statistics when there are `NA`s

 In data frames in R, the value `NA` is often used to denote missing data.
 Many of the base R statistical summary functions
@@ -1272,7 +1273,7 @@ region_lang_na[["most_at_home"]][1] <- NA
 region_lang_na
 ```

-Now if we apply our summarize function as above,
+Now if we apply our `summarize` function as above,
 we see that no longer get the minimum and maximum returned,
 but just an `NA` instead!

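A brief sketch of this `NA` behavior, and of the usual `na.rm = TRUE` remedy that the chapter goes on to use, is below; the one-column tibble is made up for the example:

```r
library(tidyverse)

# A made-up data frame with one missing value in the column of interest.
toy <- tibble(most_at_home = c(NA, 50, 265, 520))

# Without na.rm, min and max propagate the NA.
summarize(toy, min_val = min(most_at_home), max_val = max(most_at_home))

# With na.rm = TRUE, the NA is dropped before the summary is computed.
summarize(toy,
          min_val = min(most_at_home, na.rm = TRUE),
          max_val = max(most_at_home, na.rm = TRUE))
```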
@@ -1299,7 +1300,7 @@ For example, we can use `group_by` to group the regions of the `tidy_lang` data
 reporting the language as the primary language at home
 for each of the regions in the data set.

-(ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The grey, blue and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.
+(ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.

 ```{r summarize-groupby, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:summarize-groupby)", fig.retina = 2, out.width = "100%"}
 image_read("img/summarize/summarize.002.jpeg") |>
@@ -1320,7 +1321,7 @@ group_by(region_lang, region) |>
 ```

 Notice that `group_by` on its own doesn't change the way the data looks.
-In the output below the grouped data set looks the same,
+In the output below, the grouped data set looks the same,
 and it doesn't *appear* to be grouped by `region`.
 Instead, `group_by` simply changes how other functions work with the data,
 as we saw with `summarize` above.
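For a concrete feel for this pattern, here is a minimal `group_by` + `summarize` sketch on a made-up data frame; the column names mirror the chapter's but the rows and numbers are invented:

```r
library(tidyverse)

# A few made-up rows; the real region_lang data frame has many more.
region_lang_toy <- tribble(
  ~region,    ~language,  ~most_at_home,
  "Toronto",  "English",        3800000,
  "Toronto",  "French",           30000,
  "Montréal", "English",         620000,
  "Montréal", "French",         2669195
)

# Grouping alone does not change how the data looks; it changes what
# summarize does: one output row per region instead of one overall.
region_lang_toy |>
  group_by(region) |>
  summarize(max_most_at_home = max(most_at_home))
```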
@@ -1366,7 +1367,7 @@ region_lang |>
 summarize(across(mother_tongue:lang_known, max))
 ```

-> **Note:** Similarly to when we use base R statistical summary functions
+> **Note:** Similar to when we use base R statistical summary functions
 > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
 > the use of the `summarize` + `across` functions paired
 > with base R statistical summary functions
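A hedged sketch of `across` with an `NA`-containing column is below; the tibble is invented, and the anonymous-function shorthand `\(x)` assumes R 4.1 or later:

```r
library(tidyverse)

# A made-up data frame with an NA, standing in for the count columns.
toy <- tibble(mother_tongue = c(NA, 50, 265),
              most_at_home  = c(15, 45, 520))

# across applies the same summary to every selected column; the anonymous
# function lets us pass na.rm = TRUE to max for each column.
toy |>
  summarize(across(everything(), \(x) max(x, na.rm = TRUE)))
```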
@@ -1392,7 +1393,8 @@ Let's again find the maximum value of each column of the
 `region_lang` data frame, but using `map` with the `max` function this time.
 `map` takes two arguments:
 an object (a vector, data frame or list) that you want to apply the function to,
-and the function that you would like to apply to each column.
+and the function that you would like to apply to each column.
+
 Note that `map` does not have an argument
 to specify *which* columns to apply the function to.
 Therefore, we will do this before calling `map` using the `select` function.
@@ -1422,7 +1424,7 @@ So what do we do? Should we convert this to a data frame? We could, but a
 simpler alternative is to just use a different `map` function. There
 are quite a few to choose from, they all work similarly, but
 their name reflects the type of output you want from the mapping operation.
-Table \@ref(tab:map-table) lists the commonly-used `map` functions as well
+Table \@ref(tab:map-table) lists the commonly used `map` functions as well
 as their output type. \index{map!map\_\* functions}

 Table: (#tab:map-table) The `map` functions in R.
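To make the "output type" idea concrete, here is a small sketch comparing three of the variants on a made-up all-numeric tibble (assuming `purrr` is loaded via the `tidyverse`):

```r
library(tidyverse)

# A made-up all-numeric data frame.
counts <- tibble(mother_tongue = c(50, 265), most_at_home = c(15, 520))

map(counts, max)       # returns a list, one element per column
map_dbl(counts, max)   # returns a named numeric (double) vector
map_dfr(counts, max)   # returns a one-row data frame (tibble)
```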
@@ -1446,14 +1448,16 @@ region_lang |>
 map_dfr(max)
 ```

-> **Note:** Similarly to when we use base R statistical summary functions
+\newpage
+
+> **Note:** Similar to when we use base R statistical summary functions
 > (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
 > `map` functions paired with base R statistical summary functions
 > also return `NA` values when we apply them to columns that
 > contain `NA` values. \index{missing data}
 >
 > To avoid this, again we need to add the argument `na.rm = TRUE`.
-> When we use this with `map` we do this by adding a `,`
+> When we use this with `map`, we do this by adding a `,`
 > and then `na.rm = TRUE` after specifying the function, as illustrated below:
 >
 > ``` {r}
@@ -1545,7 +1549,7 @@ region_lang |>
 Now we apply `rowwise` before `mutate`, to tell R that we would like
 our mutate function to be applied across, and within, a row,
 as opposed to being applied on a column
-(which is the default behaviour of `mutate`):
+(which is the default behavior of `mutate`):

 ```{r}
 region_lang |>
@@ -1629,7 +1633,7 @@ found in Chapter \@ref(move-to-your-own-machine).
 To learn more about these functions and meet a few more useful
 functions, we recommend you check out [this
 chapter](http://stat545.com/block010_dplyr-end-single-table.html#where-were-we)
-of the Data wrangling, exploration, and analysis with R book.
+of the data wrangling, exploration, and analysis with R book.
 - The [`dplyr` page on the tidyverse website](https://dplyr.tidyverse.org/) is
 another resource to learn more about the functions in this
 chapter, the full set of arguments you can use, and other related functions.
@@ -1638,7 +1642,7 @@ found in Chapter \@ref(move-to-your-own-machine).
 - Check out the [tidyselect
 page](https://tidyselect.r-lib.org/reference/select_helpers.html) for a
 comprehensive list of `select` helpers.
-- [R for Data Science](https://r4ds.had.co.nz/) has a few chapters related to
+- [*R for Data Science*](https://r4ds.had.co.nz/) has a few chapters related to
 data wrangling that go into more depth than this book. For example, the
 [tidy data](https://r4ds.had.co.nz/tidy-data.html) chapter covers tidy data,
 `pivot_longer`/`pivot_wider` and `separate`, but also covers missing values
