Merge pull request #383 from UBC-DSCI/intro

trevorcampbell · web-flow · commit b92e241a3cca · 2021-12-02T20:31:13.000-08:00
intro copyedit pass
diff --git a/intro.Rmd b/intro.Rmd
@@ -20,15 +20,15 @@ with data science!
 
 ## Chapter learning objectives
 
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
-- identify the different types of data analysis question and categorize a question into the correct type
-- load the `tidyverse` package into R
-- read tabular data with `read_csv`
-- use `?` to access help and documentation tools in R
-- create new variables and objects in R using the assignment symbol
-- create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`
-- visualize data with a `ggplot` bar plot
+- Identify the different types of data analysis question and categorize a question into the correct type.
+- Load the `tidyverse` package into R.
+- Read tabular data with `read_csv`.
+- Use `?` to access help and documentation tools in R.
+- Create new variables and objects in R using the assignment symbol.
+- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
+- Visualize data with a `ggplot` bar plot.
 
 ## Canadian languages data set
 
@@ -42,11 +42,10 @@ individual learns in childhood) in Canadian residential schools. Colonizers
 also renamed places they had "discovered" [@wilson2018].  Acts such as these
 have significantly harmed the continuity of Indigenous languages in Canada, and
 some languages are considered "endangered" as few people report speaking them. 
-To learn more, please see Canadian Geographic's article on 
-[Mapping Indigenous languages in Canada](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
-[@walker2017], [They Came for the Children: Canada, Aboriginal peoples, and Residential Schools](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf) [@children2012] 
-and the Truth and Reconciliation Commission of 
-Canada's [Calls to Action](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf) [@calls2015].
+To learn more, please see *Canadian Geographic*'s article, ["Mapping Indigenous Languages in Canada"](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
+[@walker2017], [*They Came for the Children: Canada, Aboriginal peoples, and Residential Schools*](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf) [@children2012] 
+and the *Truth and Reconciliation Commission of 
+Canada's* [*Calls to Action*](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf) [@calls2015].
 
 The data set we will study in this chapter is taken from 
 [the {canlang} R data package](https://ttimbers.github.io/canlang/) [@timbers2020canlang], which has
@@ -99,7 +98,7 @@ Table: (\#tab:questions-table) Types of data analysis question [@leek2015questio
 |Question type|    Description         |     Example        |
 |-------------|------------------------|--------------------|
 | Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
-| Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
+| Exploratory | A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
 | Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
 | Inferential | A question that looks for patterns, trends, or relationships in a single data set **and** also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
 | Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
@@ -114,29 +113,29 @@ In particular, you will learn how to apply the following analysis tools:
 Summarization is most often used to answer descriptive questions,
 and can occasionally help with answering exploratory questions.
 For example, you might use summarization to answer the following question: 
-*what is the average race time for runners in this data set?*
+*What is the average race time for runners in this data set?*
 Tools for summarization are covered in detail in Chapters \@ref(reading)
 and \@ref(wrangling), but appear regularly throughout the text.
 2. **Visualization:** \index{visualization!overview} plotting data graphically. 
 Visualization is typically used to answer descriptive and exploratory questions,
 but plays a critical supporting role in answering all of the types of question in Table \@ref(tab:questions-table).
 For example, you might use visualization to answer the following question:
-*is there any relationship between race time and age for runners in this data set?* 
+*Is there any relationship between race time and age for runners in this data set?* 
 This is covered in detail in Chapter \@ref(viz), but again appears regularly throughout the book.
 3. **Classification:** \index{classification!overview} predicting a class or category for a new observation.
 Classification is used to answer predictive questions.
 For example, you might use classification to answer the following question:
-*given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
+*Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
 Classification is covered in Chapters \@ref(classification) and \@ref(classification2).
 4. **Regression:**  \index{regression!overview} predicting a quantitative value for a new observation. 
 Regression is also used to answer predictive questions.
 For example, you might use regression to answer the following question:
-*what will be the race time for a 20-year-old runner who weighs 50kg?*
+*What will be the race time for a 20-year-old runner who weighs 50kg?*
 Regression is covered in Chapters \@ref(regression1) and \@ref(regression2).
-5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabelled subgroups in a
-dataset. Clustering is often used to answer exploratory questions.
+5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabeled subgroups in a
+data set. Clustering is often used to answer exploratory questions.
 For example, you might use clustering to answer the following question:
-*what products are commonly bought together on Amazon?*
+*What products are commonly bought together on Amazon?*
 Clustering is covered in Chapter \@ref(clustering).
 6. **Estimation:**  \index{estimation!overview} taking measurements for a small number of items from a large group 
  and making a good guess for the average or proportion for the large group. Estimation 
@@ -231,8 +230,8 @@ library(tidyverse)
 > line. These are examples of *messages* in R, which give the user more
 > information that might be handy to know. The `Attaching packages` message is
 > natural when loading `tidyverse`, since `tidyverse` actually automatically
-> causes other packages to be imported too, such as `dplyr`.  In the future
-> when we load `tidyverse` in this book we will silence these messages to help
+> causes other packages to be imported too, such as `dplyr`.  In the future,
+> when we load `tidyverse` in this book, we will silence these messages to help
 > with the readability of the book.  The `Conflicts` message is also totally normal
 > in this circumstance.  This message tells you if functions from different
 > packages share the same name, which is confusing to R.  For example, in this
@@ -301,8 +300,8 @@ Note that when
 we name something in R using the assignment symbol, `<-`, 
 we do not need to surround the name we are creating  with quotes. This is 
 because we are formally telling R that this special word denotes
-the value of whatever is on the right hand side.
-Only characters and words that act as *values* on the right hand side of the assignment
+the value of whatever is on the right-hand side.
+Only characters and words that act as *values* on the right-hand side of the assignment
 symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;need 
 to be surrounded by quotes.
 
@@ -327,14 +326,14 @@ Error: unexpected assignment in "na+me <-"
 ```
 
 There are certain conventions for naming objects in R. When naming \index{object!naming convention} an object we
-suggest using only lower case letters, numbers and underscores `_` to separate
+suggest using only lowercase letters, numbers and underscores `_` to separate
 the words in a name.  R is case sensitive, which means that `Letter` and
 `letter` would be two different objects in R.  You should also try to give your
 objects meaningful names.  For instance, you *can* name a data frame `x`.
 However, using more meaningful terms, such as `language_data`, will help you
 remember what each name in your code represents.  We recommend following the
-Tidyverse naming conventions outlined in the [Tidyverse Style
-Guide](https://principles.tidyverse.org/names-attribute.html#universal-names)
+Tidyverse naming conventions outlined in the [*Tidyverse Style
+Guide*](https://principles.tidyverse.org/names-attribute.html#universal-names)
 [@tidyversestyleguide].  Let's now use the assignment symbol to give the name
 `can_lang` to the 2016 Canadian census language data frame that we get from
 `read_csv`. 
@@ -386,11 +385,11 @@ For example, in our analysis, we are interested in keeping only languages in the
 "Aboriginal languages" higher-level category. We can use 
 the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
 of the `category` column with the value `"Aboriginal languages"`; you will learn about
-many other kinds of logical statement in Chapter \@ref(wrangling).  Similar to
-when we loaded the data file and put quotes around the filename, here we need
+many other kinds of logical statements in Chapter \@ref(wrangling).  Similar to
+when we loaded the data file and put quotes around the file name, here we need
 to put quotes around `"Aboriginal languages"`. Using quotes tells R that this
 is a string *value* \index{string} and not one of the special words that make up the R
-programming language, nor one of the names we have given to data frames in the
+programming language, or one of the names we have given to data frames in the
 code we have already written. 
 
 (ref:img-filter) Syntax for the `filter` function.
@@ -535,7 +534,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
 > However, there *are* a small number of situations in which you can have a
 > single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
 > end with a `+` symbol, \index{aaaplussymb@$+$|see{ggplot (add layer)}} and so it keeps reading the next line to figure out
-> what the right hand side of the `+` symbol should be.  We could, of course,
+> what the right-hand side of the `+` symbol should be.  We could, of course,
 > put all of the added layers on one line of code, but splitting them across
 > multiple lines helps a lot with code readability. \index{multi-line expression}
 
@@ -549,7 +548,7 @@ column names do not have enough information about the variable in the column.
 We really should replace this default with a more informative label. For the
 example above, R uses the column name `mother_tongue` as the label for the
 y axis, but most people will not know what that is. And even if they did, they
-will not know how we measure this variable, nor which group of people the
+will not know how we measured this variable, or the group of people on which the
 measurements were taken. An axis label that reads "Mother Tongue (Number of
 Canadian Residents)" would be much more informative.
 
@@ -560,7 +559,7 @@ use the `xlab` (short for x axis label) and `ylab` (short for y axis label) func
 to add layers where we specify meaningful
 and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
 words (e.g., `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
-`xlab` and `ylab`, we surround them with double-quotes. We can add many more
+`xlab` and `ylab`, we surround them with double quotation marks. We can add many more
 layers to format the plot further, and we will explore these in Chapter
 \@ref(viz).
 
@@ -604,7 +603,7 @@ ggplot(ten_lang, aes(x = mother_tongue,
 
 Figure \@ref(fig:barplot-mother-tongue-reorder) provides a very clear and well-organized
 answer to our original question; we can see what the ten most often reported Aboriginal languages
-were, according to the 2016 Candian census, and how many people speak each of them. For
+were, according to the 2016 Canadian census, and how many people speak each of them. For
 instance, we can see that the Aboriginal language most often reported was Cree
 n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
 
@@ -661,7 +660,7 @@ these steps in much more detail!
 
 There are many R functions in the `tidyverse` package (and beyond!), and 
 nobody can be expected to remember what every one of them does
-nor all of the arguments we have to give them. Fortunately R provides 
+or all of the arguments we have to give them. Fortunately, R provides 
 the `?` symbol, which 
 \index{aaaquestionmark@?|see{documentation}}
 \index{help|see{documentation}}
@@ -687,7 +686,10 @@ documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind
 is not written to *teach* you about a function; it is just there as a reference to *remind*
 you about the different arguments and usage of functions that you have already learned about elsewhere.
 
-```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="80%"}
+(ref:01-help) The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.
+
+
+```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:01-help)", fig.retina = 2, out.width="80%"}
 knitr::include_graphics("img/help-filter.png")
 ```