Commit b92e241

Merge pull request #383 from UBC-DSCI/intro
intro copyedit pass
2 parents: b30d8ef + 674fad8; commit: b92e241

1 file changed: 39 additions, 37 deletions

intro.Rmd

Lines changed: 39 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -20,15 +20,15 @@ with data science!
2020

2121
## Chapter learning objectives
2222

23-
By the end of the chapter, readers will be able to:
23+
By the end of the chapter, readers will be able to do the following:
2424

25-
- identify the different types of data analysis question and categorize a question into the correct type
26-
- load the `tidyverse` package into R
27-
- read tabular data with `read_csv`
28-
- use `?` to access help and documentation tools in R
29-
- create new variables and objects in R using the assignment symbol
30-
- create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`
31-
- visualize data with a `ggplot` bar plot
25+
- Identify the different types of data analysis question and categorize a question into the correct type.
26+
- Load the `tidyverse` package into R.
27+
- Read tabular data with `read_csv`.
28+
- Use `?` to access help and documentation tools in R.
29+
- Create new variables and objects in R using the assignment symbol.
30+
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
31+
- Visualize data with a `ggplot` bar plot.
3232

3333
## Canadian languages data set
3434

@@ -42,11 +42,10 @@ individual learns in childhood) in Canadian residential schools. Colonizers
4242
also renamed places they had "discovered" [@wilson2018]. Acts such as these
4343
have significantly harmed the continuity of Indigenous languages in Canada, and
4444
some languages are considered "endangered" as few people report speaking them.
45-
To learn more, please see Canadian Geographic's article on
46-
[Mapping Indigenous languages in Canada](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
47-
[@walker2017], [They Came for the Children: Canada, Aboriginal peoples, and Residential Schools](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf) [@children2012]
48-
and the Truth and Reconciliation Commission of
49-
Canada's [Calls to Action](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf) [@calls2015].
45+
To learn more, please see *Canadian Geographic*'s article, ["Mapping Indigenous Languages in Canada"](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
46+
[@walker2017], [*They Came for the Children: Canada, Aboriginal peoples, and Residential Schools*](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf) [@children2012]
47+
and the *Truth and Reconciliation Commission of
48+
Canada's* [*Calls to Action*](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf) [@calls2015].
5049

5150
The data set we will study in this chapter is taken from
5251
[the {canlang} R data package](https://ttimbers.github.io/canlang/) [@timbers2020canlang], which has
@@ -99,7 +98,7 @@ Table: (\#tab:questions-table) Types of data analysis question [@leek2015questio
9998
|Question type| Description | Example |
10099
|-------------|------------------------|--------------------|
101100
| Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
102-
| Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
101+
| Exploratory | A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
103102
| Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
104103
| Inferential | A question that looks for patterns, trends, or relationships in a single data set **and** also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
105104
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
@@ -114,29 +113,29 @@ In particular, you will learn how to apply the following analysis tools:
114113
Summarization is most often used to answer descriptive questions,
115114
and can occasionally help with answering exploratory questions.
116115
For example, you might use summarization to answer the following question:
117-
*what is the average race time for runners in this data set?*
116+
*What is the average race time for runners in this data set?*
118117
Tools for summarization are covered in detail in Chapters \@ref(reading)
119118
and \@ref(wrangling), but appear regularly throughout the text.
120119
2. **Visualization:** \index{visualization!overview} plotting data graphically.
121120
Visualization is typically used to answer descriptive and exploratory questions,
122121
but plays a critical supporting role in answering all of the types of question in Table \@ref(tab:questions-table).
123122
For example, you might use visualization to answer the following question:
124-
*is there any relationship between race time and age for runners in this data set?*
123+
*Is there any relationship between race time and age for runners in this data set?*
125124
This is covered in detail in Chapter \@ref(viz), but again appears regularly throughout the book.
126125
3. **Classification:** \index{classification!overview} predicting a class or category for a new observation.
127126
Classification is used to answer predictive questions.
128127
For example, you might use classification to answer the following question:
129-
*given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
128+
*Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
130129
Classification is covered in Chapters \@ref(classification) and \@ref(classification2).
131130
4. **Regression:** \index{regression!overview} predicting a quantitative value for a new observation.
132131
Regression is also used to answer predictive questions.
133132
For example, you might use regression to answer the following question:
134-
*what will be the race time for a 20-year-old runner who weighs 50kg?*
133+
*What will be the race time for a 20-year-old runner who weighs 50kg?*
135134
Regression is covered in Chapters \@ref(regression1) and \@ref(regression2).
136-
5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabelled subgroups in a
137-
dataset. Clustering is often used to answer exploratory questions.
135+
5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabeled subgroups in a
136+
data set. Clustering is often used to answer exploratory questions.
138137
For example, you might use clustering to answer the following question:
139-
*what products are commonly bought together on Amazon?*
138+
*What products are commonly bought together on Amazon?*
140139
Clustering is covered in Chapter \@ref(clustering).
141140
6. **Estimation:** \index{estimation!overview} taking measurements for a small number of items from a large group
142141
and making a good guess for the average or proportion for the large group. Estimation
@@ -231,8 +230,8 @@ library(tidyverse)
231230
> line. These are examples of *messages* in R, which give the user more
232231
> information that might be handy to know. The `Attaching packages` message is
233232
> natural when loading `tidyverse`, since `tidyverse` actually automatically
234-
> causes other packages to be imported too, such as `dplyr`. In the future
235-
> when we load `tidyverse` in this book we will silence these messages to help
233+
> causes other packages to be imported too, such as `dplyr`. In the future,
234+
> when we load `tidyverse` in this book, we will silence these messages to help
236235
> with the readability of the book. The `Conflicts` message is also totally normal
237236
> in this circumstance. This message tells you if functions from different
238237
> packages share the same name, which is confusing to R. For example, in this
@@ -301,8 +300,8 @@ Note that when
301300
we name something in R using the assignment symbol, `<-`,
302301
we do not need to surround the name we are creating with quotes. This is
303302
because we are formally telling R that this special word denotes
304-
the value of whatever is on the right hand side.
305-
Only characters and words that act as *values* on the right hand side of the assignment
303+
the value of whatever is on the right-hand side.
304+
Only characters and words that act as *values* on the right-hand side of the assignment
306305
symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;need
307306
to be surrounded by quotes.
308307

@@ -327,14 +326,14 @@ Error: unexpected assignment in "na+me <-"
327326
```
328327

329328
There are certain conventions for naming objects in R. When naming \index{object!naming convention} an object we
330-
suggest using only lower case letters, numbers and underscores `_` to separate
329+
suggest using only lowercase letters, numbers and underscores `_` to separate
331330
the words in a name. R is case sensitive, which means that `Letter` and
332331
`letter` would be two different objects in R. You should also try to give your
333332
objects meaningful names. For instance, you *can* name a data frame `x`.
334333
However, using more meaningful terms, such as `language_data`, will help you
335334
remember what each name in your code represents. We recommend following the
336-
Tidyverse naming conventions outlined in the [Tidyverse Style
337-
Guide](https://principles.tidyverse.org/names-attribute.html#universal-names)
335+
Tidyverse naming conventions outlined in the [*Tidyverse Style
336+
Guide*](https://principles.tidyverse.org/names-attribute.html#universal-names)
338337
[@tidyversestyleguide]. Let's now use the assignment symbol to give the name
339338
`can_lang` to the 2016 Canadian census language data frame that we get from
340339
`read_csv`.
@@ -386,11 +385,11 @@ For example, in our analysis, we are interested in keeping only languages in the
386385
"Aboriginal languages" higher-level category. We can use
387386
the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
388387
of the `category` column with the value `"Aboriginal languages"`; you will learn about
389-
many other kinds of logical statement in Chapter \@ref(wrangling). Similar to
390-
when we loaded the data file and put quotes around the filename, here we need
388+
many other kinds of logical statements in Chapter \@ref(wrangling). Similar to
389+
when we loaded the data file and put quotes around the file name, here we need
391390
to put quotes around `"Aboriginal languages"`. Using quotes tells R that this
392391
is a string *value* \index{string} and not one of the special words that make up R
393-
programming language, nor one of the names we have given to data frames in the
392+
programming language, or one of the names we have given to data frames in the
394393
code we have already written.
395394

396395
(ref:img-filter) Syntax for the `filter` function.
@@ -535,7 +534,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
535534
> However, there *are* a small number of situations in which you can have a
536535
> single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
537536
> end with a `+` symbol, \index{aaaplussymb@$+$|see{ggplot (add layer)}} and so it keeps reading the next line to figure out
538-
> what the right hand side of the `+` symbol should be. We could, of course,
537+
> what the right-hand side of the `+` symbol should be. We could, of course,
539538
> put all of the added layers on one line of code, but splitting them across
540539
> multiple lines helps a lot with code readability. \index{multi-line expression}
541540
@@ -549,7 +548,7 @@ column names do not have enough information about the variable in the column.
549548
We really should replace this default with a more informative label. For the
550549
example above, R uses the column name `mother_tongue` as the label for the
551550
y axis, but most people will not know what that is. And even if they did, they
552-
will not know how we measure this variable, nor which group of people the
551+
will not know how we measured this variable, or the group of people on which the
553552
measurements were taken. An axis label that reads "Mother Tongue (Number of
554553
Canadian Residents)" would be much more informative.
555554

@@ -560,7 +559,7 @@ use the `xlab` (short for x axis label) and `ylab` (short for y axis label) func
560559
to add layers where we specify meaningful
561560
and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
562561
words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
563-
`xlab` and `ylab`, we surround them with double-quotes. We can add many more
562+
`xlab` and `ylab`, we surround them with double quotation marks. We can add many more
564563
layers to format the plot further, and we will explore these in Chapter
565564
\@ref(viz).
566565

@@ -604,7 +603,7 @@ ggplot(ten_lang, aes(x = mother_tongue,
604603

605604
Figure \@ref(fig:barplot-mother-tongue-reorder) provides a very clear and well-organized
606605
answer to our original question; we can see what the ten most often reported Aboriginal languages
607-
were, according to the 2016 Candian census, and how many people speak each of them. For
606+
were, according to the 2016 Canadian census, and how many people speak each of them. For
608607
instance, we can see that the Aboriginal language most often reported was Cree
609608
n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
610609

@@ -661,7 +660,7 @@ these steps in much more detail!
661660

662661
There are many R functions in the `tidyverse` package (and beyond!), and
663662
nobody can be expected to remember what every one of them does
664-
nor all of the arguments we have to give them. Fortunately R provides
663+
or all of the arguments we have to give them. Fortunately, R provides
665664
the `?` symbol, which
666665
\index{aaaquestionmark@?|see{documentation}}
667666
\index{help|see{documentation}}
@@ -687,7 +686,10 @@ documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind
687686
is not written to *teach* you about a function; it is just there as a reference to *remind*
688687
you about the different arguments and usage of functions that you have already learned about elsewhere.
689688

690-
```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="80%"}
689+
(ref:01-help) The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.
690+
691+
692+
```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:01-help)", fig.retina = 2, out.width="80%"}
691693
knitr::include_graphics("img/help-filter.png")
692694
```
693695
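
The final hunk moves the figure caption out of the chunk header and into a bookdown text reference. Below is a minimal sketch of that pattern, reusing the label, caption, and image path from the hunk above, with the chunk options trimmed for brevity. The commit does not state the reason for the change; text references are commonly used when a caption contains Markdown formatting such as inline code, so that is an assumption here.

````markdown
(ref:01-help) The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.

```{r 01-help, echo = FALSE, fig.cap = "(ref:01-help)", out.width = "80%"}
knitr::include_graphics("img/help-filter.png")
```
````

For bookdown to pick up the `(ref:01-help)` definition, it has to sit in its own paragraph on a single unwrapped line, with a blank line before and after it.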
