intro.Rmd
39 additions & 37 deletions
@@ -20,15 +20,15 @@ with data science!
20
20
21
21
## Chapter learning objectives
22
22
23
-
By the end of the chapter, readers will be able to:
23
+
By the end of the chapter, readers will be able to do the following:
24
24
25
-
- identify the different types of data analysis question and categorize a question into the correct type
26
-
- load the `tidyverse` package into R
27
-
- read tabular data with `read_csv`
28
-
- use `?` to access help and documentation tools in R
29
-
- create new variables and objects in R using the assignment symbol
30
-
- create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`
31
-
- visualize data with a `ggplot` bar plot
25
+
- Identify the different types of data analysis question and categorize a question into the correct type.
26
+
- Load the `tidyverse` package into R.
27
+
- Read tabular data with `read_csv`.
28
+
- Use `?` to access help and documentation tools in R.
29
+
- Create new variables and objects in R using the assignment symbol.
30
+
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
31
+
- Visualize data with a `ggplot` bar plot.
32
32
33
33
## Canadian languages data set
34
34
@@ -42,11 +42,10 @@ individual learns in childhood) in Canadian residential schools. Colonizers
42
42
also renamed places they had "discovered" [@wilson2018]. Acts such as these
43
43
have significantly harmed the continuity of Indigenous languages in Canada, and
44
44
some languages are considered "endangered" as few people report speaking them.
45
-
To learn more, please see Canadian Geographic's article on
46
-
[Mapping Indigenous languages in Canada](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
47
-
[@walker2017], [They Came for the Children: Canada, Aboriginal peoples, and Residential Schools](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf)[@children2012]
48
-
and the Truth and Reconciliation Commission of
49
-
Canada's [Calls to Action](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf)[@calls2015].
45
+
To learn more, please see *Canadian Geographic*'s article, ["Mapping Indigenous Languages in Canada"](https://www.canadiangeographic.ca/article/mapping-indigenous-languages-canada)
46
+
[@walker2017], [*They Came for the Children: Canada, Aboriginal peoples, and Residential Schools*](http://publications.gc.ca/site/archivee-archived.html?url=http://publications.gc.ca/collections/collection_2012/cvrc-trcc/IR4-4-2012-eng.pdf)[@children2012]
47
+
and the *Truth and Reconciliation Commission of
48
+
Canada's* [*Calls to Action*](http://trc.ca/assets/pdf/Calls_to_Action_English2.pdf)[@calls2015].
50
49
51
50
The data set we will study in this chapter is taken from
52
51
[the {canlang} R data package](https://ttimbers.github.io/canlang/)[@timbers2020canlang], which has
@@ -99,7 +98,7 @@ Table: (\#tab:questions-table) Types of data analysis question [@leek2015questio
| Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
102
-
| Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
101
+
| Exploratory | A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
103
102
| Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
104
103
| Inferential | A question that looks for patterns, trends, or relationships in a single data set **and** also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
105
104
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
@@ -114,29 +113,29 @@ In particular, you will learn how to apply the following analysis tools:
114
113
Summarization is most often used to answer descriptive questions,
115
114
and can occasionally help with answering exploratory questions.
116
115
For example, you might use summarization to answer the following question:
117
-
*what is the average race time for runners in this data set?*
116
+
*What is the average race time for runners in this data set?*
118
117
Tools for summarization are covered in detail in Chapters \@ref(reading)
119
118
and \@ref(wrangling), but appear regularly throughout the text.
120
119
2. **Visualization:** \index{visualization!overview} plotting data graphically.
121
120
Visualization is typically used to answer descriptive and exploratory questions,
122
121
but plays a critical supporting role in answering all of the types of question in Table \@ref(tab:questions-table).
123
122
For example, you might use visualization to answer the following question:
124
-
*is there any relationship between race time and age for runners in this data set?*
123
+
*Is there any relationship between race time and age for runners in this data set?*
125
124
This is covered in detail in Chapter \@ref(viz), but again appears regularly throughout the book.
126
125
3. **Classification:** \index{classification!overview} predicting a class or category for a new observation.
127
126
Classification is used to answer predictive questions.
128
127
For example, you might use classification to answer the following question:
129
-
*given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
128
+
*Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
130
129
Classification is covered in Chapters \@ref(classification) and \@ref(classification2).
131
130
4. **Regression:** \index{regression!overview} predicting a quantitative value for a new observation.
132
131
Regression is also used to answer predictive questions.
133
132
For example, you might use regression to answer the following question:
134
-
*what will be the race time for a 20-year-old runner who weighs 50kg?*
133
+
*What will be the race time for a 20-year-old runner who weighs 50kg?*
135
134
Regression is covered in Chapters \@ref(regression1) and \@ref(regression2).
136
-
5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabelled subgroups in a
137
-
dataset. Clustering is often used to answer exploratory questions.
135
+
5. **Clustering:** \index{clustering!overview} finding previously unknown/unlabeled subgroups in a
136
+
data set. Clustering is often used to answer exploratory questions.
138
137
For example, you might use clustering to answer the following question:
139
-
*what products are commonly bought together on Amazon?*
138
+
*What products are commonly bought together on Amazon?*
140
139
Clustering is covered in Chapter \@ref(clustering).
141
140
6. **Estimation:** \index{estimation!overview} taking measurements for a small number of items from a large group
142
141
and making a good guess for the average or proportion for the large group. Estimation
@@ -231,8 +230,8 @@ library(tidyverse)
231
230
> line. These are examples of *messages* in R, which give the user more
232
231
> information that might be handy to know. The `Attaching packages` message is
233
232
> natural when loading `tidyverse`, since `tidyverse` actually automatically
234
-
> causes other packages to be imported too, such as `dplyr`. In the future
235
-
> when we load `tidyverse` in this book we will silence these messages to help
233
+
> causes other packages to be imported too, such as `dplyr`. In the future,
234
+
> when we load `tidyverse` in this book, we will silence these messages to help
236
235
> with the readability of the book. The `Conflicts` message is also totally normal
237
236
> in this circumstance. This message tells you if functions from different
238
237
> packages share the same name, which is confusing to R. For example, in this
@@ -301,8 +300,8 @@ Note that when
301
300
we name something in R using the assignment symbol, `<-`,
302
301
we do not need to surround the name we are creating with quotes. This is
303
302
because we are formally telling R that this special word denotes
304
-
the value of whatever is on the righthand side.
305
-
Only characters and words that act as *values* on the righthand side of the assignment
303
+
the value of whatever is on the right-hand side.
304
+
Only characters and words that act as *values* on the right-hand side of the assignment
306
305
symbol—e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above—need
307
306
to be surrounded by quotes.
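
The quoting rule described above can be sketched as follows. This is only an illustration: the object names `name` and `file_path` are assumptions chosen for the example, and no file is actually read.

```r
# The name on the left-hand side of `<-` is written without quotes;
# a character value on the right-hand side must be quoted.
name <- "Alice"

# Illustrative only: this stores the file name as a character value.
# It does not open or read the file.
file_path <- "data/can_lang.csv"

# Referring to an object by its unquoted name returns the stored value.
name
file_path
```

Leaving the quotes off the right-hand side (for example, `name <- Alice`) would instead make R look for an existing object called `Alice`.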
308
307
@@ -327,14 +326,14 @@ Error: unexpected assignment in "na+me <-"
327
326
```
328
327
329
328
There are certain conventions for naming objects in R. When naming \index{object!naming convention} an object we
330
-
suggest using only lower case letters, numbers and underscores `_` to separate
329
+
suggest using only lowercase letters, numbers and underscores `_` to separate
331
330
the words in a name. R is case sensitive, which means that `Letter` and
332
331
`letter` would be two different objects in R. You should also try to give your
333
332
objects meaningful names. For instance, you *can* name a data frame `x`.
334
333
However, using more meaningful terms, such as `language_data`, will help you
335
334
remember what each name in your code represents. We recommend following the
336
-
Tidyverse naming conventions outlined in the [Tidyverse Style
Figure \@ref(fig:barplot-mother-tongue-reorder) provides a very clear and well-organized
606
605
answer to our original question; we can see what the ten most often reported Aboriginal languages
607
-
were, according to the 2016 Candian census, and how many people speak each of them. For
606
+
were, according to the 2016 Canadian census, and how many people speak each of them. For
608
607
instance, we can see that the Aboriginal language most often reported was Cree
609
608
n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
610
609
@@ -661,7 +660,7 @@ these steps in much more detail!
661
660
662
661
There are many R functions in the `tidyverse` package (and beyond!), and
663
662
nobody can be expected to remember what every one of them does
664
-
nor all of the arguments we have to give them. Fortunately R provides
663
+
or all of the arguments we have to give them. Fortunately, R provides
665
664
the `?` symbol, which
666
665
\index{aaaquestionmark@?|see{documentation}}
667
666
\index{help|see{documentation}}
@@ -687,7 +686,10 @@ documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind
687
686
is not written to *teach* you about a function; it is just there as a reference to *remind*
688
687
you about the different arguments and usage of functions that you have already learned about elsewhere.
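
As a minimal sketch of looking up documentation with `?`, assuming the `tidyverse` package is installed (so that the `dplyr` version of `filter` is available):

```r
library(tidyverse)

# Ask for the documentation of a function by putting `?` before its name.
# If more than one loaded package documents `filter`, R may offer a list
# of matching help pages to choose from.
?filter

# Adding the package name requests the dplyr help page specifically.
?dplyr::filter

# help() is the long form of `?` from base R.
help("filter", package = "dplyr")
```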
689
688
690
-
```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="80%"}
689
+
(ref:01-help) The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.