Commit d910de9

Merge pull request #435 from UBC-DSCI/ch1-4
Ch1 4
2 parents b12e5de + 1a9bbcf commit d910de9

6 files changed

+81
-73
lines changed

img/pivot_functions.key

-508 Bytes
Binary file not shown.
96 Bytes

intro.Rmd

Lines changed: 4 additions & 4 deletions
@@ -25,10 +25,10 @@ By the end of the chapter, readers will be able to do the following:
2525
- Identify the different types of data analysis question and categorize a question into the correct type.
2626
- Load the `tidyverse` package into R.
2727
- Read tabular data with `read_csv`.
28-
- Use `?` to access help and documentation tools in R.
2928
- Create new variables and objects in R using the assignment symbol.
3029
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
3130
- Visualize data with a `ggplot` bar plot.
31+
- Use `?` to access help and documentation tools in R.
3232

3333
## Canadian languages data set
3434

@@ -312,7 +312,7 @@ to be surrounded by quotes.
312312
After making the assignment, we can use the special name words we have created in
313313
place of their values. For example, if we want to do something with the value `3` later on,
314314
we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
315-
R just interprets this as adding 2 and 3:
315+
R just interprets this as adding 3 and 2:
316316
```{r naming-things2}
317317
my_number + 2
318318
```
@@ -374,7 +374,7 @@ Aboriginal languages in the data set, and then use `select` to obtain only the
374374
columns we want to include in our table.
375375

376376
### Using `filter` to extract rows
377-
Looking at the `can_lang` data above, we see the column `category` contains different
377+
Looking at the `can_lang` data above, we see the `category` column contains different
378378
high-level categories of languages, which include "Aboriginal languages",
379379
"Non-Official & Non-Aboriginal languages" and "Official languages". To answer
380380
our question we want to filter our data set so we restrict our attention
@@ -528,7 +528,7 @@ image_read("img/ggplot_function.jpeg") |>
528528
image_crop("1625x1900")
529529
```
530530

531-
```{r barplot-mother-tongue, fig.width=5, fig.height=3, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
531+
```{r barplot-mother-tongue, fig.width=5, fig.height=3.1, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
532532
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
533533
geom_bar(stat = "identity")
534534
```

reading.Rmd

Lines changed: 36 additions & 26 deletions
@@ -110,11 +110,12 @@ happy_data <- read_csv("data/happiness_report.csv")
110110
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
111111
```
112112

113-
So which one should you use? Generally speaking, to ensure your code can be run
114-
on a different computer, you should use relative paths. An added bonus is that
115-
it's also less typing! Generally, you should use relative paths because the file's
116-
absolute path (the names of
117-
folders between the computer's root `/` and the file) isn't usually the same
113+
So which one should you use? Generally speaking, you should use relative paths.
114+
Using a relative path helps ensure that your code can be run
115+
on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
116+
This is because a file's relative path is often the same across different computers, while a
117+
file's absolute path (the names of
118+
all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
118119
across different computers. For example, suppose Fatima and Jayden are working on a
119120
project together on the `happiness_report.csv` data. Fatima's file is stored at
120121

@@ -130,18 +131,17 @@ their different usernames. If Jayden has code that loads the
130131
`happiness_report.csv` data using an absolute path, the code won't work on
131132
Fatima's computer. But the relative path from inside the `project` folder
132133
(`data/happiness_report.csv`) is the same on both computers; any code that uses
133-
relative paths will work on both!
134-
135-
In the additional resources section, we include a link to a short video on the
134+
relative paths will work on both! In the additional resources section,
135+
we include a link to a short video on the
136136
difference between absolute and relative paths. You can also check out the
137137
`here` package, which provides methods for finding and constructing file paths
138138
in R.
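
To make the distinction between the two kinds of path concrete, here is a minimal sketch in base R. It assumes Unix-style paths (where the root is `/`); the helper `starts_at_root` is a hypothetical name introduced only for this illustration and is not part of the book's code.

```r
# Build the relative path from its parts; file.path() joins them with "/".
relative_path <- file.path("data", "happiness_report.csv")
print(relative_path)

# The absolute path used in the example above (specific to one computer).
absolute_path <- "/home/dsci-100/worksheet_02/data/happiness_report.csv"

# Hypothetical helper: on Unix-style systems an absolute path starts at
# the root "/", whereas a relative path does not.
starts_at_root <- function(path) startsWith(path, "/")

print(starts_at_root(relative_path))
print(starts_at_root(absolute_path))
```

Only the relative form is independent of where the `project` folder happens to live, which is why it carries over between Fatima's and Jayden's computers.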
139139

140-
Your file could be stored locally, as we discussed, or it could also be
141-
somewhere on the internet (remotely). A *Uniform Resource Locator (URL)* (web
142-
address) \index{URL} indicates the location of a resource on the internet and
143-
helps us retrieve that resource. Next, we will discuss how to get either
144-
locally or remotely stored data into R.
140+
Beyond files stored on your computer (i.e., locally), we also need a way to locate resources
141+
stored elsewhere on the internet (i.e., remotely). For this purpose we use a *Uniform Resource Locator (URL)*,
142+
i.e., a web address that looks something like https://datasciencebook.ca/. \index{URL}
143+
URLs indicate the location of a resource on the internet and
144+
help us retrieve that resource.
145145

146146
## Reading tabular data from a plain text file into R
147147

@@ -152,7 +152,7 @@ to import data into R using various functions. Specifically, we will learn how
152152
to *read* tabular data from a plain text file (a document containing only text)
153153
*into* R and *write* tabular data to a file *out of* R. The function we use to do this
154154
depends on the file's format. For example, in the last chapter, we learned about using
155-
the `tidyverse` `read_csv` function when reading .csv (**c**omma-**s**eparated **v**alues)
155+
the `tidyverse` `read_csv` function when reading `.csv` (**c**omma-**s**eparated **v**alues)
156156
files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
157157
comma (`,`). We only learned the case where the data matched the expected defaults
158158
of the `read_csv` function \index{read function!read\_csv}
@@ -168,9 +168,7 @@ language data from the 2016 Canadian census. \index{Canadian languages!canlang d
168168
We put `data/` before the file's
169169
name when we are loading the data set because this data set is located in a
170170
sub-folder, named `data`, relative to where we are running our R code.
171-
172-
Here is what the file would look like in a plain text editor (a program that removes
173-
all formatting, like bolding or different fonts):
171+
Here is what the text in the file `data/can_lang.csv` looks like.
174172

175173
```code
176174
category,language,mother_tongue,most_at_home,most_at_work,lang_known
@@ -209,6 +207,9 @@ canlang_data <- read_csv("data/can_lang.csv")
209207
> future when we use this and related functions to load data in this book, we will
210208
> silence these messages to help with the readability of the book.
211209
210+
Finally, to view the first 10 rows of the data frame,
211+
we must call it:
212+
212213
```{r view-data}
213214
canlang_data
214215
```
@@ -300,7 +301,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
300301

301302
To read in this type of data, we can use the `read_tsv`
302303
\index{tab-separated values|see{tsv}}\index{tsv}\index{read function!read\_tsv}
303-
to read in .tsv (**t**ab **s**eparated **v**alues) files.
304+
to read in `.tsv` (**t**ab **s**eparated **v**alues) files.
304305

305306
```{r 01-read-tab}
306307
canlang_data <- read_tsv("data/can_lang_tab.tsv")
@@ -348,7 +349,7 @@ specify that there are no column names to assign, and give it the value of
348349
> **Note:** `\t` is an example of an *escaped character*,
349350
> which always starts with a backslash (`\`). \index{escape character}
350351
> Escaped characters are used to represent non-printing characters
351-
> (like the tab) or characters with special meanings (such as quotation marks).
352+
> (like the tab) or those with special meanings (such as quotation marks).
352353
353354
```{r}
354355
canlang_data <- read_delim("data/can_lang.tsv",
@@ -622,6 +623,7 @@ or `tail` to preview the last six rows of a data frame:
622623
```{r, eval = FALSE}
623624
tail(aboriginal_lang_db)
624625
```
626+
625627
```
626628
## Error: tail() is not supported by sql sources
627629
```
@@ -766,7 +768,7 @@ Opening a database \index{database!reasons to use} stored in a `.db` file
766768
involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the
767769
other plain text or Excel formats. It was a bit of a pain to use a database in
768770
that setting since we had to use `dbplyr` to translate `tidyverse`-like
769-
commands (`filter`, `select`, `head`, etc.) into SQL commands that the database
771+
commands (`filter`, `select`, etc.) into SQL commands that the database
770772
understands. Not all `tidyverse` commands can currently be translated with
771773
SQLite databases. For example, we can compute a mean with an SQLite database
772774
but can't easily compute a median. So you might be wondering: why should we use
@@ -1030,7 +1032,7 @@ td:nth-child(7),
10301032
```
10311033
10321034
Now that we have the CSS selectors that describe the properties of the elements
1033-
that we want to target (e.g., has a tag name `price`), we can use them to find
1035+
that we want to target, we can use them to find
10341036
certain elements in web pages and extract data.
10351037
10361038
**Using `rvest`**
@@ -1080,7 +1082,7 @@ as the second argument of `html_nodes`:
10801082
selectors <- paste("td:nth-child(5)",
10811083
"td:nth-child(7)",
10821084
".infobox:nth-child(122) td:nth-child(1)",
1083-
".infobox td:nth-child(3)", sep=",")
1085+
".infobox td:nth-child(3)", sep = ",")
10841086
10851087
population_nodes <- html_nodes(page, selectors)
10861088
head(population_nodes)
@@ -1090,6 +1092,14 @@ head(population_nodes)
10901092
print_html_nodes(head(population_nodes))
10911093
```
10921094
1095+
> **Note:** `head` is a function that is often useful for viewing only a short
1096+
> summary of an R object, rather than the whole thing (which may be quite a lot
1097+
> to look at). For example, here `head` shows us only the first 6 items in the
1098+
> `population_nodes` object. Note that some R objects by default print only a
1099+
> small summary. For example, `tibble` data frames only show you the first 10 rows.
1100+
> But not *all* R objects do this, and that's where the `head` function helps
1101+
> summarize things for you.
1102+
10931103
Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
10941104
the nodes using the `html_text`
10951105
function. In the case of the example
@@ -1137,7 +1147,7 @@ library(rtweet)
11371147
This package provides an extensive set of functions to search
11381148
Twitter for tweets, users, their followers, and more.
11391149
Let's construct a small data set of the last 400 tweets and
1140-
retweets from the \@tidyverse account. A few of the most recent tweets
1150+
retweets from the [\@tidyverse](https://twitter.com/tidyverse) account. A few of the most recent tweets
11411151
are shown in Figure \@ref(fig:01-tidyverse-twitter).
11421152
11431153
```{r 01-tidyverse-twitter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The tidyverse account Twitter feed.", fig.retina = 2, out.width="100%"}
@@ -1162,7 +1172,7 @@ we should abide by when using the API.
11621172
**Using `rtweet`**
11631173
11641174
After checking the Twitter website, it seems like asking for 400 tweets one time is acceptable.
1165-
So we can use the `get_timelines` function to ask for the last 400 tweets from the \@tidyverse account.
1175+
So we can use the `get_timelines` function to ask for the last 400 tweets from the [\@tidyverse](https://twitter.com/tidyverse) account.
11661176
11671177
```r
11681178
tidyverse_tweets <- get_timelines('tidyverse', n=400)
@@ -1197,7 +1207,7 @@ With the authentication setup out of the way, let's run the `get_timelines` func
11971207
the API and take a look at what was returned:
11981208
11991209
```r
1200-
tidyverse_tweets <- get_timelines('tidyverse', n=400)
1210+
tidyverse_tweets <- get_timelines('tidyverse', n = 400)
12011211
tidyverse_tweets
12021212
```
12031213
@@ -1221,7 +1231,7 @@ tidyverse_tweets <- select(tidyverse_tweets,
12211231
tidyverse_tweets
12221232
```
12231233
1224-
If you look back up at the image of the \@tidyverse Twitter page, you will
1234+
If you look back up at the image of the [\@tidyverse](https://twitter.com/tidyverse) Twitter page, you will
12251235
recognize the text of the most recent few tweets in the above data frame. In
12261236
other words, we have successfully created a small data set using the Twitter
12271237
API&mdash;neat! This data is also quite different from what we obtained from web scraping;

viz.Rmd

Lines changed: 22 additions & 18 deletions
@@ -143,7 +143,7 @@ alternative.
143143
## Refining the visualization
144144
#### *Convey the message, minimize noise* {-}
145145

146-
Just being able to make a visualization in R with `ggplot2` (or any other tool
146+
Just being able to make a visualization in R (or any other language,
147147
for that matter) doesn't mean that it effectively communicates your message to
148148
others. Once you have selected a broad type of visualization to use, you will
149149
have to refine it to suit your particular need. Some rules of thumb for doing
@@ -185,8 +185,11 @@ understand and remember your message quickly.
185185
## Creating visualizations with `ggplot2`
186186
#### *Build the visualization iteratively* {-}
187187

188-
This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer,
189-
and then how to create the visualization in R \index{ggplot} using `ggplot2`. To use the `ggplot2` package, we need to load the `tidyverse` metapackage.
188+
This section will cover examples of how to choose and refine a visualization
189+
given a data set and a question that you want to answer, and then how to create
190+
the visualization in R \index{ggplot} using the `ggplot2` R package. Given that
191+
the `ggplot2` package is loaded by the `tidyverse` metapackage, we
192+
need to load only `tidyverse`:
190193

191194
```{r 03-tidyverse, warning=FALSE, message=FALSE}
192195
library(tidyverse)
@@ -479,7 +482,8 @@ labels and make the font more readable:
479482
```{r 03-data-faithful-scatter-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time with clearer axes and labels."}
480483
faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
481484
geom_point() +
482-
labs(x = "Waiting Time (mins)", y = "Eruption Duration (mins)") +
485+
xlab("Waiting Time (mins)") +
486+
ylab("Eruption Duration (mins)") +
483487
theme(text = element_text(size = 12))
484488
485489
faithful_scatter
@@ -529,8 +533,8 @@ improve readability.
529533
```{r 03-mother-tongue-vs-most-at-home-labs, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels."}
530534
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
531535
geom_point() +
532-
labs(x = "Language spoken most at home \n (number of Canadian residents)",
533-
y = "Mother tongue \n (number of Canadian residents)") +
536+
xlab("Language spoken most at home \n (number of Canadian residents)") +
537+
ylab("Mother tongue \n (number of Canadian residents)") +
534538
theme(text = element_text(size = 12))
535539
```
536540

@@ -596,8 +600,8 @@ library(scales)
596600
597601
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
598602
geom_point() +
599-
labs(x = "Language spoken most at home \n (number of Canadian residents)",
600-
y = "Mother tongue \n (number of Canadian residents)") +
603+
xlab("Language spoken most at home \n (number of Canadian residents)") +
604+
ylab("Mother tongue \n (number of Canadian residents)") +
601605
theme(text = element_text(size = 12)) +
602606
scale_x_log10(labels = label_comma()) +
603607
scale_y_log10(labels = label_comma())
@@ -651,8 +655,8 @@ the final result.
651655
```{r 03-mother-tongue-vs-most-at-home-scale-props, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home."}
652656
ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) +
653657
geom_point() +
654-
labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
655-
y = "Mother tongue \n (percentage of Canadian residents)") +
658+
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
659+
ylab("Mother tongue \n (percentage of Canadian residents)") +
656660
theme(text = element_text(size = 12)) +
657661
scale_x_log10(labels = comma) +
658662
scale_y_log10(labels = comma)
@@ -710,8 +714,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
710714
y = mother_tongue_percent,
711715
color = category)) +
712716
geom_point() +
713-
labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
714-
y = "Mother tongue \n (percentage of Canadian residents)") +
717+
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
718+
ylab("Mother tongue \n (percentage of Canadian residents)") +
715719
theme(text = element_text(size = 12)) +
716720
scale_x_log10(labels = comma) +
717721
scale_y_log10(labels = comma)
@@ -736,8 +740,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
736740
y = mother_tongue_percent,
737741
color = category)) +
738742
geom_point() +
739-
labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
740-
y = "Mother tongue \n (percentage of Canadian residents)") +
743+
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
744+
ylab("Mother tongue \n (percentage of Canadian residents)") +
741745
theme(text = element_text(size = 12),
742746
legend.position = "top",
743747
legend.direction = "vertical") +
@@ -783,8 +787,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
783787
color = category,
784788
shape = category)) +
785789
geom_point() +
786-
labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
787-
y = "Mother tongue \n (percentage of Canadian residents)") +
790+
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
791+
ylab("Mother tongue \n (percentage of Canadian residents)") +
788792
theme(text = element_text(size = 12),
789793
legend.position = "top",
790794
legend.direction = "vertical") +
@@ -1087,7 +1091,7 @@ instead of stacked bars
10871091
(which is the default for bar plots or histograms
10881092
when they are colored by another categorical variable).
10891093

1090-
```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE, fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data colored by experiment."}
1094+
```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE, fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data where an attempt is made to color the bars by experiment."}
10911095
morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) +
10921096
geom_histogram(alpha = 0.5, position = "identity") +
10931097
geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
@@ -1500,7 +1504,7 @@ JPG format is twice as large as the PNG format since the JPG compression
15001504
algorithm is designed for natural images (not plots).
15011505

15021506
In Figure \@ref(fig:03-raster-image), we also show what
1503-
the images look like when we zoom in to a rectangle with only 3 data points.
1507+
the images look like when we zoom in to a rectangle with only 2 data points.
15041508
You can see why vector graphics formats are so useful: because they're just
15051509
based on mathematical formulas, vector graphics can be scaled up to arbitrary
15061510
sizes. This makes them great for presentation media of all sizes, from papers
