UBC-DSCI
diff --git a/img/pivot_functions.key b/img/pivot_functions.key
-508 Bytes
diff --git a/img/pivot_functions/pivot_functions.002.jpeg b/img/pivot_functions/pivot_functions.002.jpeg
96 Bytes
diff --git a/intro.Rmd b/intro.Rmd
Lines changed: 4 additions & 4 deletions
diff --git a/reading.Rmd b/reading.Rmd
Lines changed: 36 additions & 26 deletions
diff --git a/viz.Rmd b/viz.Rmd
Lines changed: 22 additions & 18 deletions
@@ -25,10 +25,10 @@ By the end of the chapter, readers will be able to do the following:
 - Identify the different types of data analysis question and categorize a question into the correct type.
 - Load the `tidyverse` package into R.
 - Read tabular data with `read_csv`.
-- Use `?` to access help and documentation tools in R.
 - Create new variables and objects in R using the assignment symbol.
 - Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
 - Visualize data with a `ggplot` bar plot.
+- Use `?` to access help and documentation tools in R.
 
 ## Canadian languages data set
 
@@ -312,7 +312,7 @@ to be surrounded by quotes.
 After making the assignment, we can use the special name words we have created in
 place of their values. For example, if we want to do something with the value `3` later on, 
 we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
-R just interprets this as adding 2 and 3:
+R just interprets this as adding 3 and 2:
 ```{r naming-things2}
 my_number + 2
 ```
@@ -374,7 +374,7 @@ Aboriginal languages in the data set, and then use `select` to obtain only the
 columns we want to include in our table.
 
 ### Using `filter` to extract rows
-Looking at the `can_lang` data above, we see the column `category` contains different
+Looking at the `can_lang` data above, we see the `category` column contains different
 high-level categories of languages, which include "Aboriginal languages",
 "Non-Official & Non-Aboriginal languages" and "Official languages".  To answer
 our question we want to filter our data set so we restrict our attention 
@@ -528,7 +528,7 @@ image_read("img/ggplot_function.jpeg") |>
   image_crop("1625x1900")
 ```
 
-```{r barplot-mother-tongue, fig.width=5, fig.height=3, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
+```{r barplot-mother-tongue, fig.width=5, fig.height=3.1, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
 ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
   geom_bar(stat = "identity")
 ```
 
@@ -110,11 +110,12 @@ happy_data <- read_csv("data/happiness_report.csv")
 happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
 ```
 
-So which one should you use? Generally speaking, to ensure your code can be run 
-on a different computer, you should use relative paths. An added bonus is that 
-it's also less typing! Generally, you should use relative paths because the file's 
-absolute path (the names of 
-folders between the computer's root `/` and the file) isn't usually the same 
+So which one should you use? Generally speaking, you should use relative paths. 
+Using a relative path helps ensure that your code can be run 
+on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
+This is because a file's relative path is often the same across different computers, while a
+file's absolute path (the names of 
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same 
 across different computers. For example, suppose Fatima and Jayden are working on a 
 project together on the `happiness_report.csv` data. Fatima's file is stored at 
 
@@ -130,18 +131,17 @@ their different usernames.  If Jayden has code that loads the
 `happiness_report.csv` data using an absolute path, the code won't work on
 Fatima's computer.  But the relative path from inside the `project` folder
 (`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both!
-
-In the additional resources section, we include a link to a short video on the
+relative paths will work on both! In the additional resources section, 
+we include a link to a short video on the
 difference between absolute and relative paths. You can also check out the
 `here` package, which provides methods for finding and constructing file paths
 in R.  
 
-Your file could be stored locally, as we discussed, or it could also be
-somewhere on the internet (remotely). A *Uniform Resource Locator (URL)* (web
-address) \index{URL} indicates the location of a resource on the internet and
-helps us retrieve that resource. Next, we will discuss how to get either
-locally or remotely stored data into R. 
+Beyond files stored on your computer (i.e., locally), we also need a way to locate resources
+stored elsewhere on the internet (i.e., remotely). For this purpose we use a *Uniform Resource Locator (URL)*,
+i.e., a web address that looks something like https://datasciencebook.ca/. \index{URL}
+URLs indicate the location of a resource on the internet and
+help us retrieve that resource. 
 
 ## Reading tabular data from a plain text file into R
 
@@ -152,7 +152,7 @@ to import data into R using various functions. Specifically, we will learn how
 to *read* tabular data from a plain text file (a document containing only text)
 *into* R and *write* tabular data to a file *out of* R. The function we use to do this
 depends on the file's format. For example, in the last chapter, we learned about using
-the `tidyverse` `read_csv` function when reading .csv (**c**omma-**s**eparated **v**alues)
+the `tidyverse` `read_csv` function when reading `.csv` (**c**omma-**s**eparated **v**alues)
 files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
 comma (`,`). We only learned the case where the data matched the expected defaults 
 of the `read_csv` function \index{read function!read\_csv}
@@ -168,9 +168,7 @@ language data from the 2016 Canadian census. \index{Canadian languages!canlang d
 We put `data/` before the file's
 name when we are loading the data set because this data set is located in a
 sub-folder, named `data`, relative to where we are running our R code.
-
-Here is what the file would look like in a plain text editor (a program that removes
-all formatting, like bolding or different fonts):
+Here is what the text in the file `data/can_lang.csv` looks like.
 
 ```code
 category,language,mother_tongue,most_at_home,most_at_work,lang_known
@@ -209,6 +207,9 @@ canlang_data <- read_csv("data/can_lang.csv")
 > future when we use this and related functions to load data in this book, we will
 > silence these messages to help with the readability of the book.
 
+Finally, to view the first 10 rows of the data frame,
+we type its name and run it:
+
 ```{r view-data}
 canlang_data
 ```
@@ -300,7 +301,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465   12785   200 33670
 
 To read in this type of data, we can use the `read_tsv` 
 \index{tab-separated values|see{tsv}}\index{tsv}\index{read function!read\_tsv}
-to read in .tsv (**t**ab **s**eparated **v**alues) files. 
+to read in `.tsv` (**t**ab **s**eparated **v**alues) files. 
 
 ```{r 01-read-tab}
 canlang_data <- read_tsv("data/can_lang_tab.tsv")
@@ -348,7 +349,7 @@ specify that there are no column names to assign, and give it the value of
 > **Note:** `\t` is an example of an *escaped character*, 
 > which always starts with a backslash (`\`). \index{escape character}
 > Escaped characters are used to represent non-printing characters 
-> (like the tab) or characters with special meanings (such as quotation marks). 
+> (like the tab) or those with special meanings (such as quotation marks). 
 
 ```{r}
 canlang_data <- read_delim("data/can_lang.tsv", 
@@ -622,6 +623,7 @@ or `tail` to preview the last six rows of a data frame:
 ```{r, eval = FALSE}
 tail(aboriginal_lang_db)
 ```
+
 ```
 ## Error: tail() is not supported by sql sources
 ```
@@ -766,7 +768,7 @@ Opening a database \index{database!reasons to use} stored in a `.db` file
 involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the
 other plain text or Excel formats. It was a bit of a pain to use a database in
 that setting since we had to use `dbplyr` to translate `tidyverse`-like
-commands (`filter`, `select`, `head`, etc.) into SQL commands that the database
+commands (`filter`, `select`, etc.) into SQL commands that the database
 understands. Not all `tidyverse` commands can currently be translated with
 SQLite databases. For example, we can compute a mean with an SQLite database
 but can't easily compute a median. So you might be wondering: why should we use
@@ -1030,7 +1032,7 @@ td:nth-child(7),
 ```
 
 Now that we have the CSS selectors that describe the properties of the elements
-that we want to target (e.g., has a tag name `price`), we can use them to find
+that we want to target, we can use them to find
 certain elements in web pages and extract data. 
 
 **Using `rvest`**
@@ -1080,7 +1082,7 @@ as the second argument of `html_nodes`:
 selectors <- paste("td:nth-child(5)",
              "td:nth-child(7)",
              ".infobox:nth-child(122) td:nth-child(1)",
-             ".infobox td:nth-child(3)", sep=",")
+             ".infobox td:nth-child(3)", sep = ",")
 
 population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
@@ -1090,6 +1092,14 @@ head(population_nodes)
 print_html_nodes(head(population_nodes))
 ```
 
+> **Note:** `head` is a function that is often useful for viewing only a short
+> summary of an R object, rather than the whole thing (which may be quite a lot
+> to look at). For example, here `head` shows us only the first 6 items in the
+> `population_nodes` object. Note that some R objects by default print only a
+> small summary. For example, `tibble` data frames only show you the first 10 rows.
+> But not *all* R objects do this, and that's where the `head` function helps
+> summarize things for you.
+
 Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from 
 the nodes using the `html_text`
 function. In the case of the example
@@ -1137,7 +1147,7 @@ library(rtweet)
 This package provides an extensive set of functions to search 
 Twitter for tweets, users, their followers, and more. 
 Let's construct a small data set of the last 400 tweets and 
-retweets from the \@tidyverse account. A few of the most recent tweets
+retweets from the [\@tidyverse](https://twitter.com/tidyverse) account. A few of the most recent tweets
 are shown in Figure \@ref(fig:01-tidyverse-twitter).
 
 ```{r 01-tidyverse-twitter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The tidyverse account Twitter feed.", fig.retina = 2, out.width="100%"}
@@ -1162,7 +1172,7 @@ we should abide by when using the API.
 **Using `rtweet`**
 
 After checking the Twitter website, it seems like asking for 400 tweets one time is acceptable.
-So we can use the `get_timelines` function to ask for the last 400 tweets from the \@tidyverse account.
+So we can use the `get_timelines` function to ask for the last 400 tweets from the [\@tidyverse](https://twitter.com/tidyverse) account.
 
 ```r
 tidyverse_tweets <- get_timelines('tidyverse', n=400)
@@ -1197,7 +1207,7 @@ With the authentication setup out of the way, let's run the `get_timelines` func
 the API and take a look at what was returned:
 
 ```r
-tidyverse_tweets <- get_timelines('tidyverse', n=400)
+tidyverse_tweets <- get_timelines('tidyverse', n = 400)
 tidyverse_tweets
 ```
 
@@ -1221,7 +1231,7 @@ tidyverse_tweets <- select(tidyverse_tweets,
 tidyverse_tweets
 ```
 
-If you look back up at the image of the \@tidyverse Twitter page, you will
+If you look back up at the image of the [\@tidyverse](https://twitter.com/tidyverse) Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame.  In
 other words, we have successfully created a small data set using the Twitter
 API&mdash;neat! This data is also quite different from what we obtained from web scraping;
 
@@ -143,7 +143,7 @@ alternative.
 ## Refining the visualization
 #### *Convey the message, minimize noise* {-}
 
-Just being able to make a visualization in R with `ggplot2` (or any other tool
+Just being able to make a visualization in R (or any other language,
 for that matter) doesn't mean that it effectively communicates your message to
 others. Once you have selected a broad type of visualization to use, you will
 have to refine it to suit your particular need.  Some rules of thumb for doing
@@ -185,8 +185,11 @@ understand and remember your message quickly.
 ## Creating visualizations with `ggplot2` 
 #### *Build the visualization iteratively* {-}
 
-This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, 
-and then how to create the visualization in R \index{ggplot} using `ggplot2`.  To use the `ggplot2` package, we need to load the `tidyverse` metapackage.
+This section will cover examples of how to choose and refine a visualization
+given a data set and a question that you want to answer, and then how to create
+the visualization in R \index{ggplot} using the `ggplot2` R package. Given that
+the `ggplot2` package is loaded by the `tidyverse` metapackage, we 
+need to load only `tidyverse`:
 
 ```{r 03-tidyverse, warning=FALSE, message=FALSE}
 library(tidyverse)
@@ -479,7 +482,8 @@ labels and make the font more readable:
 ```{r 03-data-faithful-scatter-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center",  fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time with clearer axes and labels."}
 faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
   geom_point() +
-  labs(x = "Waiting Time (mins)", y = "Eruption Duration (mins)") +
+  xlab("Waiting Time (mins)") + 
+  ylab("Eruption Duration (mins)") +
   theme(text = element_text(size = 12))
 
 faithful_scatter
@@ -529,8 +533,8 @@ improve readability.
 ```{r 03-mother-tongue-vs-most-at-home-labs, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels."}
 ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (number of Canadian residents)",
-       y = "Mother tongue \n (number of Canadian residents)") +
+  xlab("Language spoken most at home \n (number of Canadian residents)") +
+  ylab("Mother tongue \n (number of Canadian residents)") +
   theme(text = element_text(size = 12))
 ```
 
@@ -596,8 +600,8 @@ library(scales)
 
 ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (number of Canadian residents)",
-       y = "Mother tongue \n (number of Canadian residents)") +
+  xlab("Language spoken most at home \n (number of Canadian residents)") +
+  ylab("Mother tongue \n (number of Canadian residents)") +
   theme(text = element_text(size = 12)) +
   scale_x_log10(labels = label_comma()) +
   scale_y_log10(labels = label_comma())
@@ -651,8 +655,8 @@ the final result.
 ```{r 03-mother-tongue-vs-most-at-home-scale-props, fig.height=3.5,  fig.width=3.75, fig.align = "center",  warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home."}
 ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
-       y = "Mother tongue \n (percentage of Canadian residents)") +
+  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
+  ylab("Mother tongue \n (percentage of Canadian residents)") +
   theme(text = element_text(size = 12)) +
   scale_x_log10(labels = comma) +
   scale_y_log10(labels = comma)
@@ -710,8 +714,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
                      y = mother_tongue_percent, 
                      color = category)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
-       y = "Mother tongue \n (percentage of Canadian residents)") +
+  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
+  ylab("Mother tongue \n (percentage of Canadian residents)") +
   theme(text = element_text(size = 12)) +
   scale_x_log10(labels = comma) +
   scale_y_log10(labels = comma)
@@ -736,8 +740,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
                      y = mother_tongue_percent, 
                      color = category)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
-       y = "Mother tongue \n (percentage of Canadian residents)") +
+  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
+  ylab("Mother tongue \n (percentage of Canadian residents)") +
   theme(text = element_text(size = 12),
         legend.position = "top",
         legend.direction = "vertical") +
@@ -783,8 +787,8 @@ ggplot(can_lang, aes(x = most_at_home_percent,
                      color = category, 
                      shape = category)) +
   geom_point() +
-  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
-       y = "Mother tongue \n (percentage of Canadian residents)") +
+  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
+  ylab("Mother tongue \n (percentage of Canadian residents)") +
   theme(text = element_text(size = 12),
         legend.position = "top",
         legend.direction = "vertical") +
@@ -1087,7 +1091,7 @@ instead of stacked bars
 (which is the default for bar plots or histograms 
 when they are colored by another categorical variable).
 
-```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE,  fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data colored by experiment."}
+```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE,  fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data where an attempt is made to color the bars by experiment."}
 morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) +
   geom_histogram(alpha = 0.5, position = "identity") +
   geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
@@ -1500,7 +1504,7 @@ JPG format is twice as large as the PNG format since the JPG compression
 algorithm is designed for natural images (not plots). 
 
 In Figure \@ref(fig:03-raster-image), we also show what
-the images look like when we zoom in to a rectangle with only 3 data points.
+the images look like when we zoom in to a rectangle with only 2 data points.
 You can see why vector graphics formats are so useful: because they're just
 based on mathematical formulas, vector graphics can be scaled up to arbitrary
 sizes.  This makes them great for presentation media of all sizes, from papers